Metal: Blazing Fast Image Processing

Metal allows developers to perform graphics rendering and parallel computing directly on the GPU. Before Metal, OpenGL ES was the only framework available on iOS for graphics rendering, while parallel computing on the GPU was available only on OS X, through OpenCL. Recently, Apple brought Metal to OS X as well, and extended it with new features and two new companion frameworks: MetalKit and Metal Performance Shaders.

In this post, I am going to show you how to use Metal to perform parallel processing directly on the GPU. As an example, I am going to process an image using a custom filter. This is something you can also do with Core Image. Actually, I recommend that you use Core Image instead of Metal, for several reasons. First, Core Image is simpler to set up. Additionally, Core Image lets you build your own custom filters. Finally, starting with iOS 9, Core Image was fully rewritten to take advantage of Metal. However, since you can use Metal to perform any kind of processing (for example, audio processing), I want to show you how to set up and create Metal-based projects.

Metal is only available on the device, so the following examples and source code cannot be compiled for the iOS simulator.

Metal vs OpenGL/OpenCL

The main advantage of Metal over OpenGL and OpenCL is that Apple built it from the ground up around its own hardware. While OpenGL and OpenCL are multi-platform, Metal works only on iOS, OS X and tvOS devices. Writing Metal from scratch allowed Apple to perform massive optimizations that are almost impossible with OpenGL and OpenCL.

Before we start, here is a small schema showing the components we are going to put together and how they are organized.


Device and GPU

With Metal, you perform processing and rendering operations directly on the GPU. In Metal, the GPU is represented by an object that conforms to the MTLDevice protocol. As you will see in the rest of this post, the framework exposes almost all of its functionality through protocols rather than concrete classes.

In iOS and OS X, you obtain the Metal device using the MTLCreateSystemDefaultDevice() function. In OS X, you can also select the Metal device from the list of devices returned by the MTLCopyAllDevices() function.

If you are using Swift, MTLCreateSystemDefaultDevice() returns an optional, so make sure you unwrap it safely.
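
As a minimal sketch of that unwrapping (the helper name `makeDefaultDevice()` is hypothetical and only serves the illustration), you could guard the call like this:

```swift
import Metal

/// Returns the system default Metal device, or nil when Metal is unavailable.
/// `makeDefaultDevice` is a hypothetical helper name, not part of the framework.
func makeDefaultDevice() -> MTLDevice? {
    guard let device = MTLCreateSystemDefaultDevice() else {
        // No Metal-capable GPU: report it and let the caller fall back.
        print("Metal is not supported on this device")
        return nil
    }
    print("Using GPU: \(device.name)")
    return device
}
```

Returning an optional lets the caller decide what to do when no GPU is available, instead of crashing on a force unwrap.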

Once you have the Metal device, you can use it to create different kinds of non-transient objects: command queues, textures and data buffers, sampler states, compute states, function libraries, and pipeline states. I am going to describe many of these objects in this post.

Since it can be expensive to create these non-transient objects, you should create them as early as possible and reuse them throughout the lifetime of your app. Avoid creating these objects in performance-sensitive code.

Other types of Metal objects (command buffers and command encoders) are instead lightweight objects, meant to be created when needed and consumed immediately.

Command Queue

After creating a Metal device, you need to create a command queue. A command queue is a queue of command buffers and organizes the order of execution of these command buffers. A command queue is a non-transient object. You create a single command queue per device and keep it during the lifetime of your app.

To create a command queue, call one of the following methods on a MTLDevice object:

For example,
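
A minimal sketch of both options, written against the current SDK, where `makeCommandQueue()` and `makeCommandQueue(maxCommandBufferCount:)` return optionals. The wrapper function `makeQueue(on:maxCommandBufferCount:)` is a hypothetical name used only for this illustration:

```swift
import Metal

/// Creates the long-lived command queue for a device.
/// Pass `maxCommandBufferCount` to cap the number of uncompleted command
/// buffers; pass nil to use the framework default.
/// (`makeQueue` is a hypothetical helper name.)
func makeQueue(on device: MTLDevice, maxCommandBufferCount: Int? = nil) -> MTLCommandQueue? {
    if let count = maxCommandBufferCount {
        return device.makeCommandQueue(maxCommandBufferCount: count)
    }
    return device.makeCommandQueue()
}
```

You would typically call this once, right after obtaining the device, and keep the returned queue in a property.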

These methods create an object that conforms to the MTLCommandQueue protocol. In general, command queues are thread-safe and allow multiple outstanding command buffers to be encoded simultaneously.

Command Buffers

Command buffer objects are created by the MTLCommandQueue object and can only be committed for execution on the command queue that created them. All command buffers sent to a single command queue are guaranteed to execute in the order in which they were enqueued.

You use the makeCommandBuffer() method on the command queue object to create a command buffer that holds a strong reference to any objects that are needed to finish executing the commands encoded in the command buffer:
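
The following sketch assumes the current SDK, where `makeCommandBuffer()` returns an optional. The function name `submitEmptyCommandBuffer(on:)` is hypothetical, and the buffer is committed without any encoded work just to show the lifecycle:

```swift
import Metal

/// Creates a command buffer, registers a completion handler, and commits it.
/// No commands are encoded here; a real app would encode work before commit().
/// (`submitEmptyCommandBuffer` is a hypothetical helper name.)
func submitEmptyCommandBuffer(on queue: MTLCommandQueue) -> MTLCommandBuffer? {
    guard let commandBuffer = queue.makeCommandBuffer() else {
        return nil
    }
    // Handlers must be added before commit(); keep them short.
    commandBuffer.addCompletedHandler { buffer in
        if let error = buffer.error {
            print("Command buffer failed: \(error)")
        }
    }
    commandBuffer.commit()
    return commandBuffer
}
```

After `commit()`, the caller can only wait on the returned buffer or inspect its `status`.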

Command buffers are lightweight, single-use objects that store encoded commands that are eventually committed for execution by the GPU. You create command buffers when needed. After a command buffer has been committed for execution, the only valid operations on the command buffer are to wait for it to be scheduled or completed (using synchronous calls or handler blocks) and to check the status of the command buffer execution. When used, scheduled and completed handlers are blocks that are invoked in execution order. These handlers should perform quickly; if expensive or blocking work needs to be scheduled, defer that work to another thread.

In a multithreaded app, it’s advisable to break your overall task into subtasks that can be encoded separately. You create a command buffer for each chunk of work, then you call the enqueue() method on these command buffer objects to establish the order of execution.

Command Encoders

You use the command buffer to create a command encoder object and fill the buffer with commands. The MTLCommandEncoder protocol defines the common interface for objects that can write commands into a command buffer.

In Metal, there are many different kinds of command encoders, each providing a different set of commands that can be encoded into the buffer:

  • Compute Command Encoder: an object conforming to the MTLComputeCommandEncoder protocol and used to encode computational tasks.
  • Render Command Encoder: an object conforming to the MTLRenderCommandEncoder protocol and used to encode graphics rendering tasks.
  • Blit Command Encoder: an object conforming to the MTLBlitCommandEncoder protocol and used for memory management tasks.
  • Parallel Render Command Encoder: an object conforming to the MTLParallelRenderCommandEncoder protocol and used for multiple graphics rendering tasks encoded in parallel.

In this post, we will focus only on the compute command encoders. In general, you can create different types of command encoders for each command buffer. In our case, we can create a compute command encoder from the command buffer:

Once you have the command encoder, you can use it to encode the following data-parallel compute processing commands:

  1. setComputePipelineState(_:) passing it the MTLComputePipelineState object (see Compute Pipeline State) that contains the compute function that will be executed.
  2. setTexture(_:at:) and setBuffer(_:offset:at:) to specify the resources that hold the input data (or output destination) for the compute function. The index represents the location of each resource in the corresponding argument table. See Resources and Data below.
  3. dispatchThreadgroups(_:threadsPerThreadgroup:) to encode the compute function with a specified number of thread groups for the grid (see Thread Groups below) and the number of threads per thread group.
  4. Finally, endEncoding() to finish encoding the compute commands onto the command buffer.

The following source code shows an example of how to use a compute command encoder:
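
This is a minimal sketch of the four steps, not the project's actual listing. It uses the current SDK spelling `setTexture(_:index:)` (the Swift 3 name was `setTexture(_:at:)`), and the function name `encodeFilter` and the texture indices 0 and 1 are assumptions made for the illustration:

```swift
import Metal

/// Encodes one compute pass that reads `input` and writes `output`.
/// Texture indices 0 and 1 are assumptions and must match the kernel's
/// [[texture(n)]] attributes. (`encodeFilter` is a hypothetical name.)
func encodeFilter(into commandBuffer: MTLCommandBuffer,
                  pipelineState: MTLComputePipelineState,
                  input: MTLTexture,
                  output: MTLTexture,
                  threadgroups: MTLSize,
                  threadsPerThreadgroup: MTLSize) -> Bool {
    guard let encoder = commandBuffer.makeComputeCommandEncoder() else {
        return false
    }
    // 1. Select the compiled compute function.
    encoder.setComputePipelineState(pipelineState)
    // 2. Bind the resources to entries of the texture argument table.
    encoder.setTexture(input, index: 0)
    encoder.setTexture(output, index: 1)
    // 3. Describe the grid: how many thread groups, and how large each one is.
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
    // 4. Close the encoder so that the command buffer can be committed.
    encoder.endEncoding()
    return true
}
```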

To fully understand this chunk of code we need to introduce additional concepts:

  • Compute Pipeline States
  • Libraries and Functions
  • Threadgroups
  • Resources and data

Let’s take a look at each of them. I know, there are a lot of objects. However, you write all this boilerplate code only once. You will spend most of your time writing Metal functions (or shaders).

Compute Pipeline State

The MTLComputePipelineState protocol defines the interface for a lightweight object used to encode a reference to a compiled compute program. A compiled compute program is a set of C++ functions that we use to process buffers and textures.

A MTLComputePipelineState object is fully thread-safe and can be used by many MTLComputeCommandEncoder objects, even if they are associated with different command buffers.

Because creating a compute pipeline state object can be expensive, you usually create these pipeline state objects during the initialization of your app and reuse them throughout its lifetime.

You create a compute pipeline state using the Metal device, passing it either a Metal function (an object conforming to the MTLFunction protocol) or a Metal compute pipeline descriptor (an object of type MTLComputePipelineDescriptor). The compute pipeline state can be created either synchronously or asynchronously.

In the MTLDevice protocol you find these methods supporting all these options:
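
For instance, the two function-based variants, `makeComputePipelineState(function:)` and `makeComputePipelineState(function:completionHandler:)`, could be used as in the sketch below. The wrapper names `makePipeline` and `makePipelineAsync` are hypothetical:

```swift
import Metal

/// Synchronous creation: blocks the caller until the pipeline is compiled.
/// (`makePipeline` is a hypothetical helper name.)
func makePipeline(device: MTLDevice, function: MTLFunction) -> MTLComputePipelineState? {
    do {
        return try device.makeComputePipelineState(function: function)
    } catch {
        print("Pipeline creation failed: \(error)")
        return nil
    }
}

/// Asynchronous creation: returns immediately and reports the result later.
/// (`makePipelineAsync` is a hypothetical helper name.)
func makePipelineAsync(device: MTLDevice,
                       function: MTLFunction,
                       completion: @escaping (MTLComputePipelineState?) -> Void) {
    device.makeComputePipelineState(function: function, completionHandler: { state, error in
        if let error = error {
            print("Pipeline creation failed: \(error)")
        }
        completion(state)
    })
}
```

The asynchronous variant is handy when you build several pipelines at launch and do not want to block the calling thread.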

Metal Library and Function

To create the compute pipeline state, you can use a MTLFunction object. The MTLFunction protocol defines the interface for an object that represents a single Metal shader function that can be executed by the device as part of a graphics shader or compute function. To obtain a MTLFunction object, first create a MTLLibrary object from the device and then retrieve a MTLFunction object from the library.

The name of the function is the name of the kernel or shader that we are going to write to process the image. A MTLLibrary object can contain Metal shading language code that is either compiled during the app build process or compiled at runtime from a string containing Metal shading language source code.
The best option is to compile the library at build time. This avoids the overhead of compiling the library at runtime, something very common in OpenGL.

Use a MTLDevice method to create a MTLLibrary object. To create a MTLLibrary object from a Metal library binary, call one of these MTLDevice methods:

  • newDefaultLibrary()
  • makeLibrary(filepath:) throws
  • makeLibrary(data:) throws

To create a MTLLibrary object by compiling source code, call one of these MTLDevice methods:

  • makeLibrary(source:options:) throws
  • makeLibrary(source:options:completionHandler:)

The MTLLibrary contains a set of functions (objects conforming to the MTLFunction protocol).
For a rendering pass, you specify a MTLFunction object as a vertex or fragment shader when you configure a MTLRenderPipelineDescriptor object. For data-parallel compute processing, you specify a MTLFunction object as a compute function when you create a MTLComputePipelineState object.

You can query the MTLFunction properties at runtime. The functionType property tells you what kind of function it is (MTLFunctionType.vertex, MTLFunctionType.fragment, or MTLFunctionType.kernel), and, for a vertex function, the vertexAttributes property describes the arguments that it takes.

The makeFunction(name:) method is used to fetch functions from the library, which makes that code available as a shader for either a MTLRenderPipelineState object for a render command encoder or for a MTLComputePipelineState for a compute command encoder.
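
A minimal sketch of that lookup, assuming the kernel was compiled into the app's default library at build time. It uses the current SDK name `makeDefaultLibrary()` (called `newDefaultLibrary()` in Swift 3), and `loadKernel(named:from:)` is a hypothetical helper name:

```swift
import Metal

/// Looks up a compute kernel by name in the app's default library.
/// Returns nil when the library or the function is missing, or when the
/// function is not a kernel. (`loadKernel` is a hypothetical helper name.)
func loadKernel(named name: String, from device: MTLDevice) -> MTLFunction? {
    guard let library = device.makeDefaultLibrary() else {
        print("No default Metal library found in the app bundle")
        return nil
    }
    guard let function = library.makeFunction(name: name) else {
        print("No function named \(name); available: \(library.functionNames)")
        return nil
    }
    // A compute pipeline state can only be built from a kernel function.
    guard function.functionType == .kernel else {
        return nil
    }
    return function
}
```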

Resources and Data

If you want to process data on the GPU, you need to convert the data into Metal resources. The MTLResource protocol defines the interface for any resource object that represents an allocation of GPU memory. In Metal, there are two types of MTLResource objects:

  • Buffers: objects conforming to the MTLBuffer protocol. They represent an allocation of unformatted memory that can contain any type of data. Buffers are often used for vertex, shader, and compute state data.
  • Textures: objects conforming to the MTLTexture protocol. They represent an allocation of formatted image data with a specified texture type and pixel format. Textures can be 1D, 2D or 3D. Texture objects are used as source textures for vertex, fragment, or compute functions, as well as to store graphics rendering output (that is, as an attachment).

To create a MTLBuffer object, you use the following MTLDevice methods:

  • makeBuffer(length:options:) creates a MTLBuffer object with a new storage allocation.
  • makeBuffer(bytes:length:options:) creates a MTLBuffer object by copying data from an existing storage allocation into a new allocation.
  • makeBuffer(bytesNoCopy:length:options:deallocator:) creates a MTLBuffer object that reuses an existing storage allocation and does not allocate any new storage.
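
As an illustration of the copying variant, here is a minimal sketch that uploads an array of floats and reads it back. The helper names `makeFloatBuffer` and `readFloats` are hypothetical, and the shared storage mode is an assumption chosen so that the CPU can read the contents directly:

```swift
import Metal

/// Copies `values` into a new shared-storage buffer.
/// (`makeFloatBuffer` is a hypothetical helper name.)
func makeFloatBuffer(_ values: [Float], on device: MTLDevice) -> MTLBuffer? {
    guard !values.isEmpty else { return nil }
    let length = values.count * MemoryLayout<Float>.stride
    return values.withUnsafeBytes { raw -> MTLBuffer? in
        guard let base = raw.baseAddress else { return nil }
        // makeBuffer(bytes:length:options:) copies the bytes into new storage.
        return device.makeBuffer(bytes: base, length: length, options: .storageModeShared)
    }
}

/// Reads `count` floats back from a CPU-visible buffer.
/// (`readFloats` is a hypothetical helper name.)
func readFloats(from buffer: MTLBuffer, count: Int) -> [Float] {
    let pointer = buffer.contents().bindMemory(to: Float.self, capacity: count)
    return Array(UnsafeBufferPointer(start: pointer, count: count))
}
```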

The following methods create and return a MTLTexture object:

  • The makeTexture(descriptor:) method of the MTLDevice protocol creates a texture object with a new storage allocation for the texture image data, using a MTLTextureDescriptor object to describe the texture’s properties.
  • The makeTextureView(pixelFormat:) and makeTextureView(pixelFormat:textureType:levels:slices:) methods of the MTLTexture protocol create and return a new texture object that shares the same storage allocation as the source texture object. Because they share the same storage, any changes to the pixels of the new texture are reflected in the source texture, and vice versa. For the newly created texture, these methods reinterpret the existing texture image data in the storage allocation of the source texture as if this data were stored in the new specified pixel format. The pixel format of the new texture must be compatible with the pixel format of the source texture.
  • In iOS, the makeTexture(descriptor:offset:bytesPerRow:) method of the MTLBuffer protocol creates and returns a new texture object that shares the storage allocation of the source buffer object as its texture image data. Because they share the same storage, any changes to the pixels of the new texture are reflected in the source buffer, and vice versa.

A MTLTextureDescriptor object is used to configure new texture objects. To create a new texture, first create a MTLTextureDescriptor object and set its property values, including the texture’s type, size (width, height, and depth), pixel format, number of mipmap levels, sample count (for multisampling), and memory allocation behavior. You can reuse a MTLTextureDescriptor object, modifying its property values as needed, to create more MTLTexture objects. Texture descriptor properties are used only during the creation of a MTLTexture object. After the texture has been created, property changes in its descriptor have no further effects on it.

After you create a texture, you can call replace(region:mipmapLevel:slice:withBytes:bytesPerRow:bytesPerImage:) or replace(region:mipmapLevel:withBytes:bytesPerRow:) to populate the storage allocation of the texture object with image data from system memory.

Conversely, you call getBytes(_:bytesPerRow:bytesPerImage:from:mipmapLevel:slice:) or getBytes(_:bytesPerRow:from:mipmapLevel:) to copy image data from a texture object into system memory.
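
To make the descriptor, upload, and download steps concrete, here is a minimal round-trip sketch. The function name `roundTrip` is hypothetical, and the RGBA8 pixel format and the usage flags are assumptions chosen for the illustration:

```swift
import Metal

/// Uploads RGBA8 pixels into a new 2D texture and reads them back.
/// (`roundTrip` is a hypothetical helper name.)
func roundTrip(pixels: [UInt8], width: Int, height: Int, device: MTLDevice) -> [UInt8]? {
    precondition(width > 0 && height > 0 && pixels.count == width * height * 4)
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                              width: width,
                                                              height: height,
                                                              mipmapped: false)
    // Allow compute kernels to both read from and write to the texture.
    descriptor.usage = [.shaderRead, .shaderWrite]
    guard let texture = device.makeTexture(descriptor: descriptor) else {
        return nil
    }
    let region = MTLRegionMake2D(0, 0, width, height)
    let bytesPerRow = width * 4
    // Copy from system memory into the texture's storage.
    pixels.withUnsafeBytes { raw in
        texture.replace(region: region, mipmapLevel: 0,
                        withBytes: raw.baseAddress!, bytesPerRow: bytesPerRow)
    }
    // Copy from the texture's storage back into system memory.
    var result = [UInt8](repeating: 0, count: pixels.count)
    result.withUnsafeMutableBytes { raw in
        texture.getBytes(raw.baseAddress!, bytesPerRow: bytesPerRow,
                         from: region, mipmapLevel: 0)
    }
    return result
}
```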

Resource Organization

Buffers and textures are organized in buffer and texture tables. There are a maximum of 31 entries in the buffer table and 31 entries in the texture table. Each entry is represented by an index. You need to pass this index to the command encoder. The index is then used by the Metal function (kernel) as an attribute of the function argument, as we will see later.

Thread Groups

When you perform parallel computing on the GPU, the execution of a kernel is decomposed into multiple threads. Threads are organized into thread groups. Threads in a thread group cooperate by sharing data through thread group memory and by synchronizing their execution to coordinate memory accesses to both device and thread group memory.

It is good practice to ask the compute pipeline state for the maximum number of threads per thread group, which you obtain from its maxTotalThreadsPerThreadgroup property. This number is a multiple of the thread execution width. For best performance, the total number of threads in the thread group should be a multiple of the pipeline state's threadExecutionWidth, and it must not exceed the maximum total threads per thread group.

You can define the size of a thread group using for example the size of the texture to be processed:
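
One possible way to do this is sketched below, under the assumption that a thread group is one execution width wide and as tall as the pipeline allows. The helper names `groupCount` and `dispatch` are hypothetical:

```swift
import Metal

/// Number of groups of `groupSize` needed to cover `extent` items, rounded up.
/// (`groupCount` is a hypothetical helper name.)
func groupCount(covering extent: Int, groupSize: Int) -> Int {
    return (extent + groupSize - 1) / groupSize
}

/// Sizes the grid from the texture and encodes the dispatch.
/// (`dispatch` is a hypothetical helper name.)
func dispatch(encoder: MTLComputeCommandEncoder,
              pipelineState: MTLComputePipelineState,
              texture: MTLTexture) {
    // One execution width wide, as tall as the pipeline permits.
    let width = pipelineState.threadExecutionWidth
    let height = max(1, pipelineState.maxTotalThreadsPerThreadgroup / width)
    let threadsPerThreadgroup = MTLSize(width: width, height: height, depth: 1)
    // Round up so that the grid covers the whole texture.
    let threadgroups = MTLSize(width: groupCount(covering: texture.width, groupSize: width),
                               height: groupCount(covering: texture.height, groupSize: height),
                               depth: 1)
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
}
```

Because the counts are rounded up, the grid can extend past the edges of the texture, so the kernel should ignore thread positions that fall outside the image.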

The last line of the previous chunk of code encodes the command to dispatch the thread groups.

Let’s build an example

Let’s build an example that shows how to process an image using Metal. Create a new Xcode project and call it MetalImage. Use Swift as the main language. Open ViewController.swift and add two outlets:

In the storyboard, add an image view and a slider to the same view controller view and connect the first outlet to the image view and the second outlet to the slider.

Interface Builder

I am going to use the slider to control an input parameter of the kernel function.
Go back to ViewController.swift and add import Metal to the view controller. Also import MetalKit (we will discuss MetalKit in a future post). Let’s add some properties to our view controller:

Then, in the viewDidLoad() method, let’s add the following:

The setupMetal() method creates the device, library, function and pipeline state objects:

In the viewDidAppear(_:), we can call the remaining methods:

Here, I dispatch the methods importTexture(), applyFilter() and image(from:) on a secondary queue. Finally, I dispatch the rendering of the final image on the main queue.

The first method (importTexture()) is used to convert a UIImage into a Metal texture. I could create a MTLTextureDescriptor object containing the specifications of the image and use it to create the texture. However, in iOS 9 Apple introduced MetalKit. This new framework offers an API to import images using a MTKTextureLoader object. So, I initialize a texture loader, passing it the Metal device.

Once I create a texture from the input image, I get its texture descriptor and use it to create the output texture.
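
A minimal sketch of this import step, not the project's actual importTexture() method. The function name `importTextures(from:device:)` is hypothetical, it takes a CGImage (for example, the `cgImage` of a UIImage), and disabling sRGB conversion is an assumption made so that the output texture gets a pixel format a compute kernel can write to:

```swift
import MetalKit

/// Loads `image` into an input texture and creates a matching output texture.
/// (`importTextures` is a hypothetical helper name.)
func importTextures(from image: CGImage,
                    device: MTLDevice) throws -> (input: MTLTexture, output: MTLTexture)? {
    let loader = MTKTextureLoader(device: device)
    // Assumption: skip sRGB conversion to get a plain, writable pixel format.
    let input = try loader.newTexture(cgImage: image, options: [.SRGB: false])

    // Describe an output texture with the same format and size as the input.
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: input.pixelFormat,
                                                              width: input.width,
                                                              height: input.height,
                                                              mipmapped: false)
    descriptor.usage = [.shaderRead, .shaderWrite]
    guard let output = device.makeTexture(descriptor: descriptor) else {
        return nil
    }
    return (input, output)
}
```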

The next method I need to build is applyFilter(). This is the central part of the chain. This is where I encode the commands into the command buffer and commit it for execution.

The last method called in viewDidAppear(_:), image(from:), is just boilerplate code that converts the output texture to a UIImage object so that you can display it on the screen.

Kernel Functions or Shaders

Now, we need to define the kernel function, or compute shader. This is a C++ function that is executed on the GPU. Here, Apple chose C++11 as the base language. I would have preferred to use Swift also for this piece of the chain, but this is what we get now. I will talk more about kernels or compute shaders in future posts. For the moment, add a Metal file to your Xcode project and add the following code:
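
The sketch below is a hypothetical kernel, not the filter shipped with the project: it multiplies each pixel's RGB values by a factor such as the slider value. The kernel name `adjustBrightness`, the texture indices 0 and 1, and the buffer index 0 are all assumptions. To keep the sketch self-contained, the shader source is held in a Swift string and compiled at runtime; in a real project the same source would live in the .metal file and be compiled at build time.

```swift
import Metal

// Hypothetical Metal shading language source for a brightness filter.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void adjustBrightness(texture2d<float, access::read>  inTexture  [[texture(0)]],
                             texture2d<float, access::write> outTexture [[texture(1)]],
                             constant float &factor                     [[buffer(0)]],
                             uint2 gid [[thread_position_in_grid]])
{
    // Ignore threads that fall outside the image when the grid overshoots it.
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) {
        return;
    }
    float4 color = inTexture.read(gid);
    outTexture.write(float4(color.rgb * factor, color.a), gid);
}
"""

/// Compiles the source above and returns the kernel function.
/// (`makeBrightnessFunction` is a hypothetical helper name.)
func makeBrightnessFunction(on device: MTLDevice) throws -> MTLFunction? {
    let library = try device.makeLibrary(source: kernelSource, options: nil)
    return library.makeFunction(name: "adjustBrightness")
}
```

The attribute indices in the kernel signature are the entries of the texture and buffer tables that the command encoder fills in on the Swift side.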

You can download the entire project from here.


In this post, I showed you how to set up a Metal project for parallel computing on the GPU. Next time, we will see how to use it to perform real-time image processing using the iPhone video camera. Check also this post that I wrote in 2012 to demonstrate how to build a custom video camera. With it, you will be able to compare Metal to the Accelerate framework and appreciate its computing speed.


Geppy Parziale (@geppyp) is cofounder of InvasiveCode (@invasivecode). He has developed iOS applications and taught iOS development since 2008. He worked at Apple as iOS and OS X Engineer in the Core Recognition team. He has developed several iOS and OS X apps and frameworks for Apple, and many of his development projects are top-grossing iOS apps that are featured in the App Store.


