Metal video processing for iOS and tvOS

Real-time video processing is a particular case of digital signal processing. Technologies such as Virtual Reality (VR) and Augmented Reality (AR) strongly rely on real-time video processing to extract semantic information from each video frame and use it for object detection and tracking, face recognition, and other computer vision techniques.

Processing video in real time on a mobile device is quite a complex task because of the limited resources available on smartphones and tablets, but you can achieve amazing results with the right techniques.

In this post, I will show you how to process a video in real time using the Metal framework, leveraging the power of the GPU. In one of our previous posts, you can check the details of how to set up the Metal rendering pipeline and run compute shaders for image processing. Here, we are going to do something similar, but this time we will process video frames.

AV Foundation

Before we proceed with the implementation of the video processing in Metal, let's take a quick look at the AV Foundation framework and the components we need to play a video. In a previous post, I demonstrated how to use AV Foundation to capture video with the iPhone or iPad camera. Here, we are going to use a different set of AV Foundation classes to read and play a video file on an iOS or tvOS device.

You can play a video on the iPhone or the Apple TV in different ways, but for the purpose of this post, I am going to use the AVPlayer class, extract each video frame, and pass it to Metal for real-time processing on the GPU.

An AVPlayer is a controller object used to manage the playback and timing of a media asset. You can use an AVPlayer to play local and remote file-based media, such as video and audio files. Besides the standard controls to play, pause, change the playback rate, and seek to various points in time within the media’s timeline, an AVPlayer object offers the possibility to access each single frame of a video asset through an AVPlayerItemVideoOutput object. This object returns a reference to a Core Video pixel buffer (an object of type CVPixelBuffer). Once you get the pixel buffer, you can convert it into a Metal texture and process it on the GPU.

Creating an AVPlayer is very simple. You can either use the file URL of the video or an AVPlayerItem object. So, to initialize an AVPlayer, you can use one of the following init methods:
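Both initializers are part of the AVPlayer API; `videoURL` below is a placeholder for your own media URL:

```swift
// Initialize a player directly from a file or remote URL...
let player = AVPlayer(url: videoURL)

// ...or from an AVPlayerItem wrapping the same URL
let playerItem = AVPlayerItem(url: videoURL)
let itemPlayer = AVPlayer(playerItem: playerItem)
```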

An AVPlayerItem stores a reference to an AVAsset object, which represents the media to be played. An AVAsset is an abstract, immutable class used to model timed audiovisual media such as videos and audio. Since AVAsset is an abstract class, you cannot use it directly. Instead, you should use one of the two concrete subclasses provided by the framework: AVURLAsset or AVComposition (and its mutable subclass, AVMutableComposition). An AVURLAsset is a concrete subclass of AVAsset that you can use to initialize an asset from a local or remote URL. An AVComposition allows you to combine media data from multiple file-based sources in a custom temporal arrangement.

In this post, I am going to use the AVURLAsset. The following snippet of source code highlights how to combine all these AV Foundation classes together:
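A sketch of how these classes fit together; the `video.mp4` file in the app bundle is just an example asset:

```swift
import AVFoundation

// Build the asset from a local file URL (video.mp4 in the app bundle is an assumption)
guard let url = Bundle.main.url(forResource: "video", withExtension: "mp4") else {
    fatalError("Video file not found in the app bundle")
}
let asset = AVURLAsset(url: url)

// Wrap the asset in a player item and hand it to the player
let playerItem = AVPlayerItem(asset: asset)
let player = AVPlayer(playerItem: playerItem)
```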

To extract the frames from the video file while the player is playing, you need to use an AVPlayerItemVideoOutput object. Once you get a video frame, you can use Metal to process it on the GPU. Let's now build a full example to demonstrate it.

Video Processor App

Create a new Xcode project. Choose an iOS Single View Application and name it VideoProcessor. Open the ViewController.swift file and import AVFoundation.

Since we need an AVPlayer, let's add the following property to the view controller:
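Something as simple as this will do:

```swift
// The player that drives the playback of the video asset
private var player = AVPlayer()
```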

As discussed above, the player provides access to each single video frame through an AVPlayerItemVideoOutput object. So, let's add an additional property to the view controller:
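A possible implementation, creating the output with a 32-bit BGRA pixel format:

```swift
// Video output used to pull each frame from the player item
private var videoOutput: AVPlayerItemVideoOutput = {
    // Request 32-bit BGRA pixel buffers from the player item
    let attributes: [String: Any] = [
        kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32BGRA)
    ]
    return AVPlayerItemVideoOutput(pixelBufferAttributes: attributes)
}()
```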

For this property, attributes is a dictionary defining the pixel buffer format. Here, I am declaring that each pixel is 32 bits, organized in 4 bytes (8 bits each) representing the blue, green, red, and alpha channels. Then, I use the attributes dictionary to initialize the AVPlayerItemVideoOutput.

In the viewDidLoad() method, I load the video:
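Along these lines, again assuming a `video.mp4` file bundled with the app:

```swift
override func viewDidLoad() {
    super.viewDidLoad()

    // video.mp4 in the app bundle is just an example; use your own asset
    guard let url = Bundle.main.url(forResource: "video", withExtension: "mp4") else { return }
    let asset = AVURLAsset(url: url)
    let playerItem = AVPlayerItem(asset: asset)

    // Attach the video output so we can pull pixel buffers while playing
    playerItem.add(videoOutput)
    player.replaceCurrentItem(with: playerItem)
}
```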

Since I want to read the video frame by frame, I need a timer that repeatedly asks the AVPlayerItemVideoOutput object to provide each frame. I am going to use a display link. This is a special timer that fires at each screen refresh. So, add the following property to the view controller:
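A lazy property works well here; the selector name matches the readBuffer(_:) method implemented below (the exact run loop mode constant depends on your Swift version):

```swift
// Display link that fires at every screen refresh to pull a new frame
private lazy var displayLink: CADisplayLink = {
    let displayLink = CADisplayLink(target: self, selector: #selector(readBuffer(_:)))
    displayLink.add(to: .current, forMode: .common)
    // Keep the display link paused until the video starts playing
    displayLink.isPaused = true
    return displayLink
}()
```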

The display link fires at 60 frames per second and executes the readBuffer(_:) method every ~16.6 ms. In this property, I pause the display link immediately. I will restart it when I start playing the video in the viewDidAppear(_:) method:
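The viewDidAppear(_:) override could look like this:

```swift
override func viewDidAppear(_ animated: Bool) {
    super.viewDidAppear(animated)

    // Start pulling frames and play the video
    displayLink.isPaused = false
    player.play()
}
```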

Now, let's implement the readBuffer() method:
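A sketch of the method; metalView and its pixelBuffer and inputTime properties are the Metal view defined later in this post:

```swift
@objc private func readBuffer(_ sender: CADisplayLink) {
    // Estimate the time of the next vsync and map it to the item's timeline
    let nextVSync = sender.timestamp + sender.duration
    let currentTime = videoOutput.itemTime(forHostTime: nextVSync)

    if videoOutput.hasNewPixelBuffer(forItemTime: currentTime),
       let pixelBuffer = videoOutput.copyPixelBuffer(forItemTime: currentTime,
                                                     itemTimeForDisplay: nil) {
        // Hand the frame and its timestamp to the Metal view
        metalView.pixelBuffer = pixelBuffer
        metalView.inputTime = currentTime.seconds
    }
}
```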

The method readBuffer(_:) executes every ~16.6 ms. It asks the player item video output for the most recent frame as a Core Video pixel buffer. I then take the pixel buffer and its timestamp and pass them to a Metal view (see later).

Now, I need to set up Metal. Open the Main.storyboard file and add a new view on top of the view controller view. Expand this new view to completely cover the view controller view. Then, add Auto Layout constraints so that it is pinned to the top, bottom, leading, and trailing edges of the view controller view. After that, add a new Swift file to the project and name it MetalView. Edit this new file in the following way:
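For now, the file only needs the class declaration; we will fill it in shortly:

```swift
import UIKit
import MetalKit

// Custom Metal-backed view that will display the processed video frames
class MetalView: MTKView {

}
```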

This is a subclass of MTKView (the Metal view defined in the MetalKit framework). Open the storyboard again and set the class of the previously added view to MetalView. Add the following outlet to the view controller:
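```swift
@IBOutlet weak var metalView: MetalView!
```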

and connect this outlet to the MetalView in the storyboard.

I am going to set up the Metal view so that, when I launch the application, the display link defined in the view controller will redraw the view by executing the MetalView draw(_:) method. First, let's add the draw(_:) method to the MetalView class:
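```swift
override func draw(_ rect: CGRect) {
    // Wrap the rendering in its own autorelease pool to release
    // per-frame Core Video objects as soon as possible
    autoreleasepool {
        if rect.width > 0 && rect.height > 0 {
            render(self)
        }
    }
}
```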

The draw(_:) method sets its own autorelease pool and calls the render(_:) method. This method will do most of the work: in the render(_:) method, we are going to convert the CVPixelBuffer obtained from the AVPlayerItemVideoOutput to a Metal texture and process it on the GPU. This step is very important, and the performance of your application depends strictly on how you perform this conversion. So, it is very important that you do it correctly.

Let's prepare the MetalView class. First, add the following imports to the MetalView class:
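```swift
import MetalKit
import CoreVideo
```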

Then, add the following properties to the MetalView:
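```swift
// The current video frame; setting it triggers a redraw of the view
var pixelBuffer: CVPixelBuffer? {
    didSet {
        setNeedsDisplay()
    }
}

// Timestamp of the current frame, in seconds
var inputTime: Double = 0

// Core Video cache used to convert pixel buffers into Metal textures
private var textureCache: CVMetalTextureCache?
private var commandQueue: MTLCommandQueue?
private var computePipelineState: MTLComputePipelineState?
```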

The first two properties represent the pixel buffer obtained in the readBuffer(_:) method of the view controller and the timestamp of each video frame, respectively. Every time the view controller sets the pixel buffer property, the property observer executes setNeedsDisplay(), which forces the redraw of the MetalView.

The textureCache property is a cache offered by the Core Video framework to speed up the conversion of a Core Video pixel buffer to a Metal texture (see later). The commandQueue and the computePipelineState are two important components of the Metal pipeline. Please refer to this post for further details, or check the Apple documentation. In the same post, you will also find a helper extension for the MTLTexture class. I need the same extension here. So, add the following code to the MetalView class:
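I am not reproducing the original extension verbatim; a sketch computing suitable thread group sizes for a compute dispatch could look like this:

```swift
extension MTLTexture {
    // Threads per thread group (8x8 is a common choice)
    var threadGroupCount: MTLSize {
        return MTLSize(width: 8, height: 8, depth: 1)
    }

    // Number of thread groups needed to cover the whole texture
    var threadGroups: MTLSize {
        let count = threadGroupCount
        return MTLSize(width: (width + count.width - 1) / count.width,
                       height: (height + count.height - 1) / count.height,
                       depth: 1)
    }
}
```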

Now, let's write the most important part of the source code. First, I need an init method to initialize the MetalView. Then, I need to implement the render(_:) method.

Since the MetalView was set up in the storyboard, I can only initialize the view using the init(coder:) method. So, let's add the following init method to the MetalView class:
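A sketch of the initializer; the colorKernel function name must match the Metal shader we write later:

```swift
required init(coder: NSCoder) {
    super.init(coder: coder)

    // Use the default system GPU
    device = MTLCreateSystemDefaultDevice()

    // The external display link drives the drawing, so disable the view's internal timer
    isPaused = true
    enableSetNeedsDisplay = true

    // Allow the compute shader to write directly into the drawable's texture
    framebufferOnly = false

    guard let device = device else { return }

    // Create the Core Video texture cache used to convert pixel buffers into Metal textures
    CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)

    // Build the compute pipeline around the colorKernel function
    if let library = device.makeDefaultLibrary(),
       let kernel = library.makeFunction(name: "colorKernel") {
        computePipelineState = try? device.makeComputePipelineState(function: kernel)
    }
    commandQueue = device.makeCommandQueue()
}
```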

I added comments for each line of code. Check the Apple documentation for additional details. The last step is the implementation of the render(_:) method:
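A possible implementation: convert the pixel buffer into a texture through the cache, then dispatch the compute kernel over the whole frame:

```swift
private func render(_ view: MTKView) {
    guard let pixelBuffer = pixelBuffer,
          let textureCache = textureCache,
          let drawable = view.currentDrawable,
          let commandQueue = commandQueue,
          let pipelineState = computePipelineState else { return }

    // Convert the CVPixelBuffer into a Metal texture through the texture cache
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    var cvTexture: CVMetalTexture?
    CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault, textureCache, pixelBuffer,
                                              nil, .bgra8Unorm, width, height, 0, &cvTexture)
    guard let cvTexture = cvTexture,
          let inputTexture = CVMetalTextureGetTexture(cvTexture) else { return }

    // Encode a compute pass: input frame at index 0, drawable at index 1, time at buffer 0
    guard let commandBuffer = commandQueue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return }
    encoder.setComputePipelineState(pipelineState)
    encoder.setTexture(inputTexture, index: 0)
    encoder.setTexture(drawable.texture, index: 1)
    var time = Float(inputTime)
    encoder.setBytes(&time, length: MemoryLayout<Float>.size, index: 0)

    // Cover the whole texture with 8x8 thread groups
    // (equivalently, use the MTLTexture helper extension shown earlier)
    let groupSize = MTLSize(width: 8, height: 8, depth: 1)
    let groupCount = MTLSize(width: (width + 7) / 8, height: (height + 7) / 8, depth: 1)
    encoder.dispatchThreadgroups(groupCount, threadsPerThreadgroup: groupSize)
    encoder.endEncoding()

    // Present the processed frame and submit the work to the GPU
    commandBuffer.present(drawable)
    commandBuffer.commit()
}
```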

Now, Metal and AV Foundation are completely set up. We are still missing the Metal shader. In the init(coder:) method, I defined a Metal function named colorKernel. So, add a new file to the project and name it ColorKernel.metal. Initially, I am going to implement a pass-through shader to simply visualize the video as it is. So, edit the ColorKernel file in the following way:
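The texture and buffer indices match the ones set in the render(_:) method:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void colorKernel(texture2d<float, access::read>  inTexture  [[ texture(0) ]],
                        texture2d<float, access::write> outTexture [[ texture(1) ]],
                        constant float &time                       [[ buffer(0) ]],
                        uint2 gid [[ thread_position_in_grid ]])
{
    // Pass-through: copy each input pixel to the output unchanged
    float4 color = inTexture.read(gid);
    outTexture.write(color, gid);
}
```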

Let's run the project. You should see the video playing on the screen of your device.
Notice that we are not yet using the time input.

In my test, I played the following chunk of a video:

Let's now add a special effect to the video. How? Well, either you know how to create the effect using some math, or you can look for one on the Internet. There are different websites where people publish their creations; one of them is ShaderToy. Most of the shaders you will find on the Internet are written for OpenGL in the GLSL programming language. However, it is quite simple to port them to the Metal Shading Language. I took one of the visual effects from ShaderToy, modified it a little bit, and ported it to the Metal Shading Language. Here is my new kernel function (compute shader):
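This is not the exact shader from the original post; as an example of a time-driven effect in the same spirit, here is a simple animated wave distortion that replaces the contents of ColorKernel.metal:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void colorKernel(texture2d<float, access::sample> inTexture  [[ texture(0) ]],
                        texture2d<float, access::write>  outTexture [[ texture(1) ]],
                        constant float &time                        [[ buffer(0) ]],
                        uint2 gid [[ thread_position_in_grid ]])
{
    constexpr sampler s(address::clamp_to_edge, filter::linear);

    // Normalized coordinates of the current pixel
    float2 uv = float2(gid) / float2(outTexture.get_width(), outTexture.get_height());

    // Animated wave distortion driven by the frame timestamp
    uv.x += 0.02 * sin(uv.y * 30.0 + time * 3.0);
    uv.y += 0.02 * cos(uv.x * 30.0 + time * 3.0);

    outTexture.write(inTexture.sample(s, uv), gid);
}
```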

Once done, you can run the application again on your iOS device. The following video shows the final result of applying the new shader to the previous video:

Really cool, right?


Working with shaders is really fun. You can create spectacular effects and animations within your app. You can also combine shaders with SceneKit and create special animations that the Core Animation framework cannot provide.

In a very similar way, you can use shaders to process the video frames and extract semantic information to track objects or people in the video, and so on. There is a lot of math involved in shader programming, but it helps you appreciate all those formulas and functions you learned in school. Next time, I will show you how to leverage your math knowledge to build really beautiful visual effects.

Have fun and see you at WWDC 2017.


Geppy Parziale (@geppyp) is cofounder of InvasiveCode. He has developed many iOS applications and taught iOS development to many engineers around the world since 2008. He worked at Apple as iOS and OS X Engineer in the Core Recognition team. He has developed several iOS and OS X apps and frameworks for Apple, and many of his development projects are top-grossing iOS apps that are featured in the App Store. Geppy is an expert in computer vision and machine learning.


