Natural Language Processing and Speech Recognition in iOS

Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) and Computational Linguistics (CL) concerned with the interactions between computers and human natural languages. NLP is related to the area of Human-Computer Interaction (HCI) and the ability of a computer program to understand human speech as it is spoken.

Speech Recognition (SR) is a sub-field of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers.

The development of NLP and SR applications is challenging, because computers traditionally require humans to 'speak' to them in a programming language that is precise, unambiguous and highly structured. Let's see what iOS 10 currently offers for these technologies.

Linguistic Tagger

The NSLinguisticTagger is a class of the Foundation framework. Introduced with iOS 5, this class can be used to segment natural-language text and tag it with information, such as parts of speech. It can also tag languages, scripts, stem forms of words, etc.

Combined with the new Speech framework (available in iOS 10), the linguistic tagger can be applied to the recognition of live and prerecorded speech, receiving transcriptions, alternative interpretations, and confidence levels.

To use the linguistic tagger, you create an instance of NSLinguisticTagger using the init(tagSchemes:options:) method. This init requires an array of linguistic tag schemes and a set of options (for example, to omit white spaces, punctuation, or to join names).

The API provides many linguistic tag schemes: NSLinguisticTagSchemeTokenType, NSLinguisticTagSchemeLexicalClass, NSLinguisticTagSchemeNameType, NSLinguisticTagSchemeNameTypeOrLexicalClass, NSLinguisticTagSchemeLemma, NSLinguisticTagSchemeLanguage, and NSLinguisticTagSchemeScript. Each of these tag schemes provides different information related to the elements of a sentence.

Let's make a small example. Suppose we want to analyze the following sentence:

Do you know about the legendary iOS training in San Francisco provided by InvasiveCode?

Here is the source code:
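A possible version, assuming Swift 3 and the iOS 10 Foundation API (the variable names are illustrative):

```swift
import Foundation

let sentence = "Do you know about the legendary iOS training in San Francisco provided by InvasiveCode?"

// Omit white spaces and punctuation, and join multi-word names like "San Francisco".
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation, .joinNames]
let tagger = NSLinguisticTagger(tagSchemes: [NSLinguisticTagSchemeLexicalClass],
                                options: Int(options.rawValue))
tagger.string = sentence

let range = NSRange(location: 0, length: (sentence as NSString).length)
tagger.enumerateTags(in: range,
                     scheme: NSLinguisticTagSchemeLexicalClass,
                     options: options) { tag, tokenRange, _, _ in
    let token = (sentence as NSString).substring(with: tokenRange)
    print("\(token): \(tag)")
}
```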

When executed, this code prints a tag for each element of the sentence, showing its lexical class (verb, pronoun, noun, preposition, and so on).

Which languages do you speak?

You can use the linguistic tagger with other spoken languages too. Indeed, the linguistic tagger recognizes the language of each part of a sentence.

In the following example, I input an Italian sentence:
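A sketch of this, using the NSLinguisticTagSchemeLanguage scheme (the Italian sentence below is illustrative):

```swift
import Foundation

let frase = "La formazione su iOS si terrà a San Francisco." // illustrative Italian sentence
let tagger = NSLinguisticTagger(tagSchemes: [NSLinguisticTagSchemeLanguage], options: 0)
tagger.string = frase

// Ask the tagger for the language tag at the beginning of the string.
let language = tagger.tag(at: 0,
                          scheme: NSLinguisticTagSchemeLanguage,
                          tokenRange: nil,
                          sentenceRange: nil)
print(language ?? "undetermined")
```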

Once you execute the above source code, the value of the constant language is "it", since the sentence is in Italian.

Let's see now what iOS offers for speech recognition.

Speech Framework

Introduced in iOS 10, the Speech framework performs speech recognition by communicating with Apple's servers or using an on-device speech recognizer, if available.

The speech recognizer is not available for every language. To find out if the speech recognizer is available for a specific spoken language, you can request the list of the supported languages using the class method supportedLocales() defined in the SFSpeechRecognizer class.

Because your app may need to connect to the Apple servers to perform recognition, it is essential that you respect the privacy of your users and treat their utterances as sensitive data. Hence, you must get the user's explicit permission before you initiate speech recognition. Similarly to other iOS frameworks (for example, Core Location), you request user permission by adding the NSSpeechRecognitionUsageDescription key to the app Info.plist and providing a sentence explaining to the user why your application needs to access the speech recognizer. After that, you request user authorization in your application using the class method requestAuthorization(_:). When this method is executed, the application presents an alert to the user requesting authorization to access the speech recognizer. If the user grants access to the recognizer, then you can use it.

Once the user grants your application permission to use the recognizer, you can create an instance of the SFSpeechRecognizer class and a speech recognition request (an instance of SFSpeechRecognitionRequest). You can create two types of requests: SFSpeechURLRecognitionRequest and SFSpeechAudioBufferRecognitionRequest. The first type performs the recognition of a prerecorded on-disk audio file. The second request type performs live audio recognition (using the iPhone or iPad microphone).

Before starting a speech recognition request, you should check if the speech recognizer is available using the isAvailable property of the SFSpeechRecognizer class. If the recognizer is available, then you pass the speech recognition request to the SFSpeechRecognizer instance using either the recognitionTask(with:delegate:) method or the recognitionTask(with:resultHandler:) method. Both methods return an SFSpeechRecognitionTask and start the speech recognition.
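For a prerecorded file, the flow could be sketched like this (here `audioFileURL` is a hypothetical URL pointing to an audio file on disk):

```swift
import Speech

// audioFileURL is a hypothetical URL of a prerecorded audio file on disk.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
if recognizer?.isAvailable == true {
    let request = SFSpeechURLRecognitionRequest(url: audioFileURL)
    recognizer?.recognitionTask(with: request) { result, error in
        if let result = result, result.isFinal {
            // Print the most likely transcription of the recorded audio.
            print(result.bestTranscription.formattedString)
        }
    }
}
```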

During the speech recognition you can use the speech recognition task to check the status of the recognition. The possible states are: starting, running, finishing, canceling, and completed. You can also cancel and finish the speech recognition task using the cancel() and finish() methods. If you do not call finish(), the task continues to run, so be sure to call this method when the audio source is exhausted.

During the speech recognition you can use the SFSpeechRecognitionTaskDelegate protocol for a fine-grained control of the speech recognition task. The protocol provides the following methods:
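As of iOS 10, the optional delegate methods look roughly like this:

```swift
// Called when the task first detects speech in the source audio.
optional func speechRecognitionDidDetectSpeech(_ task: SFSpeechRecognitionTask)

// Called for each new hypothesized transcription as recognition progresses.
optional func speechRecognitionTask(_ task: SFSpeechRecognitionTask,
                                    didHypothesizeTranscription transcription: SFTranscription)

// Called only for final recognition of an utterance.
optional func speechRecognitionTask(_ task: SFSpeechRecognitionTask,
                                    didFinishRecognition recognitionResult: SFSpeechRecognitionResult)

// Called when the task is no longer accepting new audio input.
optional func speechRecognitionTaskFinishedReadingAudio(_ task: SFSpeechRecognitionTask)

// Called when the task has been cancelled.
optional func speechRecognitionTaskWasCancelled(_ task: SFSpeechRecognitionTask)

// Called when recognition has concluded, successfully or not.
optional func speechRecognitionTask(_ task: SFSpeechRecognitionTask,
                                    didFinishSuccessfully successfully: Bool)
```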

Since the speech recognition is a network-based service, some limits are enforced by Apple. In this way, the service can remain freely available to all apps. Individual devices may be limited in the number of recognitions that can be performed per day and an individual app may be throttled globally, based on the number of requests it makes per day. For these reasons, your application must be prepared to handle the failures caused by reaching the speech recognition limits.

Mixing NLP and Speech Recognition

Let's start integrating the speech recognition framework in a new app named YourSpeech. Create a new single-view application. Open the Info.plist file and add the "Privacy - Speech Recognition Usage Description" or the NSSpeechRecognitionUsageDescription key. Then, provide a sentence explaining to the users how they can use speech recognition in your app. Since we are going to use the microphone, we also need to ask permission to access it. So, also add "Privacy - Microphone Usage Description" or the NSMicrophoneUsageDescription key and provide a sentence explaining why your app wants to access the microphone.

Open the ViewController.swift file and add import Speech to import the Speech module. In the same view controller, let's add the following property:
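A possible version of this property, using a lazy initializer so that the delegate can be set to the view controller (the locale and property name are illustrative):

```swift
private lazy var speechRecognizer: SFSpeechRecognizer? = {
    // The initializer is failable and returns nil if the locale is not supported.
    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    recognizer?.delegate = self
    return recognizer
}()
```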

Here, I instantiate the SFSpeechRecognizer with the American English locale. Then, I set the view controller as the delegate of the speech recognizer. If the speech recognizer cannot be initialized, then nil is returned. Also add the SFSpeechRecognizerDelegate protocol to the class declaration:
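The class declaration then looks like this:

```swift
import UIKit
import Speech

class ViewController: UIViewController, SFSpeechRecognizerDelegate {
    // ...
}
```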

Let's also add the following outlet for a button that we will use to start the voice recording:
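For example (the outlet name is illustrative, but it is used in the rest of this walkthrough):

```swift
@IBOutlet weak var startRecordingButton: UIButton!
```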

In the viewDidLoad() method, you can add the following lines of code to print the spoken languages supported by the speech recognizer:
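A minimal sketch:

```swift
override func viewDidLoad() {
    super.viewDidLoad()
    // Print the identifier of each locale supported by the speech recognizer.
    for locale in SFSpeechRecognizer.supportedLocales() {
        print(locale.identifier)
    }
}
```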

Then, let's check for the user authorization status. So, add this source code to the viewDidLoad method:
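One way to do this, assuming the startRecordingButton outlet from above:

```swift
SFSpeechRecognizer.requestAuthorization { authStatus in
    // The handler may run on a background queue; hop to the main queue for UI work.
    OperationQueue.main.addOperation {
        switch authStatus {
        case .authorized:
            self.startRecordingButton.isEnabled = true
        case .denied, .restricted, .notDetermined:
            self.startRecordingButton.isEnabled = false
        }
    }
}
```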

If the user grants permission, you don't have to request it again. After the user grants your app permission to perform speech recognition, create a speech recognition request.

Let's add the following property to the view controller:
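For example (this requires importing AVFoundation):

```swift
private let audioEngine = AVAudioEngine()
```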

The audio engine will manage the recording and the microphone.

The startRecordingButton will execute the following action:
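A sketch of this action (the action name is illustrative; startRecording() is defined below):

```swift
@IBAction func toggleRecording(_ sender: UIButton) {
    if audioEngine.isRunning {
        // Stop the audio engine and tell the recognition request that the audio ended.
        audioEngine.stop()
        recognitionRequest?.endAudio()
        startRecordingButton.isEnabled = false
        startRecordingButton.setTitle("Stopping", for: .disabled)
    } else {
        try? startRecording()
        startRecordingButton.setTitle("Stop recording", for: [])
    }
}
```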

In this action method, I check if the audio engine is running. If it is running, I stop it and tell the recognition request that the audio ended. Then, I disable the startRecordingButton and set its title to "Stopping". If the audio engine is not running, I call the method startRecording (see below) and set the startRecordingButton title to "Stop recording".

The recognitionRequest is a property of the view controller:
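It can be declared like this:

```swift
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
```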

As explained before, this is one of the two types of recognition requests the Speech framework can perform. Before defining the startRecording() method, let's add a new property to the view controller:
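For example:

```swift
private var recognitionTask: SFSpeechRecognitionTask?
```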

This property defines the speech recognition task. The startRecording() method does most of the work:
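A sketch of this method, following the common iOS 10 pattern (error handling kept minimal; textView is the outlet connected below):

```swift
private func startRecording() throws {
    // Cancel any previous recognition task.
    recognitionTask?.cancel()
    recognitionTask = nil

    // Configure the audio session for recording.
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(AVAudioSessionCategoryRecord)
    try audioSession.setMode(AVAudioSessionModeMeasurement)
    try audioSession.setActive(true, with: .notifyOthersOnDeactivation)

    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    guard let inputNode = audioEngine.inputNode,
          let recognitionRequest = recognitionRequest else { return }

    // Report partial results so the text view updates while the user speaks.
    recognitionRequest.shouldReportPartialResults = true

    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
        var isFinal = false
        if let result = result {
            self.textView.text = result.bestTranscription.formattedString
            isFinal = result.isFinal
        }
        if error != nil || isFinal {
            // Stop recording and release the request and the task.
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
            self.startRecordingButton.isEnabled = true
            self.startRecordingButton.setTitle("Start recording", for: [])
        }
    }

    // Feed the microphone audio into the recognition request.
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()
}
```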

You will need to add the following properties to the view controller:

Additionally, add a text view to the view controller in the storyboard and connect the following outlet to it:
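For example:

```swift
@IBOutlet weak var textView: UITextView!
```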


I demonstrated how to use the linguistic tagger to analyze text and how to perform speech recognition with the new Speech framework. You can combine these two functionalities with other iOS frameworks, obtaining incredible results.

Happy coding!


Eva Diaz-Santana (@evdiasan) is cofounder of InvasiveCode. She develops iOS applications and has taught iOS development since 2008. Eva also worked at Apple as Cocoa Architect and UX designer. She is an expert in remote sensing and 3D reconstruction.


