Real-time speech recognition on Apple Watch using WebSocket + AVAudioEngine

Published: 2021.06.14 Last updated: 2025.03.04

m-hayashi (Masaki Hayashi)

Hello everyone, nice to meet you. My name is Masaki Hayashi.
I am in charge of iOS and watchOS app development at Advanced Media Inc.

Introduction

Previously, when I was involved in developing a watchOS app for business use, I lamented how little could be implemented on watchOS. Recently, however, I have come to think that this was simply due to my own lack of knowledge, and that there is actually a lot that can be done with the Watch. So, as the title suggests, this time I will develop a watchOS app that performs real-time speech recognition using WebSocket and AVAudioEngine. For the speech recognition service, I will use our company's AmiVoice Cloud Platform.

Finished app

[Image: the finished app]

Development environment

  • Xcode 12.4
  • Swift 5.0
  • watchOS 7.3

Implementation

Steps

STEP 1: Register with AmiVoice Cloud Platform (ACP)

STEP 2: Launching the project

STEP 3: Screen configuration

STEP 4: Acquire audio data from the device

STEP 5: Speech recognition via WebSocket using ACP

 

STEP 1: Register with AmiVoice Cloud Platform (ACP)

To perform real-time speech recognition, you must first register with ACP.
For information on how to register, please see the article below.

Try AmiVoice Cloud Platform (AmiVoice Cloud Platform Tech Blog)

STEP 2: Launching the project

Open Xcode, select the Watch App template, and create your project.

[Image: selecting the Watch App template in Xcode]

STEP 3: Screen configuration

The screen consists of a microphone button and a label that displays the recognition results, as shown below.

[Images: the app screen with the microphone button and the recognition result label]

After placing the button and label components in Interface.storyboard, connect them to InterfaceController.swift.

Also, since we will be using the device's microphone, add Privacy - Microphone Usage Description to Info.plist.

InterfaceController.swift
import WatchKit
import Foundation

class InterfaceController: WKInterfaceController {
    @IBOutlet weak var recordButton: WKInterfaceButton!
    @IBOutlet weak var resultLabel: WKInterfaceLabel!

    override func awake(withContext context: Any?) {
        // Configure interface objects here.
    }

    // Called when the microphone button is tapped
    @IBAction func tapRecord() {
    }
}

 

STEP 4: Acquire audio data from the device

To achieve real-time speech recognition in a watchOS app, you must first acquire audio data from the device's microphone. We use AVAudioEngine for this.

First, import AVFoundation and add an AVAudioEngine variable to the InterfaceController class:

InterfaceController.swift
import WatchKit
import Foundation
import AVFoundation

class InterfaceController: WKInterfaceController {

    var engine: AVAudioEngine!

}

Next, add the following code to start and stop the supply of audio data from the device's microphone input when a button is pressed.

InterfaceController.swift
// start record
func start() {
    self.engine.prepare()
    do {
        try engine.start()
    } catch {
        return
    }
}

// stop record
func stop() {
    self.engine.stop()
}

To link start() and stop() to the button action, add the following code to tapRecord():

InterfaceController.swift
class InterfaceController: WKInterfaceController {
    @IBOutlet weak var recordButton: WKInterfaceButton!
    @IBOutlet weak var resultLabel: WKInterfaceLabel!
    // Microphone button flag
    var isStart = true

    // Called when the microphone button is tapped
    @IBAction func tapRecord() {
        if isStart {
            start()
        } else {
            stop()
        }
        isStart.toggle()
    }
}

Finally, add the AVAudioEngine setup.

InterfaceController.swift
func setup() {
    self.engine = AVAudioEngine()

    // Convert AVAudioPCMBuffer to Data
    func toData(buffer: AVAudioPCMBuffer) -> Data {
        let channelCount = 1 // Given PCMBuffer channel count is 1
        let channels = UnsafeBufferPointer(start: buffer.int16ChannelData, count: channelCount)
        // Use frameLength (frames actually written), not frameCapacity, so no stale bytes are included
        let ch0Data = NSData(bytes: channels[0], length: Int(buffer.frameLength * buffer.format.streamDescription.pointee.mBytesPerFrame)) as Data
        return ch0Data
    }

    // Output format
    guard let outputFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16000, channels: AVAudioChannelCount(1), interleaved: true) else { return }
    // Input format
    let inputFormat = self.engine.inputNode.outputFormat(forBus: 0)
    // Converter
    let converter = AVAudioConverter(from: inputFormat, to: outputFormat)

    self.engine.inputNode.installTap(onBus: AVAudioNodeBus(0), bufferSize: AVAudioFrameCount(3200), format: inputFormat) { (pcmBuffer, time) in
        // Hand the captured buffer to the converter
        let inputBlock: AVAudioConverterInputBlock = { inNumPackets, outStatus in
            outStatus.pointee = .haveData
            return pcmBuffer
        }
        let sampleRateRatio = 16000 / inputFormat.sampleRate
        let outputFrameCapacity = AVAudioFrameCount(Double(pcmBuffer.frameCapacity) * sampleRateRatio)
        guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: outputFrameCapacity) else { return }
        var error: NSError? = nil
        // convert input format to output format
        converter?.convert(to: outputBuffer, error: &error, withInputFrom: inputBlock)
        _ = toData(buffer: outputBuffer)
    }
}

In setup(), the installTap(onBus:bufferSize:format:block:) block downsamples the audio: the Apple Watch microphone input is sampled at 44100 Hz, so it is converted to 16000 Hz to conform to the ACP standard. The audio captured within the installTap(onBus:bufferSize:format:block:) block is then converted to Data via toData(buffer:).

Because setup() should run when the watchOS app starts, call it from awake(withContext:).
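To make the downsampling arithmetic concrete, here is a minimal sketch of how the output buffer size relates to the input buffer size. The function name and parameters are hypothetical and not part of the article's code. Unlike the listing above, which truncates, this sketch rounds up so that a partial frame is never dropped.

```swift
import Foundation

/// Number of frames needed to hold `inputFrames` after resampling
/// from `inputRate` to `outputRate`, rounded up to a whole frame.
/// (Hypothetical helper; AVAudioFrameCount is a UInt32 in AVFoundation.)
func outputFrameCapacity(inputFrames: UInt32, inputRate: Double, outputRate: Double) -> UInt32 {
    // Multiply before dividing so that exact ratios stay exact in floating point.
    let exact = Double(inputFrames) * outputRate / inputRate
    return UInt32(exact.rounded(.up))
}

// 0.1 seconds of audio at 44100 Hz is 4410 frames, which becomes 1600 frames at 16000 Hz.
let capacity = outputFrameCapacity(inputFrames: 4410, inputRate: 44100, outputRate: 16000)
print(capacity)
```

With the 3200-frame tap buffer requested above, the same calculation gives 1161 output frames.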

 

STEP 5: Speech recognition via WebSocket using ACP

In STEP 4, we implemented the process of acquiring audio data from the device's microphone. Next, we need to send that audio data piece by piece and receive the recognition results via WebSocket. On watchOS we will use URLSessionWebSocketTask for this, but first we need to determine where and in what format the audio data needs to be sent.

According to the "I/F Specifications WebSocket Speech Recognition API Overview" on the ACP website, communication between the client and server is carried out using the following three packets:

  • Command Packet
  • Command Response Packet
  • Event Packet

A "command packet" is a packet used by a client to send a request to a server, and the server returns a "command response packet" to the client in response to that request. An "event packet" is used by the server to convey information to the client.

So how do we send the audio data acquired in STEP 4 to ACP? Let's look at the rules for "command packets" and their corresponding "command response packets." The diagram below shows the state transitions for "command packets" and "command response packets" when supplying audio data.

[Image: state transition diagram for command packets and command response packets]

 

Let's implement it according to the state transition diagram. The basic flow is: Establish connection with ACP => Send 's' command => Receive response to 's' command => Send 'p' command + audio data => Send 'e' command => Receive response to 'e' command.
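As a rough illustration of that ordering, the flow can be modelled as a small state machine. This is only a sketch based on the basic flow described above, not on the full ACP state transition diagram, and every type and case name here is hypothetical.

```swift
import Foundation

// Hypothetical model of the basic flow; only the ordering comes from the text above.
enum SessionState: Equatable {
    case connected        // WebSocket open, nothing sent yet
    case waitingForStart  // 's' sent, response not yet received
    case streaming        // 's' accepted, 'p' packets may be sent
    case waitingForEnd    // 'e' sent, response not yet received
    case finished         // response to 'e' received
}

enum SessionEvent {
    case sendStart, startAccepted, sendAudio, sendEnd, endAccepted
}

/// Returns the next state, or nil if the event is not allowed in the current state.
func nextState(from state: SessionState, on event: SessionEvent) -> SessionState? {
    switch (state, event) {
    case (.connected, .sendStart):           return .waitingForStart
    case (.waitingForStart, .startAccepted): return .streaming
    case (.streaming, .sendAudio):           return .streaming
    case (.streaming, .sendEnd):             return .waitingForEnd
    case (.waitingForEnd, .endAccepted):     return .finished
    default:                                 return nil
    }
}
```

A model like this makes out-of-order calls easy to detect; for example, trying to send audio before the 's' command yields nil.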

First, add a URLSessionWebSocketTask variable to the InterfaceController class:

InterfaceController.swift
import WatchKit
import Foundation
import AVFoundation

class InterfaceController: WKInterfaceController {

    var engine: AVAudioEngine!
    var webSocketTask: URLSessionWebSocketTask!

}

Next, use URLSessionWebSocketTask to establish a WebSocket connection with ACP.

let session = URLSession(configuration: .default)
// Generate a URL to connect to ACP
guard let url = URL(string: "wss://acp-api.amivoice.com/v1/") else { return }
webSocketTask = session.webSocketTask(with: url)
// Establish a connection with ACP
webSocketTask.resume()

After the connection is successfully established, an 's' command packet is sent via WebSocket to signal the start of audio supply. The format of the 's' command packet is as follows:

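s (audio format) (engine name) authorization=(AppKey) ...

Here, (audio format) is the format of the audio being sent, and (engine name) is the speech recognition engine to connect to; see the ACP documentation for the available values. You can check your AppKey on the ACP homepage under [My Page] -> Connection Information.

[Image: the Connection Information section of My Page]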

The implementation would look like this:

// 's' command (replace (AppKey) with your own AppKey)
let sendText = "s LSB16K -a-general authorization=(AppKey)"
// Send the 's' command
webSocketTask.send(.string(sendText)) { error in
    if let error = error {
        print("WebSocket sending error: \(error)")
    }
}
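If the audio format, engine, or AppKey need to vary, the command string could be assembled by a small helper instead of being hard-coded. The following is a sketch only; the function name and parameter labels are hypothetical, and the layout simply mirrors the example string above.

```swift
import Foundation

/// Builds the text of an 's' command packet.
/// (Hypothetical helper; the field order follows the example in the article.)
func makeStartCommand(audioFormat: String, engine: String, appKey: String) -> String {
    return "s \(audioFormat) \(engine) authorization=\(appKey)"
}

// "MY_APP_KEY" is a placeholder, not a real key.
let startCommand = makeStartCommand(audioFormat: "LSB16K", engine: "-a-general", appKey: "MY_APP_KEY")
print(startCommand)
```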

Furthermore, we need to receive a response to the message we sent, so we will implement it as shown below. If there are no errors, we will receive a String type text.

// Receive a response to the data sent
webSocketTask.receive { [self] result in
    switch result {
    case .failure(let error):
        print("Failed to receive message: \(error)")
    case .success(let message):
        switch message {
        // If reception was successful
        case .string(let text):
            print("Received: \(text)")
        // Binary messages are not used here
        case .data:
            break
        @unknown default:
            fatalError()
        }
    }
}

In addition, the following TEXT data will be received as a response to the 's' command packet sent.

s
s (error message)

If there is an error in the response, the error message will be added to the 's' command separated by a space. To distinguish between success and failure, the following implementation is required for the received TEXT data.

var command = text
var body = ""

// Separating received text into command and body
if let targetIdx = text.firstIndex(of: " ") {
    command = String(text[text.startIndex..<targetIdx])
    body = String(text[text.index(targetIdx, offsetBy: 1)..<text.endIndex])
}
switch command {
// The command is 's'
case "s":
    if body != "" {
        print("s error: ", body)
    }
    break
default:
    break
}
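To see the success/failure rule on concrete inputs, the same split can be written as a standalone function. This is a sketch for illustration; the function name is hypothetical and the error text in the usage lines is made up.

```swift
import Foundation

/// Splits a received text packet at the first space.
/// A bare "s" yields an empty body (success); "s some message" yields the message as the body.
func splitPacket(_ text: String) -> (command: String, body: String) {
    guard let space = text.firstIndex(of: " ") else {
        return (text, "")
    }
    let command = String(text[..<space])
    let body = String(text[text.index(after: space)...])
    return (command, body)
}

let success = splitPacket("s")
let failure = splitPacket("s something went wrong")
print(success.body.isEmpty, failure.body)
```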

Putting the above together in InterfaceController gives the following. Call resume() first when you start supplying audio.

InterfaceController.swift
func resume() {
    // Completion handlers are delivered on the main queue
    let session = URLSession(configuration: .default, delegate: nil, delegateQueue: OperationQueue.main)
    guard let url = URL(string: "wss://acp-api.amivoice.com/v1/") else { return }
    webSocketTask = session.webSocketTask(with: url)
    webSocketTask.resume()
    // 's' command (replace (AppKey) with your own AppKey)
    let sendText = "s LSB16K -a-general authorization=(AppKey)"
    // Send the 's' command
    webSocketTask.send(.string(sendText)) { error in
        if let error = error {
            print("WebSocket sending error: \(error)")
        }
    }
    receiveWebSocket()
}

func receiveWebSocket() {
    // Receive a response to the data sent
    webSocketTask.receive { [self] result in
        switch result {
        case .failure(let error):
            print("Failed to receive message: \(error)")
        case .success(let message):
            switch message {
            case .string(let text):
                classify(text: text)
            // Binary messages are not used here
            case .data:
                break
            @unknown default:
                fatalError()
            }
            // .receive is called only once, so it must be called recursively
            self.receiveWebSocket()
        }
    }
}

func classify(text: String) {
    print("text: ", text)

    var command = text
    var body = ""

    // Separating received text into command and body
    if let targetIdx = text.firstIndex(of: " ") {
        command = String(text[text.startIndex..<targetIdx])
        body = String(text[text.index(targetIdx, offsetBy: 1)..<text.endIndex])
    }
    switch command {
    // The command is 's'
    case "s":
        if body != "" {
            print("s error: ", body)
        }
        break
    default:
        break
    }
}

This completes the implementation of sending the 's' command packet via WebSocket as a signal to start supplying audio after successfully establishing a connection. Next, we will implement the part that actually supplies audio using the 'p' command packet. The 'p' command packet is sent in the following binary format.

p(audio data)

The audio data acquired in STEP 4 is prefixed with 0x70 (the ASCII code for 'p') and sent sequentially in binary frames. The implementation is as follows:

// Audio data obtained from the device's microphone input
var data_ = data!
// Insert the ASCII code of p at the beginning
data_.insert(0x70, at: 0) // p
// Send with data type
webSocketTask.send(.data(data_)) { error in
    if let error = error {
        print("WebSocket sending error: \(error)")
    }
}
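The framing step can also be isolated into a pure function, which makes it easy to check without a network connection. This is a sketch with a hypothetical function name; it builds a new Data value rather than inserting into the existing one, which is a design choice of this example and not something the article prescribes.

```swift
import Foundation

/// Returns `audio` with 0x70 (ASCII 'p') prepended, ready to be sent as one binary frame.
/// (Hypothetical helper.)
func makeAudioPacket(from audio: Data) -> Data {
    var packet = Data([0x70])
    packet.append(audio)
    return packet
}

// Three dummy bytes standing in for PCM audio.
let packet = makeAudioPacket(from: Data([0x01, 0x02, 0x03]))
print(packet.count)
```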

Similarly to the 's' command response packet, the following TEXT type data is received as the 'p' command response packet.

p (error message)

The 'p' command response packet is returned only if an error occurs. Putting the above together in InterfaceController gives the following.

InterfaceController.swift
func setup() {
    self.engine = AVAudioEngine()

    // Convert AVAudioPCMBuffer to Data
    func toData(buffer: AVAudioPCMBuffer) -> Data {
        // ... (unchanged from STEP 4)
        return ch0Data
    }

    // ... (formats and converter unchanged from STEP 4)

    self.engine.inputNode.installTap(onBus: AVAudioNodeBus(0), bufferSize: AVAudioFrameCount(3200), format: inputFormat) { (pcmBuffer, time) in
        // ... (conversion unchanged from STEP 4)

        // Send data to the server
        self.send(data: toData(buffer: outputBuffer))
    }
}

func classify(text: String) {
    print("text: ", text)

    var command = text
    var body = ""

    // Separating received text into command and body
    if let targetIdx = text.firstIndex(of: " ") {
        command = String(text[text.startIndex..<targetIdx])
        body = String(text[text.index(targetIdx, offsetBy: 1)..<text.endIndex])
    }
    switch command {
    // The command is 's'
    case "s":
        // ... (as before)
        break
    // The command is 'p'
    case "p":
        if body != "" {
            print("p error: ", body)
        }
        break
    default:
        break
    }
}

func send(data: Data) {
    // Audio data obtained from the device's microphone input
    var data_ = data
    // Insert the ASCII code of p at the beginning
    data_.insert(0x70, at: 0) // p
    // Send with data type
    webSocketTask.send(.data(data_)) { error in
        if let error = error {
            print("WebSocket sending error: \(error)")
        }
    }
}

 Finally, send the 'e' command packet to notify the server that the audio data transmission has finished. The 'e' command packet is sent in the following text format.

e

Sending the 'e' command packet and receiving the response packet are implemented in the same way as the 's' command, as shown below.

InterfaceController.swift
func disconnect() {
    // 'e' command
    let sendText = "e"
    // Send 'e' command packet
    webSocketTask.send(.string(sendText)) { error in
        if let error = error {
            print("WebSocket sending error: \(error)")
        }
    }
}

func classify(text: String) {
    print("text: ", text)

    var command = text
    var body = ""

    // Separating received text into command and body
    if let targetIdx = text.firstIndex(of: " ") {
        command = String(text[text.startIndex..<targetIdx])
        body = String(text[text.index(targetIdx, offsetBy: 1)..<text.endIndex])
    }
    switch command {
    // The command is 's'
    case "s":
        // ... (as before)
        break
    // The command is 'p'
    case "p":
        // ... (as before)
        break
    // The command is 'e'
    case "e":
        if body != "" {
            print("e error: ", body)
        }
        break
    default:
        break
    }
}

This completes the implementation of the command packet and its response packet. Finally, we will implement the part that receives the event packet.

Event packets report two kinds of status:

  • Speech detection
  • Speech recognition processing

Since the only purpose this time is to display the speech recognition results, we will focus on the event packets related to the speech recognition processing status. When recognition completes and the result is finalized, the server sends an 'A' event packet; we take the speech recognition result from it and display it on the Watch screen.

The 'A' event packet is received as text whose body is JSON, in the following format:

A (JSON)

Example of the JSON body:
{ "results":[ {"tokens":[ {"written":"www", "confidence":1.00, "starttime":16020, "endtime":16916, "spoken":"\u3068\u308a\u3077\u308b\u3060\u3076\u308b" } ], "confidence":0.997, "starttime":15700, "endtime":17188, "tags":[], "rulename":"", "text":"www" } ], "utteranceid":"20191127/ja_ja-amivoicecloud-16k-user@016ead249db00a3011a68536-1127_225504", "text":"www", "code":"", "message":"" }

This JSON has a 'text' key (the text of the recognition result) and a 'code' key (the validity of the recognition result: an empty string means recognition succeeded, anything else means it failed). The 'A' event packet is processed in classify(text:), which is called from the receive(completionHandler:) block, as shown below.

InterfaceController.swift
func classify(text: String) {
    print("text: ", text)

    var command = text
    var body = ""
    // Separating received text into command and body
    if let targetIdx = text.firstIndex(of: " ") {
        command = String(text[text.startIndex..<targetIdx])
        body = String(text[text.index(targetIdx, offsetBy: 1)..<text.endIndex])
    }
    switch command {
    // The command is 's'
    case "s":
        // ... (as before)
        break
    // The command is 'p'
    case "p":
        // ... (as before)
        break
    // The command is 'e'
    case "e":
        // ... (as before)
        break
    // The command is 'A'
    case "A":
        print("-> A")
        let data = body.data(using: .utf8)!
        do {
            // Convert json to Dictionary type
            let dec = try JSONSerialization.jsonObject(with: data, options: []) as? [String: Any]
            // Determine whether recognition was successful or not based on the value of the code key
            let code = dec!["code"] as! String
            if (code != "") {
                print("Error: ", code)
                return
            }
            // Get the recognition result
            let rs = dec!["text"] as! String
            // Add to the previous recognition result
            self.resultText += rs + "\n"
            // Assign the added recognition result to the label
            self.resultLabel.setText(resultText)
        } catch {
            print(error.localizedDescription)
        }
        break
    default:
        break
    }
}
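As an alternative to JSONSerialization with force casts, the two keys used here could be decoded with Codable. This is a sketch under the assumption that 'text' and 'code' are always present as strings, as in the example JSON above; the type and function names are hypothetical.

```swift
import Foundation

// Only the two keys the article uses are modelled; all other keys are ignored.
struct RecognitionEvent: Decodable {
    let text: String
    let code: String
}

/// Returns the recognized text, or nil when decoding fails or `code` is non-empty.
/// (Hypothetical helper.)
func recognizedText(fromEventBody body: String) -> String? {
    guard let event = try? JSONDecoder().decode(RecognitionEvent.self, from: Data(body.utf8)),
          event.code.isEmpty else {
        return nil
    }
    return event.text
}

// A shortened version of the example body shown earlier.
let sampleBody = "{\"results\":[],\"text\":\"www\",\"code\":\"\",\"message\":\"\"}"
print(recognizedText(fromEventBody: sampleBody) ?? "(no result)")
```

Returning nil instead of crashing means a malformed or failed event is simply skipped rather than terminating the app.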

Now that the implementation is complete, let's check whether speech recognition is actually possible using the simulator.

 

Finally

This time, I tried my hand at "real-time speech recognition on Apple Watch using WebSocket + AVAudioEngine." ACP is an amazing service!!! It's easy to implement in iOS apps, so please give it a try!

Full source code

InterfaceController.swift
import WatchKit
import Foundation
import AVFoundation

class InterfaceController: WKInterfaceController {
    @IBOutlet weak var recordButton: WKInterfaceButton!
    @IBOutlet weak var resultLabel: WKInterfaceLabel!
    var engine: AVAudioEngine!
    var webSocketTask: URLSessionWebSocketTask!
    // Microphone button flag
    var isStart = true
    // Recognition result
    var resultText = ""

    override func awake(withContext context: Any?) {
        // Configure interface objects here.
        setup()
    }

    // MARK: - Button

    // Called when the microphone button is tapped
    @IBAction func tapRecord() {
        if isStart {
            resume()
            start()
        } else {
            stop()
        }
        isStart.toggle()
    }

    // start record
    func start() {
        self.engine.prepare()
        do {
            try engine.start()
        } catch {
            return
        }
    }

    // stop record
    func stop() {
        self.engine.stop()
    }

    // MARK: - AVAudioEngine

    func setup() {
        self.engine = AVAudioEngine()

        // Convert AVAudioPCMBuffer to Data
        func toData(buffer: AVAudioPCMBuffer) -> Data {
            let channelCount = 1 // Given PCMBuffer channel count is 1
            let channels = UnsafeBufferPointer(start: buffer.int16ChannelData, count: channelCount)
            // Use frameLength (frames actually written), not frameCapacity, so no stale bytes are included
            let ch0Data = NSData(bytes: channels[0], length: Int(buffer.frameLength * buffer.format.streamDescription.pointee.mBytesPerFrame)) as Data
            return ch0Data
        }

        // Output format
        guard let outputFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16000, channels: AVAudioChannelCount(1), interleaved: true) else { return }
        // Input format
        let inputFormat = self.engine.inputNode.outputFormat(forBus: 0)
        // Converter
        let converter = AVAudioConverter(from: inputFormat, to: outputFormat)

        self.engine.inputNode.installTap(onBus: AVAudioNodeBus(0), bufferSize: AVAudioFrameCount(3200), format: inputFormat) { (pcmBuffer, time) in
            // Hand the captured buffer to the converter
            let inputBlock: AVAudioConverterInputBlock = { inNumPackets, outStatus in
                outStatus.pointee = .haveData
                return pcmBuffer
            }

            let sampleRateRatio = 16000 / inputFormat.sampleRate
            let outputFrameCapacity = AVAudioFrameCount(Double(pcmBuffer.frameCapacity) * sampleRateRatio)
            guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: outputFrameCapacity) else { return }
            var error: NSError? = nil
            // convert input format to output format
            converter?.convert(to: outputBuffer, error: &error, withInputFrom: inputBlock)
            // Send data to the server
            self.send(data: toData(buffer: outputBuffer))
        }
    }

    // MARK: - WebSocket

    func disconnect() {
        // 'e' command
        let sendText = "e"
        // Send 'e' command packet
        webSocketTask.send(.string(sendText)) { error in
            if let error = error {
                print("WebSocket sending error: \(error)")
            }
        }
    }

    func send(data: Data) {
        // Audio data obtained from the device's microphone input
        var data_ = data
        // Insert the ASCII code of p at the beginning
        data_.insert(0x70, at: 0) // p
        // Send with data type
        webSocketTask.send(.data(data_)) { error in
            if let error = error {
                print("WebSocket sending error: \(error)")
            }
        }
    }

    func resume() {
        // Completion handlers are delivered on the main queue
        let session = URLSession(configuration: .default, delegate: nil, delegateQueue: OperationQueue.main)
        guard let url = URL(string: "wss://acp-api.amivoice.com/v1/") else { return }
        webSocketTask = session.webSocketTask(with: url)
        webSocketTask.resume()
        // 's' command (replace (AppKey) with your own AppKey)
        let sendText = "s LSB16K -a-general authorization=(AppKey)"
        // Send the 's' command
        webSocketTask.send(.string(sendText)) { error in
            if let error = error {
                print("WebSocket sending error: \(error)")
            }
        }
        receiveWebSocket()
    }

    func receiveWebSocket() {
        // Receive a response to the data sent
        webSocketTask.receive { [self] result in
            switch result {
            case .failure(let error):
                print("Failed to receive message: \(error)")
            case .success(let message):
                switch message {
                case .string(let text):
                    classify(text: text)
                // Binary messages are not used here
                case .data:
                    break
                @unknown default:
                    fatalError()
                }
                // .receive is called only once, so it must be called recursively
                self.receiveWebSocket()
            }
        }
    }

    func classify(text: String) {
        print("text: ", text)

        var command = text
        var body = ""
        // Separating received text into command and body
        if let targetIdx = text.firstIndex(of: " ") {
            command = String(text[text.startIndex..<targetIdx])
            body = String(text[text.index(targetIdx, offsetBy: 1)..<text.endIndex])
        }

        switch command {
        // The command is 's'
        case "s":
            if body != "" {
                print("s error: ", body)
            }
            break
        // The command is 'p'
        case "p":
            if body != "" {
                print("p error: ", body)
            }
            break
        // The command is 'e'
        case "e":
            if body != "" {
                print("e error: ", body)
            }
            break
        // The command is 'A'
        case "A":
            print("-> A")
            let data = body.data(using: .utf8)!
            do {
                // Convert json to Dictionary type
                let dec = try JSONSerialization.jsonObject(with: data, options: []) as? [String: Any]
                // Determine whether recognition was successful or not based on the value of the code key
                let code = dec!["code"] as! String
                if (code != "") {
                    print("Error: ", code)
                    return
                }
                // Get the recognition result
                let rs = dec!["text"] as! String
                // Add to the previous recognition result
                self.resultText += rs + "\n"
                // Assign the added recognition result to the label
                self.resultLabel.setText(resultText)
            } catch {
                print(error.localizedDescription)
            }
            break
        default:
            break
        }
    }
}


Reference

About ACP

(AmiVoice Cloud Platform)

Packets and State Transitions

AVAudioEngine related

AVAudioEngine

Building Modern Audio Apps with AVAudioEngine

Source code referenced for AVAudioEngine

WebSocket related

URLSessionWebSocketTask

How to use the URLSessionWebSocketTask in Swift. Post WWDC deep-dive review.

Person who wrote this article

  • Masaki Hayashi

    I am developing iOS/WatchOS apps using Swift.

     

 
