Three types of speech recognition APIs on the AmiVoice Cloud Platform (asynchronous HTTP speech recognition API has been newly added)

Published: 2021.10.08 Last updated: 2025.03.07

Shogo Ando

Hello everyone.

We provide a cloud speech recognition API for software developers (AmiVoice Cloud Platform). Until now, we have offered two types of API that use different protocols depending on the purpose. We have now added another, for a total of three.

This article explains all three types of API, focusing on the newly added one.

Which API has been added?

The API added this time is "Asynchronous HTTP".

Until now, there were two APIs: the "WebSocket Speech Recognition API" for streaming (real-time) speech recognition and the "HTTP Speech Recognition API" for batch speech recognition. The "HTTP Speech Recognition API" could only recognize audio files up to 16MBytes, but the newly added API can recognize files larger than 16MBytes.

In addition, because the newly added API is called "Asynchronous HTTP", the previous "HTTP Speech Recognition API" has been renamed to "Synchronous HTTP".

In summary, the AmiVoice Cloud Platform now has the following three types of APIs:

  • WebSocket Speech Recognition API
  • (Name change) Synchronous HTTP Speech Recognition API
  • (NEW!) Asynchronous HTTP Speech Recognition API

I will briefly explain each one.

WebSocket

WebSocket can convert audio streams into text in real time, which makes it suitable for applications that need to use speech recognition results as they arrive, such as:

  • Convert call center conversations into text in real time
  • Real-time transcription of meeting remarks
  • Voice control of smartphones and IoT devices
  • Speech dialogue system

With WebSocket, you can obtain speech recognition results while the speaker is still talking, or obtain final results immediately after the end of speech is detected. In exchange, you need to send the audio data as a stream. This takes a bit more effort to implement, such as handling the audio data in binary format and controlling the recording device as necessary.
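As a rough illustration of what "sending the audio data as a stream" involves on the client side, the sketch below splits binary audio into fixed-size chunks that could then be sent one by one. It is only a minimal sketch: the function name `iter_audio_chunks`, the chunk size of 4096 bytes, and the dummy audio data are assumptions made for illustration, and the actual message format of the WebSocket Speech Recognition API is not shown here.

```python
import io
from typing import BinaryIO, Iterator


def iter_audio_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive binary chunks from an audio stream until it is exhausted.

    chunk_size is an illustrative default, not a value required by any API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means the end of the stream
            break
        yield chunk


# Stand-in for recorded audio: 10,000 bytes of silence (hypothetical data).
fake_audio = io.BytesIO(b"\x00" * 10_000)
chunks = list(iter_audio_chunks(fake_audio, chunk_size=4096))
print([len(c) for c in chunks])
```

In a real application, each chunk would be passed to the WebSocket client of your choice as a binary message instead of being collected in a list.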

(Name change) Synchronous HTTP

Synchronous HTTP converts audio files into text. It works very simply: you send the audio file, speech recognition is performed, and the results are returned once processing is complete. It is suitable for converting short audio files into text, for example:

  • Convert short audio files such as voice memos and voicemails into text
  • PoC of systems using voice recognition and evaluation of voice recognition accuracy

For comparison with Asynchronous HTTP, which is explained next, here is a brief outline of the operation sequence. The Synchronous HTTP sequence is as follows:

With Synchronous HTTP, the application has to wait from the time the audio file is sent until the speech recognition process is complete. The session also needs to stay connected during this time, and if it is disconnected midway, you have to start over. For this reason, we have set an upper limit (16MBytes) on the size of the audio file that can be sent.

(NEW!) Asynchronous HTTP

Like Synchronous HTTP, Asynchronous HTTP converts audio files into text, but the process is slightly different. It is suitable for converting long audio files, or a large number of audio files, into text, for example:

  • Converting call center call recordings into text
  • Converting meeting recording audio files into text
  • Convert video files to text and create subtitles

The Asynchronous HTTP sequence is as follows:

When you send an audio file, a value called sessionid is returned. You can then use this sessionid to check the status of the speech recognition process and obtain the speech recognition results.

With Asynchronous HTTP, a response is returned immediately when you call the API. Therefore, there is no need to keep a session open, and large audio files exceeding 16MBytes can also be processed.
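The "submit, then check status with the sessionid" flow described above can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions: `wait_for_result`, `fetch_job`, the status strings "processing", "completed" and "error", and the "text" field are all hypothetical names chosen for illustration, and the HTTP call itself is injected as a function so that no real endpoint or response format is implied. Please check the I/F specifications for the actual values.

```python
import time
from typing import Callable, Dict


def wait_for_result(
    session_id: str,
    fetch_job: Callable[[str], Dict[str, str]],
    interval_sec: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Poll a job by its session id until it finishes, then return its text.

    fetch_job is a caller-supplied function that asks the server for the
    current state of the job (hypothetical shape: {"status": ..., "text": ...}).
    """
    for _ in range(max_polls):
        job = fetch_job(session_id)
        if job["status"] == "completed":
            return job["text"]
        if job["status"] == "error":
            raise RuntimeError(f"recognition failed for session {session_id}")
        time.sleep(interval_sec)  # wait before asking again
    raise TimeoutError(f"session {session_id} did not finish in time")


# Fake server responses, used only to demonstrate the loop (hypothetical data).
_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])


def fake_fetch_job(session_id: str) -> Dict[str, str]:
    return next(_responses)


result = wait_for_result("example-session-id", fake_fetch_job, interval_sec=0.0)
print(result)
```

Because the application is not blocked on a single long-lived connection, it can do other work or submit further files between status checks.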

Synchronous HTTP or Asynchronous HTTP: which should you use?

When recognizing an audio file, should you use synchronous or asynchronous voice recognition? Please refer to the following.

  • Synchronous HTTP
    It is relatively easy to implement, so it is suitable for handling small audio files or for trial purposes. However, please note that it does not support files larger than 16MBytes.
  • Asynchronous HTTP
    It takes somewhat more effort to implement, but it supports large audio files, which was not possible before (*1). This makes it ideal for those who want to convert audio from long calls or meetings into text. Of course, it can also handle small audio files.
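The size-based part of this decision can be written down as a small helper. This is only a sketch: the function name `choose_http_api` is hypothetical, and interpreting "16MBytes" as 16 × 1024 × 1024 bytes is an assumption, so please confirm the exact limit in the manual.

```python
# Assumption: "16MBytes" is read here as 16 * 1024 * 1024 bytes.
SYNC_LIMIT_BYTES = 16 * 1024 * 1024


def choose_http_api(file_size_bytes: int) -> str:
    """Return which HTTP API fits an audio file of the given size."""
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must not be negative")
    if file_size_bytes <= SYNC_LIMIT_BYTES:
        # Small enough for Synchronous HTTP; Asynchronous HTTP would also work.
        return "synchronous"
    # Too large for Synchronous HTTP, so Asynchronous HTTP is the only option.
    return "asynchronous"


print(choose_http_api(2 * 1024 * 1024))    # e.g. a short voice memo
print(choose_http_api(200 * 1024 * 1024))  # e.g. a long meeting recording
```

File size is not the only factor. If you have a large number of files to process, Asynchronous HTTP may be the better fit even when each file is small.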

Details

The manuals required for actual development are listed below. Please refer to them for details.

I/F Specifications Asynchronous HTTP Speech Recognition API Overview – AmiVoice Cloud Platform

In closing

This time, I have explained the three current APIs, including Asynchronous HTTP, which was added in October 2021. We would like to continue providing APIs that are easy to use for each application. If you have any opinions or requests, please send them to us in the comments.

About the author

Shogo Ando

While researching speech recognition, I found a speech recognition company nearby and joined it, where I continue to work to this day.

My hobbies are traveling abroad, eating delicious food, and saunas.

*1: In fact, it was already possible to process large audio files by streaming them to the WebSocket Speech Recognition API. However, this method can place an unexpectedly high load on the speech recognition server, so we recommend using Asynchronous HTTP instead.
