We measured the speech recognition accuracy (speech recognition rate) of the AmiVoice speech recognition engine.

Shogo Ando
Hello everyone.
A common question from AmiVoice users is, "What percentage is AmiVoice's speech recognition accuracy (speech recognition rate)?" However, I think that not many people know what speech recognition accuracy actually is or how to measure it.
This time, we will actually measure speech recognition accuracy using recognition results from AmiVoice's speech recognition API, AmiVoice Cloud Platform ( https://acp.amivoice.com/ ).
The method for calculating speech recognition accuracy is explained in the following article on the AmiVoice Cloud Platform Tech Blog: "Regarding the 'recognition accuracy (recognition rate)' of speech recognition".
Testing speech recognition accuracy
There are five steps to measuring speech recognition accuracy:
1. Prepare the audio
2. Transcribe the audio to create the correct answer
3. Run speech recognition to convert the audio to text
4. Unify the notation of the speech recognition results and correct orthographic variations
5. Calculate speech recognition accuracy (compare the results of steps 2 and 4)
So let's get started.
1. Prepare the audio
First, prepare the audio to measure the speech recognition accuracy. In this example, I prepared the following audio.
■Audio provided this time
- Content: Call center complaint voice (acted)
- Format: Stereo (operator voice on the right channel, customer voice on the left channel)*1
- Audio length: Approximately 43.5 seconds*2
- Sampling rate: 16kHz
- Bit depth: 16 bit
Here's the actual audio file.
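Before measuring, it can help to confirm that a file really has the format listed above. The following is a minimal sketch using Python's standard `wave` module. The helper name `describe_wav` is an assumption made for illustration, and the demonstration uses a synthetic one-second silent clip rather than the actual recording.

```python
import io
import wave


def describe_wav(source):
    """Return basic format facts for a WAV file (a path or a file-like object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "sampling_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Demonstration on a synthetic clip: 1 second of silence, stereo, 16 bit, 16 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00" * (16000 * 2 * 2))  # frames * channels * bytes per sample
buffer.seek(0)

info = describe_wav(buffer)
print(info)
```

For a real recording, pass the file path instead of the in-memory buffer.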
2. Transcribe the audio and create the answer
Transcribe the prepared audio.*3 The transcript is below.
Since the audio this time is acted, a script exists, but normally you would listen to the audio and transcribe it.
■Manually Transcribed Correct Answers
Operator: Thank you for calling. This is Tanaka from the Insurance Call Center.
Customer: Hello.
Operator: Yes.
Customer: Kobayashi Insurance.
Operator: Yes. This is Kobayashi Insurance.
Customer: So where are you from? Yokohama.
Operator: This is the call center at our Tokyo headquarters.
Customer: Oh, is that so?
Operator: Yes.
Customer: I have a quick question.
Operator: Here you go.
Customer: I, um, applied to cancel my policy the other day.
Operator: Yes.
Customer: About three or four months ago.
Operator: Yes.
Customer: And you said, "Okay, I understand," but I thought you'd just canceled it or were going through the process, or that it was too late for that.
Operator: Yes.
Customer: And so they still haven't deducted the money from it since then.
Operator: I'm sorry.
Customer: Yeah. At that time...
Operator: Yes.
Customer: The person who did that was Yamaguchi-san.
Operator: This is Yamaguchi.
3. Speech recognition and text conversion
Run the speech recognition. This time, in addition to AmiVoice Cloud Platform, we also tried speech recognition with Google Cloud Speech-to-Text for comparison.
■AmiVoice Cloud Platform settings:
- The voice recognition engine used was "Conversation_General".
- We used the WebSocket version of speech recognition (Wrp).
- The stereo audio file was converted into two mono audio files, one for the left and one for the right channel, and each was subjected to voice recognition.
■Google Cloud Speech-to-Text settings:
- We used the C# client library (Google.Cloud.Speech.V1 v2.1.0).
- We used asynchronous speech recognition.
- The recognition model used was default. (Phone-call would be more appropriate, but as of March 2021, it is not available in Japanese.)
- The stereo audio file was converted into two mono audio files, one for the left and one for the right channel, and each was subjected to voice recognition.
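For both engines, the stereo recording was split into one mono file per channel. The article does not say which tool was used for this, so the following is only a sketch of one way to do it with Python's standard library, assuming a 16-bit stereo WAV file. The function names `deinterleave` and `split_stereo` and the file paths in the usage note are hypothetical.

```python
import wave
from array import array


def deinterleave(frames):
    """Split interleaved 16-bit stereo PCM bytes into (left, right) mono bytes."""
    samples = array("h")  # 2-byte items; sample values are never interpreted
    samples.frombytes(frames)
    # Stereo WAV data is stored as L0 R0 L1 R1 ..., so even items are left.
    return samples[0::2].tobytes(), samples[1::2].tobytes()


def split_stereo(src_path, left_path, right_path):
    """Write the left and right channels of a 16-bit stereo WAV as mono files."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    left, right = deinterleave(frames)
    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(data)
```

A call such as `split_stereo("call.wav", "customer.wav", "operator.wav")` would then produce the customer (left) and operator (right) files that are sent to each engine separately.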
The voice recognition result is as follows:*4
■Results from AmiVoice Cloud Platform
Operator: Thank you. This is Tanaka from the Insurance Call Center.
Customer: Hello.
Operator: Yes.
Customer: Kobayashi Insurance.
Operator: Yes, Kobayashi Insurance.
Customer: So where are you from? Yokohama
Operator: This is the call center at our Tokyo headquarters.
Customer: I see.
Operator: Yes.
Customer: I have a quick question.
Operator: Please go ahead.
Customer: I applied to cancel the policy the other day.
Operator: Yes.
Customer: About three or four months ago.
Operator: Yes.
Customer: So I said, "Okay, I understand." I thought you had just canceled the policy or were still going through the process, which seemed a bit late, but
Operator: Yes.
Customer: And they still haven't deducted the money from the policy since then.
Operator: I'm sorry.
Customer: Then, the person who did that was someone called Yamaguchi.
Operator: Yamaguchi, correct?
■Results from Google Cloud Speech-to-Text
Operator: Thank you. This is Tanaka from the insurance call center.
Customer: Hello.
Operator: Yes.
Customer: Kobayashi Insurance.
Operator: Yes, Kobayashi Insurance.
Customer: So where are you from? Yokohama
Operator: This is the call center at our Tokyo headquarters.
Customer: Oh, is that so?
Operator: Yes.
Customer: Let me just ask you something.
Operator: Yes, please.
Customer: I applied to cancel the contract the other day.
Operator: Yes.
Customer: About three or four months ago.
Operator: Yes.
Customer: Well, I said, "Okay, I understand." I thought you had just canceled the contract or were still going through the process. I thought it was a bit late.
Operator: Yes.
Customer: That's the reason why I haven't been deducted since then.
Operator: I'm sorry.
Customer: The person who did that at the time was someone named Yamaguchi.
Operator: It's Yamaguchi.
Both results are close to the correct answer, and it is clear that speech recognition was performed with very high accuracy.*5
4. Unify the notation of the speech recognition results and correct orthographic variations
Once the voice recognition results are complete, take a closer look at the results. You will notice that the characteristics of the voice recognition results from AmiVoice and Google are slightly different.
| Item | AmiVoice results | Google results |
|---|---|---|
| Punctuation | Automatically inserted | Not inserted |
| Fillers | Removed | Not removed |
Fillers are meaningless phrases such as "um" or "ah" that fill gaps in conversation.
Whether punctuation is inserted and whether fillers are removed are features of the speech recognition engine and do not directly reflect recognition accuracy. Therefore, if you want to calculate, and especially compare, speech recognition accuracy, you need to handle punctuation and fillers under the same conditions for every engine. Here, we will use the following conditions.
- Punctuation: Removed from both the transcription and the speech recognition results.
- Fillers: Removed from both the transcription and the speech recognition results.
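The two conditions above can be sketched as a small text-normalization step applied to the transcription and to every engine's output before comparison. This is an illustrative sketch only: the punctuation set and the filler list are assumptions, not the lists used in this measurement, and removing fillers by plain substring matching is naive because it can also hit the same characters inside ordinary words.

```python
# Assumed punctuation set: Japanese and ASCII marks plus half- and full-width spaces.
PUNCTUATION = set("、。！？!?,. \u3000")

# Hypothetical filler list for illustration; a real list would be much longer.
FILLERS = ("えーと", "あのー", "えー", "あー")


def normalize(text, fillers=FILLERS):
    """Remove fillers and punctuation so that only the spoken content is compared."""
    # Remove longer fillers first so that "えーと" is not left as a stray "と".
    for filler in sorted(fillers, key=len, reverse=True):
        text = text.replace(filler, "")
    return "".join(ch for ch in text if ch not in PUNCTUATION)


print(normalize("えーと、はい。"))      # はい
print(normalize("はい、そうです。"))    # はいそうです
```

Applying the same function to both sides keeps the comparison fair, whichever engine inserts punctuation or keeps fillers.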
AmiVoice can also be configured not to remove fillers. For details, search for "filler" in the manual: Manual Archive – AmiVoice Cloud Platform.
Additionally, when measuring speech recognition accuracy, it is necessary to take into account "orthographic variations." Orthographic variations refer to words or phrases that have the same pronunciation and meaning but are written differently.
Specifically, for example, the following:
- Transcription: About 3 months or 4 months ago.
- AmiVoice: About 3 months or 4 months ago.
- Google: About 3 months or 4 months ago
All three say the same thing, but in the Japanese text the transcription writes the "ka (ge)" part of "months" with a different character than AmiVoice and Google do. If you measure the speech recognition accuracy in this state, that part will be counted as an error, so it needs to be corrected. In this case, you can simply correct the transcription to match the spelling used by the engines.
There are other places like this too.
- Transcription: And I'm still stuck ... "wake" (written in kanji, 訳).
- AmiVoice: I'm still hung up on that one ... "wake" (written in hiragana, わけ).
- Google: Still not impressed ... "the reason"
The transcription writes "wake" in kanji, but AmiVoice writes it in hiragana. It seems reasonable to treat the hiragana spelling as correct, so we will correct the transcription to "わけ".
The remaining question is whether Google's rendering of "wake" as "reason" should be treated as correct or incorrect. Reading "reason" as "wake" is a so-called ateji (an assigned reading) and is not the standard reading, so we will treat it as a misrecognition here.
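Corrections for orthographic variations like these can be collected in a replacement table and applied to the transcription before scoring. The sketch below assumes a simple string-replacement table. The entries in `VARIANTS` are hypothetical examples of spellings that share a pronunciation and meaning, not the actual corrections used in this measurement.

```python
# Hypothetical variant table: "spelling in the transcription" -> "spelling to unify to".
VARIANTS = {
    "ヶ月": "か月",
    "下さい": "ください",
}


def unify_spelling(text, variants=VARIANTS):
    """Rewrite known orthographic variants to one agreed spelling."""
    for variant, unified in variants.items():
        text = text.replace(variant, unified)
    return text


print(unify_spelling("3ヶ月か4ヶ月くらい前"))  # 3か月か4か月くらい前
```

Blind replacement is only safe for entries that cannot appear inside unrelated words, so a table like this still needs manual review, as in the "wake" example.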
5. Calculate speech recognition accuracy (compare steps 2 and 4)
Finally, we tally up the recognition accuracy. For a volume like this, we could just count by hand, but since accuracy measurements usually involve thousands or tens of thousands of characters, Advanced Media has a dedicated measurement tool.*6 This time, we used that tool to measure.
This measurement was calculated using characters as the smallest unit.*7
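Advanced Media's measurement tool is not public, so the following is not that tool. It is a minimal sketch of the general technique such tools rely on: aligning the correct text and the recognition result character by character with edit distance (Levenshtein) dynamic programming, and counting substitutions, deletions, and insertions along a minimum-cost alignment. The function name `count_errors` is an assumption.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-cost alignment.

    Each table cell holds (total_errors, substitutions, deletions, insertions)
    for aligning a prefix of the reference with a prefix of the hypothesis.
    """
    # Row 0: an empty reference against hypothesis[:j] needs j insertions.
    prev = [(j, 0, 0, j) for j in range(len(hypothesis) + 1)]
    for i in range(1, len(reference) + 1):
        # Column 0: reference[:i] against an empty hypothesis needs i deletions.
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hypothesis) + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = prev[j - 1]  # characters match, no new error
            else:
                t, s, d, n = prev[j - 1]
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)  # reference character missing
            t, s, d, n = cur[j - 1]
            insertion = (t + 1, s, d, n + 1)  # extra hypothesis character
            # Tuples compare by total error count first, so min() keeps an optimal path.
            cur.append(min(diagonal, deletion, insertion))
        prev = cur
    _, substitutions, deletions, insertions = prev[-1]
    return substitutions, deletions, insertions


print(count_errors("abcd", "abd"))   # (0, 1, 0): "c" was deleted
print(count_errors("abc", "abxc"))   # (0, 0, 1): "x" was inserted
print(count_errors("abc", "axc"))    # (1, 0, 0): "b" became "x"
```

When several alignments have the same total cost, the split between error types can differ between tools, but the total number of errors does not.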
The voice recognition accuracy of AmiVoice is as follows:
- Correct answer length: 261 characters
- Insertion errors: 1 character
- Deletion errors: 6 characters
- Substitution errors: 1 character
- Speech recognition accuracy = (261-1-6-1)/261 ≒ 96.93%
Google's speech recognition accuracy is as follows:
- Correct answer length: 260 characters
- Insertion errors: 0 characters
- Deletion errors: 6 characters
- Substitution errors: 4 characters
- Speech recognition accuracy = (260-0-6-4)/260 ≒ 96.15%
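The arithmetic above can be checked with a few lines of code. This sketch simply applies the formula used in this article, (correct answer length minus all errors) divided by correct answer length, to the counts reported for each engine. The function name `character_accuracy` is an assumption.

```python
def character_accuracy(reference_chars, insertions, deletions, substitutions):
    """Accuracy = (N - I - D - S) / N, where N is the length of the correct text."""
    errors = insertions + deletions + substitutions
    return (reference_chars - errors) / reference_chars


amivoice = character_accuracy(261, insertions=1, deletions=6, substitutions=1)
google = character_accuracy(260, insertions=0, deletions=6, substitutions=4)

print(f"AmiVoice: {amivoice:.2%}")  # AmiVoice: 96.93%
print(f"Google:   {google:.2%}")    # Google:   96.15%
```

Because insertions are subtracted as well, this value can drop below zero when a result contains many extra characters.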
The correct answer length differs between AmiVoice and Google because, when correcting orthographic variations, the correct text was revised separately to match each engine's recognition results.
Comparison of recognition results between AmiVoice and Google
For this audio, we found that both AmiVoice and Google achieved very high speech recognition accuracy.
While we are at it, let's take a look at which parts were misrecognized.
- Transcript: I thought they just canceled it or are still going through the process.
- AmiVoice: I thought they just canceled it or are still going through the process.と
- Google: I thought they just canceled it or are still in the process of canceling it.
Here, only AmiVoice mistakenly inserted "と" at the end. This is an insertion error of 1 character.
- Transcription: Cha I thought it was slow
- AmiVoice: And On I thought it was slow
- Google: On I thought it was slow
The transcription reads "sore shi cha", but both AmiVoice and Google output "sore shi shite", so each has a one-character deletion error and a one-character substitution error.
- Transcription (AmiVoice): Yup, at that time
- AmiVoice: At that time
- Transcription (Google): Yup, at that time
- Google: Of sometimes
This is a little confusing, but AmiVoice and Google wrote the "toki" in "sono toki ne" ("at that time") with different spellings. I think both can be considered correct, so I corrected the transcription separately for each engine. The "un" ("yup") part was not recognized by either AmiVoice or Google, so each has a deletion error of two characters. *AmiVoice may have removed it as a filler.
Also, Google misrecognized "sonotoki" (that time) as "todoke", so it has a substitution error of one character.
- Transcription: Thank you for calling
- AmiVoice: Thank you
- Google: Thank you
Neither AmiVoice nor Google recognized the word "telephone" (the "for calling" part) here, so each has a deletion error of three characters. The beginning of the audio file was cut off slightly, so this may have been an unavoidable misrecognition.
Conclusion
Above, we have roughly explained the steps for calculating speech recognition accuracy.
The voice recognition accuracy for this test was quite high at over 96%, but this is likely due to the fact that the speaking style and sound quality were clear and the content of the speech was simple. Please note that in actual situations where voice recognition is used, it is common for recognition accuracy to be lower than this.
Measuring voice recognition accuracy is a complex subject, and there are still many details that I was unable to explain this time, so I would like to delve deeper into this topic in another blog post.
If you are a developer who has read this article and become interested in speech recognition technology or the AmiVoice Cloud Platform, please try it out at https://acp.amivoice.com/ . You can use up to 60 minutes of audio for free each month.
About the author

Shogo Ando
While researching speech recognition, I found a speech recognition company nearby and joined the company, where I continue to work to this day.
My hobbies are traveling abroad, eating delicious food, and saunas.
*1:This is a slightly unusual format, as it was recorded using a dedicated device from a telephone call. The sampling rate of the audio file is 16 kHz, but the left channel (customer) is equivalent to a sampling rate of 8 kHz, as the sound was transmitted via the telephone line. The right channel (operator) is directly connected from the receiver to the recording device, so the sound quality is better, and it was recorded at a sampling rate equivalent to 16 kHz.
*2:The audio is actually much longer, but to prevent it from becoming too long for the article, I have limited it to the first 43.5 seconds.
*3:Transcription is the process of listening to audio and writing down what you hear. Several near-synonymous terms are used for it, such as "audio transcription" and "tape transcription." At Advanced Media, we usually just call it "transcription."
*4:The voice recognition results for the right channel (operator side) and left channel (customer side) are formatted to match the timing of the speech to make them easier to read.
*5:These are the results of an operation performed on December 2, 2020. Please note that AmiVoice (and probably Google as well) is improving daily, so the results may not always be the same.
*6:It is difficult to make the tool public, but in the future we would like to consider providing an explanation of the internal processing of the tool, as well as a page and API for calculating voice recognition accuracy.
*7:At Advanced Media, we refer to speech recognition accuracy measured with characters as the smallest unit as "character recognition accuracy." For example, "uketamawaru" ("to accept"), written 承る, counts as two characters. The same word may instead be counted as six characters when the kana reading (うけたまわる) is the unit, or as one word when words are the unit.
