Comparing the speech recognition rates of OpenAI's Whisper and AmiVoice for "conference" audio

Shogo Ando
Hello everyone.
In a recent article, we compared the recognition rates of OpenAI's Whisper and AmiVoice:
Measuring the speech recognition rate of OpenAI's Whisper (AmiVoice VS Whisper)
Looking back at that article, the result was unsurprising: "AmiVoice wins in AmiVoice's field of expertise, and Whisper wins in Whisper's field of expertise." So this time we compare the two from a different perspective: which one gives better results for conference audio?
Verification method
The verification methods and conditions are as follows:
- We used audio from a meeting provided by one of our clients for research and development purposes.
- The conference audio came from four different industries. We used the first 10 minutes of each recording, for a total of approximately 40 minutes.
- The above audio and its transcripts are not used to train the speech recognition engine.
- Both AmiVoice and Whisper were run in our local environment.
- The speech recognition engines were the equivalent of the AmiVoice API's "Conversation_general" engine, and Whisper's "large" and "large-v2" models.
- Recognition was run with AmiVoice around June 2022, with Whisper (large) around October 2022, and with Whisper (large-v2) in February 2023, in each case using the latest version available at the time.
- Speech recognition accuracy was measured character by character (not word by word).
- Differences that were only spelling (notation) variants, and not real misrecognitions, were checked and corrected through automatic conversion and visual inspection. However, since the check was visual, some may have been overlooked.
- Fillers (unnecessary words) were removed from both the reference transcripts and the speech recognition results before the accuracy was calculated.
The key point here is that this conference audio came from relatively large companies or organizations: a moderator was present, each speaker had their own microphone, and the speakers spoke at an appropriate volume so that all participants could follow (not talking to themselves or muttering). The audio is easy for humans to hear, so it can be considered relatively low in difficulty for speech recognition of meetings.
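For reference, the character-level scoring described in the conditions above can be sketched as follows. This is only an illustration under assumptions: the article does not say which tool or alignment method was used, the `FILLERS` list is hypothetical, and the helper names `normalize` and `count_errors` are made up for this example. The sketch removes fillers and whitespace, aligns the two strings by edit distance, and counts insertion, deletion, and substitution errors.

```python
import re

# Hypothetical filler list for illustration only; the article does not
# state which fillers were removed or how they were detected.
FILLERS = ("えーと", "あのー", "えー")


def normalize(text: str) -> str:
    """Remove the listed fillers and all whitespace before scoring."""
    for filler in FILLERS:
        text = text.replace(filler, "")
    return re.sub(r"\s+", "", text)


def count_errors(reference: str, hypothesis: str) -> dict:
    """Align two strings character by character and count each error type."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + diff,  # match or substitution
                cost[i - 1][j] + 1,         # deletion: reference char missing
                cost[i][j - 1] + 1,         # insertion: extra hypothesis char
            )

    # Walk back through the table to recover one optimal alignment.
    counts = {"ref_chars": n, "ins": 0, "del": 0, "sub": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                counts["sub"] += diff
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


reference = normalize("えーと 2月1日の件について")
hypothesis = normalize("2月1日の件にいて")
print(count_errors(reference, hypothesis))
# {'ref_chars': 10, 'ins': 0, 'del': 1, 'sub': 0}
```

When several alignments have the same minimum cost, the split between error types can differ slightly depending on the backtracking order, so counts from different tools may not match exactly. Simple string replacement can also remove a filler-like sequence from inside a real word, which is one reason a manual check remains useful.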
Measurement results
■AmiVoice (Conversation_general)
| Data | Number of reference characters | Number of insertion errors | Number of deletion errors | Number of substitution errors | Speech recognition accuracy |
| ① | 2905 | 33 | 46 | 34 | 96.11% |
| ② | 2616 | 23 | 21 | 27 | 97.29% |
| ③ | 3011 | 38 | 23 | 44 | 96.51% |
| ④ | 3047 | 36 | 21 | 39 | 96.85% |
| Total | 11579 | 130 | 111 | 144 | 96.68% |
■Whisper (large)
| Data | Number of reference characters | Number of insertion errors | Number of deletion errors | Number of substitution errors | Speech recognition accuracy |
| ① | 2903 | 49 | 113 | 199 | 87.56% |
| ② | 2630 | 37 | 190 | 60 | 89.09% |
| ③ | 3009 | 47 | 138 | 146 | 89.00% |
| ④ | 3046 | 65 | 223 | 90 | 87.59% |
| Total | 11588 | 198 | 664 | 495 | 88.29% |
■Whisper (large-v2)
| Data | Number of reference characters | Number of insertion errors | Number of deletion errors | Number of substitution errors | Speech recognition accuracy |
| ① | 2901 | 60 | 132 | 161 | 87.83% |
| ② | 2625 | 67 | 204 | 55 | 87.58% |
| ③ | 3011 | 40 | 117 | 122 | 90.73% |
| ④ | 3044 | 65 | 189 | 85 | 88.86% |
| Total | 11581 | 232 | 642 | 423 | 88.80% |
Discussion
AmiVoice had the best results. In terms of error rate (CER), AmiVoice was 100% - 96.68% = 3.32%. Calculated the same way, Whisper (large) was 11.71% and Whisper (large-v2) was 11.20%, which is more than three times higher. That is a very large difference.
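The percentages in the tables and the error rates above can be reproduced from the "Total" rows. The sketch below assumes that the first numeric column of each table is the character count of the reference transcript and that accuracy is 1 - (insertions + deletions + substitutions) / reference characters. The article does not state the formula, but this reading matches every published percentage. The names `TOTALS` and `cer` are chosen for this example.

```python
# "Total" rows copied from the three tables:
# (reference characters, insertions, deletions, substitutions)
TOTALS = {
    "AmiVoice": (11579, 130, 111, 144),
    "Whisper (large)": (11588, 198, 664, 495),
    "Whisper (large-v2)": (11581, 232, 642, 423),
}


def cer(ref_chars: int, insertions: int, deletions: int, substitutions: int) -> float:
    """Character error rate: all errors divided by reference characters."""
    return (insertions + deletions + substitutions) / ref_chars


for name, counts in TOTALS.items():
    rate = cer(*counts)
    print(f"{name}: accuracy {1 - rate:.2%}, CER {rate:.2%}")

# Ratio of each Whisper model's CER to AmiVoice's CER.
baseline = cer(*TOTALS["AmiVoice"])
for name in ("Whisper (large)", "Whisper (large-v2)"):
    print(f"{name} / AmiVoice: {cer(*TOTALS[name]) / baseline:.2f}x")
```

Because insertions are counted as errors but are not part of the reference length, this definition of accuracy can in principle drop below 0% for very poor output. That is not an issue for the figures here.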
AmiVoice's speech recognition rate of 96.68% is relatively high compared with our other in-house experiments. The microphones were set up properly, the speakers spoke clearly to the participants, and the audio was easy for humans to hear, which is probably why the accuracy was so high.
What caught my attention is that Whisper produced a fair number of misrecognitions. When I checked them, I found two main patterns.
- Whisper misrecognition pattern 1: Mistaking similar-sounding words, ignoring the context
  - Competition → After today
  - Trending favorably → Beware of favorable conditions
  - Everywhere is investing → Dog is investing too
  - At ◯◯ or something → ◯◯ Internal Medicine
- Whisper misrecognition pattern 2: Using kanji conversions that are rarely seen
  - Weighted average → overfilled average
  - Quarterly report → Commercial machine report
  - Surplus funds management → Good management
  - Remaining period → Interim period
  - Foreign currency → surgical fee
  - Transient → familial
With pattern 1, a human listener might think "I suppose it could sound like that," but once the context and other factors are considered, these are strange misrecognitions. (This pattern also occurs with AmiVoice, but it was more frequent with Whisper in this audio.)
Pattern 2 uses kanji conversion that doesn't come up much in web searches, so it's a mystery why it produced such an output.
Whisper also has a particularly high number of deletion errors. When I investigated the cause, I was surprised to find that Whisper intelligently deletes unnecessary text. For example, for the following utterance, Whisper may output something like this:*1
- Speaker: "January, no, the first of February, wasn't it February 1st? Sorry, is that okay? About that day."
- Whisper: "Regarding February 1st..."*2
This behavior improves the readability of the speech recognition results, but the output cannot be counted as correct: the omitted parts are scored as deletion errors. Depending on the application, deletion errors caused by this behavior may not be a problem, so it may be reasonable to take this into account when evaluating the results.
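To get a feel for how much the deletion errors weigh, the "Total" rows of the tables can be split by error type. The sketch below computes the share of deletions among all errors and, as a deliberately generous bound, the error rate if every deletion were ignored. That bound is an assumption of this sketch and not the article's evaluation method: only some deletions come from intentional omission, so a fair adjusted figure would lie somewhere between this bound and the published CER. The first value of each tuple is assumed to be the reference character count.

```python
# "Total" rows copied from the three tables:
# (reference characters, insertions, deletions, substitutions)
TOTALS = {
    "AmiVoice": (11579, 130, 111, 144),
    "Whisper (large)": (11588, 198, 664, 495),
    "Whisper (large-v2)": (11581, 232, 642, 423),
}


def deletion_share(ref_chars: int, insertions: int, deletions: int, substitutions: int) -> float:
    """Fraction of all errors that are deletion errors."""
    return deletions / (insertions + deletions + substitutions)


def cer_ignoring_deletions(ref_chars: int, insertions: int, deletions: int, substitutions: int) -> float:
    """Lower bound on the error rate: every deletion is treated as acceptable."""
    return (insertions + substitutions) / ref_chars


for name, counts in TOTALS.items():
    print(
        f"{name}: deletions are {deletion_share(*counts):.1%} of errors, "
        f"CER ignoring deletions {cer_ignoring_deletions(*counts):.2%}"
    )
```

With these totals, deletions make up about half of Whisper's errors but under 30% of AmiVoice's, and Whisper's error rate stays above AmiVoice's even under this generous bound.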
Summary
This time, we compared AmiVoice and Whisper using conference audio.
The results showed a significant difference: AmiVoice's error rate (CER) was less than one-third of Whisper's.
Whisper seems to have a tendency to output similar-sounding words without considering the context, and it often misrecognizes words by converting them to kanji characters that are not commonly used in Japanese.
Whisper also tends to intelligently delete unnecessary text, which makes it prone to more deletion errors. These errors may not be a problem depending on your use case, so you can take them with a pinch of salt when comparing and evaluating.
Person who wrote this article
Shogo Ando
While researching speech recognition, I found a speech recognition company nearby and joined the company, where I continue to work to this day.
My hobbies are traveling abroad, eating delicious food, and saunas.
@anpyan
*1: Because the data is sensitive, the actual utterance has been altered.
*2: Just to be sure, I ran Whisper on this sentence alone, and the entire sentence appeared in the speech recognition result without any omissions. Perhaps Whisper changes its behavior depending on the context and overall structure of the speech.
