Comparing the speech recognition rates of OpenAI's Whisper and AmiVoice for "conference" audio

Published: 2023.06.26 Last updated: 2025.03.06

Shogo Ando

Hello everyone.

In a recent article, we compared the recognition rates of OpenAI's Whisper and AmiVoice:

Measuring the speech recognition rate of OpenAI's Whisper (AmiVoice VS Whisper)

Looking back at that article, the result was predictable: "AmiVoice wins in its field of expertise, and Whisper wins in its field of expertise." So this time we compare them from a different perspective: which one gives better results for conference audio?

Verification method

The verification methods and conditions are as follows:

  • We used audio from a meeting provided by one of our clients for research and development purposes.
  • The audio came from meetings in four different industries. We used the first 10 minutes of each, approximately 40 minutes in total.
  • The above audio and its transcripts are not used to train the speech recognition engine.
  • Both AmiVoice and Whisper were run in our local environment.
  • For AmiVoice, we used the equivalent of the AmiVoice API "Conversation_general" engine. For Whisper, we used the "large" and "large-v2" models.
  • Speech recognition was run with AmiVoice around June 2022, with Whisper (large) around October 2022, and with Whisper (large-v2) in February 2023. In each case we used the latest version available at the time.
  • Speech recognition accuracy was measured character by character (not word by word).
  • Apparent misrecognitions that were only variations in spelling were checked and corrected through automatic conversion and visual inspection. Because part of the check was visual, some may have been overlooked.
  • Fillers (unnecessary words) were removed from both the reference transcripts and the speech recognition results before calculating accuracy.
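
As a rough illustration of character-by-character scoring, the sketch below aligns a reference transcript with a recognition result using a plain edit-distance alignment and counts insertion, deletion, and substitution errors. The article does not say which scoring tool was used, so the function names (`count_errors`, `accuracy_percent`) and the tie-breaking between equally good alignments are assumptions made for this example. It also assumes fillers and spelling variations have already been normalized in both strings.

```python
def count_errors(reference: str, hypothesis: str) -> dict:
    """Align two strings character by character and count edit operations.

    Hypothetical helper for illustration; not the scoring tool used in the article.
    """
    n, m = len(reference), len(hypothesis)
    # cell[i][j] = (total errors, insertions, deletions, substitutions)
    # for the best alignment of reference[:i] with hypothesis[:j].
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)  # hypothesis is empty: everything was deleted
    for j in range(1, m + 1):
        cell[0][j] = (j, j, 0, 0)  # reference is empty: everything was inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, ins, dele, sub = cell[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, ins, dele, sub)          # correct character
            else:
                diagonal = (total + 1, ins, dele, sub + 1)  # substitution
            total, ins, dele, sub = cell[i - 1][j]
            deletion = (total + 1, ins, dele + 1, sub)
            total, ins, dele, sub = cell[i][j - 1]
            insertion = (total + 1, ins + 1, dele, sub)
            # Keep the candidate with the fewest total errors.
            cell[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, ins, dele, sub = cell[n][m]
    return {"ref_chars": n, "ins": ins, "del": dele, "sub": sub}


def accuracy_percent(counts: dict) -> float:
    """Accuracy as 100 * (1 - errors / reference characters)."""
    errors = counts["ins"] + counts["del"] + counts["sub"]
    return 100.0 * (1.0 - errors / counts["ref_chars"])


# Made-up strings: "c" is recognized as "x" (substitution), "e" is dropped (deletion).
counts = count_errors("abcdef", "abxdf")
print(counts)  # {'ref_chars': 6, 'ins': 0, 'del': 1, 'sub': 1}
print(f"{accuracy_percent(counts):.2f}%")
```

Because insertions count as errors too, accuracy computed this way can in principle drop below 0% when the recognition result is much longer than the reference.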

The key point here is that this audio came from meetings at relatively large companies or organizations: a moderator was present, each speaker had their own microphone, and the speakers spoke at an appropriate volume so that all participants could follow (not talking to themselves or muttering). The audio is easy for humans to hear, so we can say that, as meeting audio goes, its difficulty level for speech recognition is relatively low.

Measurement results

AmiVoice (Conversation_General)

Data   Reference characters  Insertion errors  Deletion errors  Substitution errors  Accuracy
1      2905                  33                46               34                   96.11%
2      2616                  23                21               27                   97.29%
3      3011                  38                23               44                   96.51%
4      3047                  36                21               39                   96.85%
Total  11579                 130               111              144                  96.68%


Whisper (large)

Data   Reference characters  Insertion errors  Deletion errors  Substitution errors  Accuracy
1      2903                  49                113              199                  87.56%
2      2630                  37                190              60                   89.09%
3      3009                  47                138              146                  89.00%
4      3046                  65                223              90                   87.59%
Total  11588                 198               664              495                  88.29%


Whisper (large-v2)

Data   Reference characters  Insertion errors  Deletion errors  Substitution errors  Accuracy
1      2901                  60                132              161                  87.83%
2      2625                  67                204              55                   87.58%
3      3011                  40                117              122                  90.73%
4      3044                  65                189              85                   88.86%
Total  11581                 232               642              423                  88.80%
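
The accuracy figures in the three tables are consistent with the formula accuracy = 1 - (insertions + deletions + substitutions) / N, where N is the first count column read as the number of characters in the reference transcript. The article does not state the formula explicitly; this reading is inferred from the numbers. The sketch below recomputes every row and total under that assumption (the names `TABLES`, `accuracy_percent`, and `total_row` are made up for the example).

```python
# Each row: (reference characters, insertions, deletions, substitutions, published accuracy %)
TABLES = {
    "AmiVoice (Conversation_General)": [
        (2905, 33, 46, 34, 96.11),
        (2616, 23, 21, 27, 97.29),
        (3011, 38, 23, 44, 96.51),
        (3047, 36, 21, 39, 96.85),
    ],
    "Whisper (large)": [
        (2903, 49, 113, 199, 87.56),
        (2630, 37, 190, 60, 89.09),
        (3009, 47, 138, 146, 89.00),
        (3046, 65, 223, 90, 87.59),
    ],
    "Whisper (large-v2)": [
        (2901, 60, 132, 161, 87.83),
        (2625, 67, 204, 55, 87.58),
        (3011, 40, 117, 122, 90.73),
        (3044, 65, 189, 85, 88.86),
    ],
}


def accuracy_percent(ref_chars, ins, dele, sub):
    """Accuracy as 100 * (1 - errors / reference characters)."""
    return 100.0 * (1.0 - (ins + dele + sub) / ref_chars)


def total_row(rows):
    """Sum the four count columns over all rows of one table."""
    return tuple(sum(row[k] for row in rows) for k in range(4))


for engine, rows in TABLES.items():
    for number, (n, i, d, s, published) in enumerate(rows, start=1):
        computed = accuracy_percent(n, i, d, s)
        print(f"{engine} data {number}: {computed:.2f}% (published {published:.2f}%)")
    print(f"{engine} total: {accuracy_percent(*total_row(rows)):.2f}%")
```

Note that the denominator differs slightly between engines for the same audio (for example 2905, 2903, and 2901 for data 1), presumably because the spelling-variation corrections were applied per engine.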

Consideration

AmiVoice had the best results. In terms of character error rate (CER), AmiVoice scored 100% - 96.68% = 3.32%. Calculated the same way, Whisper (large) scored 11.71% and Whisper (large-v2) scored 11.20%. That is a very large difference of more than three times.

AmiVoice's speech recognition rate of 96.68% is relatively high among our in-house experiments. The microphones were set up properly, the speakers spoke clearly to the participants, and the audio was easy for humans to hear, which is probably why it achieved such high accuracy.

What concerns me is that Whisper produces a fair number of misrecognitions. When I examined them, I found that they fall into two main patterns.

  • Whisper misrecognition pattern 1: mistaking similar-sounding words, ignoring the context
    • Competition → After today
    • Trending favorably → Beware of favorable conditions
    • Everywhere is investing → Dog is investing too
    • At ◯◯ or something → ◯◯ Internal Medicine
  • Whisper misrecognition pattern 2: kanji conversions that are rarely seen
    • Weighted average → overfilled average
    • Quarterly report → Commercial machine report
    • Surplus funds management → Good management
    • Remaining period → Interim period
    • Foreign currency → surgical fee
    • Transient → familial

With pattern 1, a human listener might think "well, maybe it does sound a little like that," but once you consider the context and other factors, these are strange misrecognitions. (This pattern also occurs with AmiVoice, but it was more frequent with Whisper in this audio.)

Pattern 2 uses kanji conversion that doesn't come up much in web searches, so it's a mystery why it produced such an output.

Whisper also has a particularly high number of deletion errors. When I investigated the cause, I was surprised to find that Whisper intelligently deletes unnecessary text. For example, in response to the following utterance, Whisper may output something like this:*1

  • Speaker: "January, no, the first of February, wasn't it February 1st? Sorry, is that okay? About that day."
  • Whisper: "Regarding February 1st..."*2

This behavior improves the readability of the speech recognition results, but the output cannot be considered correct against the reference transcript, so the omitted parts are counted as deletion errors. Depending on the application, deletion errors caused by this behavior may not be a problem, so it may be reasonable to take that into account when evaluating the results.
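
One rough way to take this into account is to recompute the error rate from the table totals with deletion errors left out. This is only an illustrative sketch and not part of the original evaluation: it discards every deletion, including genuine ones where the engine simply missed speech, so it overstates the adjustment. The names `TOTALS` and `error_rates` are made up for the example.

```python
# Totals from the three tables:
# (reference characters, insertion errors, deletion errors, substitution errors)
TOTALS = {
    "AmiVoice (Conversation_General)": (11579, 130, 111, 144),
    "Whisper (large)": (11588, 198, 664, 495),
    "Whisper (large-v2)": (11581, 232, 642, 423),
}


def error_rates(ref_chars, ins, dele, sub):
    """Return (CER %, CER % with all deletion errors ignored)."""
    cer = 100.0 * (ins + dele + sub) / ref_chars
    cer_without_deletions = 100.0 * (ins + sub) / ref_chars
    return cer, cer_without_deletions


for engine, counts in TOTALS.items():
    cer, cer_without_deletions = error_rates(*counts)
    print(f"{engine}: CER {cer:.2f}%, ignoring deletions {cer_without_deletions:.2f}%")
```

Under this deliberately generous adjustment the gap narrows, but the insertion and substitution errors alone still leave Whisper's error rate more than twice AmiVoice's on this audio.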

Summary

This time, we compared AmiVoice and Whisper using conference audio.

The results showed a significant difference: AmiVoice's error rate (CER) was less than one-third of Whisper's.

Whisper seems to have a tendency to output similar-sounding words without considering the context, and it often misrecognizes words by converting them to kanji that are rarely seen in Japanese.

Whisper also tends to intelligently delete unnecessary text, which makes it prone to more deletion errors. Depending on your use case these errors may not be a problem, so you may want to discount them when comparing and evaluating the engines.

Person who wrote this article

  • Shogo Ando

    While researching speech recognition, I found a speech recognition company nearby and joined the company, where I continue to work to this day.

    My hobbies are traveling abroad, eating delicious food, and saunas.

    x : @anpyan

*1: Because the data is sensitive, the utterance shown here has been altered from the actual speech.

*2: Just to be sure, I ran Whisper on this sentence alone, and the entire sentence was output in the speech recognition results without any omissions. Perhaps Whisper changes its behavior depending on the context and the overall structure of the speech.
