[Comparative verification using the same utterance] Differences in recognition results between Voice Input engines and Conversation engines

Totonoi Samurai
Hello! I'm Totonoi Samurai, a sales representative.
This time, we will explain the features of engines that adopt the AmiVoice API's acoustic model for Voice Input (hereinafter referred to as "Voice Input engines") and engines that adopt the acoustic model for Conversation input (hereinafter referred to as "Conversation engines"), as well as the usage scenarios that match each of them.
Differences between Voice Input engines and Conversation engines
First, let's start by explaining the basic mechanisms of speech recognition.
This article will provide a brief explanation, but if you would like more details, please see the following article:
AmiVoice Cloud Platform-Tech Blog
Our Hybrid speech recognition engine consists of a "language model" + "acoustic model" + "pronunciation dictionary".
A "language model" is a model that is learned from large amounts of text data and expresses the probability of which words or phrases are likely to appear before or after a certain word or phrase.
Large language models such as ChatGPT have recently become popular, and the language models used in speech recognition are similar in kind. While they are not as intelligent as ChatGPT, they are easy to customize, run quickly, and use little memory.
AmiVoice API offers two language models: a "General-purpose" model that can be used in a wide range of situations and businesses, and a "Domain-specific" model that recognizes specialized terminology such as medical and financial terminology with high accuracy.
An "acoustic model" is a model that learns the characteristics of sounds such as "あ", "い", and "う" from a large amount of data. Even within the same Japanese language, we provide acoustic models optimized for the purpose, such as speaking style and speaking environment. In AmiVoice API, there are basically two types of acoustic model: "Voice Input" and "Conversation".
Before explaining the overview of the two engines, let's first explain the difference between speech during general voice input and speech during conversation.
Speech during voice input
・The speaker talks relatively slowly and pronounces words clearly
・There are few hesitations *1
・The speaker sometimes wants to input punctuation and symbols by voice
Speech during conversation
・The speaker tends to talk quickly and slur words
・There tend to be many hesitations
・The speaker does not say punctuation marks aloud
Based on the above, we have summarized the features of the Voice Input engine and Conversation engine in a table.
| | Voice Input engine | Conversation engine |
|---|---|---|
| Matching audio | ・Speech in the tone used when talking to a device such as a smartphone or tablet ・Clear pronunciation | ・Conversational tone used when speaking person to person ・Somewhat unclear pronunciation is acceptable |
| Usage scenario | Voice Input for daily reports and emails, and speech generation by voicebots | Conversations such as meetings and calls |
| Hesitations in the training data | Few | Many |
| Recognizes spoken punctuation and symbols | Supported | Partly unsupported |
| Number of words learned by the General-purpose engine | Many (about 1.5 times more than the Conversation engine) → Voice Input speech is acoustically easy to recognize, so mistakes are unlikely even with a large number of registered words. | Standard → Conversational speech is acoustically difficult, so with many registered words it is easy to confuse words that sound similar. |
Which engine to use depends on the audio you want to recognize.
For example, if a salesperson wants to dictate notes on a sales negotiation into their iPhone after a business meeting, the Voice Input engine is the right choice. If a call center wants to recognize conversations between an operator and a customer, the Conversation engine is the right choice.*2
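The selection rule described above, together with the 8 kHz restriction in footnote *2, can be sketched as a small helper. This is only an illustration: the function name, the `SpeakingStyle` enum, and the 16 kHz default are assumptions made for this example, and the returned strings are the engine labels used in this article, not actual API parameter values.

```python
from enum import Enum


class SpeakingStyle(Enum):
    """Hypothetical labels for the two kinds of speech compared in this article."""
    TO_DEVICE = "to_device"        # dictation aimed at a smartphone, tablet, voicebot, etc.
    CONVERSATION = "conversation"  # person-to-person speech such as meetings and calls


def choose_general_purpose_engine(style: SpeakingStyle, sampling_rate_hz: int = 16000) -> str:
    """Return the article's label for the general-purpose engine that fits the audio.

    Assumption: the returned strings are display labels from this article,
    not identifiers accepted by the AmiVoice API.
    """
    # Footnote *2: 8 kHz (telephone-quality) audio is supported only by
    # the Conversation_General-purpose engine.
    if sampling_rate_hz == 8000:
        return "Conversation_General-purpose"
    if style is SpeakingStyle.CONVERSATION:
        return "Conversation_General-purpose"
    return "Voice Input_General-Purpose"


# A salesperson dictating notes into a phone after a meeting
print(choose_general_purpose_engine(SpeakingStyle.TO_DEVICE))
# A call-center recording transmitted over a telephone line
print(choose_general_purpose_engine(SpeakingStyle.CONVERSATION, sampling_rate_hz=8000))
```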
Please see below for some specific examples of customers who are actually using the service.
■Voice Input engine
■ Conversation engine
Accuracy verification
So let's actually use the two engines to verify the recognition rate.
Verification method
The verification methods and conditions are as follows:
・We used audio provided by our customers for research and development purposes.
・The following two types of audio were used:
① Voice recordings of conversations during meetings, presentations, etc.
② Speech reading out daily reports or news scripts, spoken as if to a device
・The audio files ① and ② are each approximately 30 minutes long.
・The speech recognition engines used were the AmiVoice API's "Voice Input_General-Purpose" and "Conversation_General-Purpose" engines.
・Speech recognition accuracy was measured character by character (not word by word).
・Apparent misrecognitions caused only by spelling variations were corrected by automatic conversion and visual checks. Because the checks were visual, some may have been overlooked.
・Hesitations (filler words) were removed from both the correct text and the speech recognition results before calculation.
Please see the following article for information on how to measure recognition rates.
AmiVoice Cloud Platform-Tech Blog
Measurement results
① Voice recordings of conversations during meetings, presentations, etc.
| Engine | Number of correct characters | Number of insertion errors | Number of deletion errors | Number of substitution errors | Speech recognition accuracy |
|---|---|---|---|---|---|
| Voice Input_General-Purpose | 9789 | 448 | 1316 | 1063 | 71.12% |
| Conversation_General-purpose | 9788 | 474 | 356 | 350 | 87.94% |
It seems that Voice Input engines are not suited to conversational language.
On the other hand, the Conversation engine has a much better recognition rate than the Voice Input engine.
② Speech reading out daily reports, news articles, etc., spoken as if to a device
| Engine | Number of correct characters | Number of insertion errors | Number of deletion errors | Number of substitution errors | Speech recognition accuracy |
|---|---|---|---|---|---|
| Voice Input_General-Purpose | 5721 | 19 | 28 | 81 | 97.76% |
| Conversation_General-purpose | 5727 | 38 | 29 | 78 | 97.47% |
Although there is not much difference between the Voice Input engine and the Conversation engine, the Voice Input engine had a slightly higher recognition rate. If you are only using it in scenario ②, it may be a good idea to use the Voice Input engine.
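For readers who want to check the arithmetic, the published figures are consistent with the formula below, in which "Number of correct characters" is read as the number of characters in the correct text. That reading and the function name are assumptions inferred from the numbers. The article itself does not state the formula, so please treat this as a sketch rather than the official measurement procedure.

```python
def recognition_accuracy(reference_chars: int, insertions: int,
                         deletions: int, substitutions: int) -> float:
    """Character-level accuracy in percent.

    Assumed formula (inferred from the published figures, not stated in the article):
        accuracy = (N - I - D - S) / N * 100
    where N is the number of characters in the correct text.
    """
    errors = insertions + deletions + substitutions
    return (reference_chars - errors) / reference_chars * 100


# Figures copied from the measurement results: (N, insertions, deletions, substitutions)
results = {
    "1 / Voice Input_General-Purpose": (9789, 448, 1316, 1063),
    "1 / Conversation_General-purpose": (9788, 474, 356, 350),
    "2 / Voice Input_General-Purpose": (5721, 19, 28, 81),
    "2 / Conversation_General-purpose": (5727, 38, 29, 78),
}

for label, counts in results.items():
    print(f"{label}: {recognition_accuracy(*counts):.2f}%")
# 1 / Voice Input_General-Purpose: 71.12%
# 1 / Conversation_General-purpose: 87.94%
# 2 / Voice Input_General-Purpose: 97.76%
# 2 / Conversation_General-purpose: 97.47%
```

Because insertion errors are subtracted as well, this measure can drop sharply when the engine outputs extra characters, not only when it misses or replaces them.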
Recognition result details
① Voice recordings of conversations during meetings, presentations, etc.
<Correct answer>
"採用向けのサイトもうん企業様の撮影なんですけどうんうんこの案件は基本的には私はちょっとお受けしてないんですね"
<Recognition results of the Voice Input engine>
"採用向けのサイトの大きいおさまの撮影なんですけど(うんうん)この案件は基本的には私と同期してないんですね"
<Recognition results of the Conversation engine>
"採用向けのサイトのうん企業様の撮影なんですけどうんうんうんこの案件は基本的には私はちょっとお受けしてないんですね"
② Speech reading out daily reports, news articles, etc., spoken as if to a device
<Correct answer>
"自分にとって価値がなければどんなに際立った違いがあっても振り向いてはくれないのです"
<Recognition results of the Voice Input engine>
"自分にとって価値がなければどんなに際立った違いがあっても振り向いてはくれないのです"
<Recognition results of the Conversation engine>
"自分にとって価値がなければどんなに気を当たった違いがあっても振り向いてはくれないのです"
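To show how character-by-character errors can be counted for a pair like the ② example above, here is a minimal sketch using a standard Levenshtein (edit distance) alignment. The function name is hypothetical, and this is not necessarily the tool used for the verification in this article. It only illustrates the idea of substitution, deletion, and insertion errors.

```python
def count_edit_operations(reference: str, hypothesis: str) -> dict:
    """Count character-level substitutions, deletions, and insertions
    along one minimum-edit-distance alignment (plain Levenshtein DP)."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner and tally the operations used.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitutions"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1   # spoken in the reference, missing from the result
            i -= 1
        else:
            counts["insertions"] += 1  # present in the result, not in the reference
            j -= 1
    return counts


correct = "自分にとって価値がなければどんなに際立った違いがあっても振り向いてはくれないのです"
conversation_result = "自分にとって価値がなければどんなに気を当たった違いがあっても振り向いてはくれないのです"
print(count_edit_operations(correct, conversation_result))
# {'substitutions': 2, 'deletions': 0, 'insertions': 2}
```

In this example, "際立" is aligned with two substituted characters and the result contains two extra characters, so a single misheard word costs four character errors.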
Consideration
・In case ①, the speech was quite casual and the sound quality was poor, so the accuracy was relatively low, whereas in case ②, the speech was slightly clearer than in case ① and the audio was of high quality, so recognition was highly accurate.
・The Conversation engine is versatile and recognizes speech with high accuracy in both ① and ②, but if the speech is limited to a specific usage scenario like ②, the Voice Input engine is the better choice.
・The number of deletion errors (characters that were spoken in the audio but are missing from the recognition result) for Voice Input_General-Purpose in ① is extremely high. This is thought to be because the engine did not treat those utterances as human speech. Voice Input engines may fail to detect ambiguous, casual speech as speech at all.
Summary
This time, we introduced the features of Voice Input engines and compared them with Conversation engines.
We hope you understand that using different engines depending on the application will enable more accurate recognition.
Only the general-purpose language model was tested this time, but domain-specific engines for finance, insurance, and other fields are also available in both "Voice Input" and "Conversation" versions, so please feel free to try them out.
(60 minutes free for each engine every month!)
Well, that's all for now.
Person who wrote this article
Totonoi Samurai
I am a sales employee in my third year after graduating from university.
Recently, I often go to local saunas.
It's a pleasure to enjoy local gourmet food after the sauna.
*1: Also called "fillers". These are meaningless words such as "えっと", "そのー", and "あのー". AmiVoice API has a function to automatically delete hesitations, reducing the effort required to correct recognition results later.
*2: Because call center voices are transmitted over telephone lines, they often have a sound quality equivalent to a sampling frequency of 8 kHz. The only engine that supports this sound quality is the Conversation_General-purpose Engine. If you would like to recognize 8 kHz voices using a non-general engine (such as in medical or financial fields), please use AmiVoice API Private.