
[Comparative verification using the same utterance] Differences in recognition results between Voice Input engines and Conversation engines

Published: 2023.09.29 Last updated: 2025.03.04


Totonoi Samurai

Hello! I'm Totonoi Samurai, a sales representative.

In this article, we explain the features of engines that use the AmiVoice API's acoustic model for voice input (hereinafter "Voice Input engines") and engines that use its acoustic model for conversation (hereinafter "Conversation engines"), along with the usage scenarios that suit each.

Differences between Voice Input engines and Conversation engines

First, let's start by explaining the basic mechanisms of speech recognition.
This article will provide a brief explanation, but if you would like more details, please see the following article:

A quick explanation of how speech recognition works!



Our Hybrid speech recognition engine consists of a "language model" + "acoustic model" + "pronunciation dictionary".
A "language model" is a model trained on large amounts of text data that expresses the probability of which words or phrases are likely to appear before or after a given word or phrase.
Large language models such as ChatGPT have recently emerged and become popular, and speech recognition language models work on a similar idea. While they are not as intelligent as ChatGPT, they are easy to customize, run quickly, and use little memory.
AmiVoice API offers two types of language model: a "General-purpose" model that can be used in a wide range of situations and businesses, and "Domain-specific" models that recognize specialized vocabulary such as medical and financial terminology with high accuracy.
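
To make the idea of "the probability of which word follows which" concrete, here is a minimal sketch of a toy bigram model. It is only an illustration: the corpus, the function names `train_bigram` and `next_word_prob`, and the relative-frequency estimate are all assumptions made for this example, and it does not describe how the AmiVoice language models are actually built.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word is followed by each other word."""
    follow = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            follow[prev][nxt] += 1
    return follow


def next_word_prob(follow, prev, nxt):
    """Relative-frequency estimate of P(nxt | prev); 0.0 if prev was never seen."""
    total = sum(follow[prev].values())
    return follow[prev][nxt] / total if total else 0.0


# Hypothetical, already-tokenized training text.
corpus = [
    ["speech", "recognition", "engine"],
    ["speech", "recognition", "accuracy"],
    ["speech", "input"],
]
model = train_bigram(corpus)

# "recognition" follows "speech" in 2 of the 3 sentences.
print(next_word_prob(model, "speech", "recognition"))
```

A domain-specific model can be thought of as the same kind of table learned from domain text, so that specialized terms receive higher probabilities.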

An "acoustic model" is a model that learns the characteristics of sounds such as "あ", "い", and "う" from a large amount of data. Even within the same Japanese language, we provide acoustic models optimized for the purpose, such as speaking style and speaking environment. In AmiVoice API, there are basically two types of acoustic model: "Voice Input" and "Conversation".

Before explaining the overview of the two engines, let's first explain the difference between speech during general voice input and speech during conversation.

Speech during voice input

・Speech is relatively slow and clearly pronounced
・There are few hesitations *1
・The speaker sometimes wants to enter punctuation marks and symbols by voice

Utterances during conversation

・Speech tends to be fast and slurred
・Hesitations tend to be frequent
・Punctuation marks are not spoken aloud

Based on the above, we have summarized the features of the Voice Input engine and Conversation engine in a table.

| Item | Voice Input engine | Conversation engine |
|---|---|---|
| Suitable audio | ・Speech in the tone used when talking to a device such as a smartphone or tablet ・Clear pronunciation | ・Conversational tone used when speaking person to person ・Somewhat unclear pronunciation is acceptable |
| Usage scenario | Voice input for daily reports and emails, and speech in voicebot interactions | Conversations such as meetings and calls |
| Hesitations (fillers) learned | Few | Many |
| Recognition of spoken punctuation and symbols | Supported | Some are not supported |
| Number of words learned by the General-purpose engine | Many (about 1.5 times as many as the Conversation engine). Voice Input speech is acoustically easy to recognize, so mistakes are unlikely even with a large number of registered words. | Ordinary. Conversation speech is acoustically difficult, so if there are many registered words, similar-sounding words are easily confused. |

Which engine to use depends on the audio you want to recognize.
For example, a salesperson who wants to dictate notes on a sales negotiation into their iPhone after a business meeting should use the Voice Input engine, while a call center that wants to recognize conversations between operators and customers should use the Conversation engine.*2
Please see below for some specific examples of customers who are actually using the service.

■ Voice Input engine

Hamee Co., Ltd.


acp.amivoice.com

■ Conversation engine

BellFace Inc.


acp.amivoice.com
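
The selection guidance above (together with footnote *2 on telephone audio) can be written down as a small decision helper. This is only a sketch: the function `choose_engine` and its parameters are hypothetical names made up for this example, the returned strings are the engine labels used in this article rather than API identifiers, and only the general-purpose engines are covered.

```python
VOICE_INPUT = "Voice Input_General-Purpose"
CONVERSATION = "Conversation_General-Purpose"


def choose_engine(spoken_to_device: bool, telephone_8khz: bool = False) -> str:
    """Pick a general-purpose engine label from the article's rules of thumb.

    spoken_to_device: True for dictation-style speech (daily reports, emails),
                      False for person-to-person conversation (meetings, calls).
    telephone_8khz:   True when the audio comes over a telephone line
                      (roughly 8 kHz sampling), which the article says only
                      the Conversation general-purpose engine supports.
    """
    if telephone_8khz:
        return CONVERSATION
    return VOICE_INPUT if spoken_to_device else CONVERSATION


# A salesperson dictating notes into a phone after a meeting.
print(choose_engine(spoken_to_device=True))
# A call-center recording of an operator talking with a customer.
print(choose_engine(spoken_to_device=False, telephone_8khz=True))
```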

Accuracy verification

So let's actually use the two engines to verify the recognition rate.

Verification method

The verification methods and conditions are as follows:

・We used audio provided by our customers for research and development purposes.
・The following two types of audio were used:
① Voice recordings of conversations during meetings, presentations, etc.
② Speech reading out daily reports, news scripts, etc. as if speaking to a device
・The audio files ① and ② are each approximately 30 minutes long.
・The speech recognition engines used were the AmiVoice API's "Voice Input_General-Purpose" and "Conversation_General-Purpose" engines.
・Speech recognition accuracy was measured character by character (not word by word).
・Mismatches caused only by spelling variations were corrected by automatic conversion and visual checks. Because the check was visual, some oversights may remain.
・Hesitations (filler words) were removed from both the reference text and the speech recognition results before calculation.
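
As a rough illustration of character-by-character scoring with fillers removed, the sketch below strips fillers and then counts insertion, deletion, and substitution errors from a minimum edit-distance alignment. It is not the tool used for this verification: the function names, the filler list (taken from footnote *1), and the tie-breaking between equally good alignments are assumptions made for the example.

```python
def remove_fillers(text, fillers=("えっと", "そのー", "あのー")):
    """Delete filler strings before scoring (list taken from footnote *1)."""
    for filler in fillers:
        text = text.replace(filler, "")
    return text


def count_errors(reference, hypothesis):
    """Return (insertions, deletions, substitutions) at character level.

    Each cell holds (total_errors, ins, dele, sub) for one minimum-cost
    alignment of reference[:i] with hypothesis[:j].
    """
    n, m = len(reference), len(hypothesis)
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference character deleted
    for j in range(1, m + 1):
        dp[0][j] = (j, j, 0, 0)  # every hypothesis character inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, a, b, c = dp[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, a, b, c)          # match, no cost
            else:
                diagonal = (t + 1, a, b, c + 1)  # substitution
            t, a, b, c = dp[i - 1][j]
            deletion = (t + 1, a, b + 1, c)
            t, a, b, c = dp[i][j - 1]
            insertion = (t + 1, a + 1, b, c)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda x: x[0])
    _, ins, dele, sub = dp[n][m]
    return ins, dele, sub


# Fragment of the case ② example: "際立った" was recognized as "気を当たった".
reference = remove_fillers("えっと際立った違い")
hypothesis = remove_fillers("気を当たった違い")
print(count_errors(reference, hypothesis))  # (2, 0, 2)
```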

Please see the following article for information on how to measure recognition rates.

We measured the speech recognition accuracy (speech recognition rate) of the AmiVoice speech recognition engine.



Measurement results

① Voice recordings of conversations during meetings, presentations, etc.

| Engine | Characters in reference text | Insertion errors | Deletion errors | Substitution errors | Speech recognition accuracy |
|---|---|---|---|---|---|
| Voice Input_General-Purpose | 9789 | 448 | 1316 | 1063 | 71.12% |
| Conversation_General-Purpose | 9788 | 474 | 356 | 350 | 87.94% |

The Voice Input engine does not appear to be suited to conversational speech: its accuracy was 71.12%.
The Conversation engine, on the other hand, achieved a much higher recognition rate of 87.94%.

② Speech reading out daily reports, news articles, etc. as if talking to a device

| Engine | Characters in reference text | Insertion errors | Deletion errors | Substitution errors | Speech recognition accuracy |
|---|---|---|---|---|---|
| Voice Input_General-Purpose | 5721 | 19 | 28 | 81 | 97.76% |
| Conversation_General-Purpose | 5727 | 38 | 29 | 78 | 97.47% |

Although the difference between the two engines is small (97.76% versus 97.47%), the Voice Input engine had a slightly higher recognition rate. If you only use the service in scenario ②, the Voice Input engine is probably the better choice.
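
The accuracy figures in both results tables are consistent with subtracting the three error counts from the first count column (the number of characters in the reference text) and dividing by that count. The sketch below reproduces the four percentages under that assumption. The formula is inferred from the published numbers, and the function name `recognition_accuracy` is made up for this example.

```python
def recognition_accuracy(ref_chars, insertions, deletions, substitutions):
    """Character accuracy = (reference chars - all errors) / reference chars."""
    errors = insertions + deletions + substitutions
    return (ref_chars - errors) / ref_chars


# (reference chars, insertions, deletions, substitutions) from the tables.
results = {
    "① Voice Input_General-Purpose": (9789, 448, 1316, 1063),
    "① Conversation_General-Purpose": (9788, 474, 356, 350),
    "② Voice Input_General-Purpose": (5721, 19, 28, 81),
    "② Conversation_General-Purpose": (5727, 38, 29, 78),
}

for name, counts in results.items():
    print(f"{name}: {recognition_accuracy(*counts):.2%}")
# ① Voice Input_General-Purpose: 71.12%
# ① Conversation_General-Purpose: 87.94%
# ② Voice Input_General-Purpose: 97.76%
# ② Conversation_General-Purpose: 97.47%
```

Because insertion errors are subtracted as well, this measure can in principle fall below 0% when the recognition result contains many extra characters.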

Recognition result details

① Voice recordings of conversations during meetings, presentations, etc.

<Correct answer>
"採用向けのサイトもうん企業様の撮影なんですけどうんうんこの案件は基本的には私はちょっとお受けしてないんですね"

<Recognition results of the Voice Input engine>
"採用向けのサイトの大きいおさまの撮影なんですけど(うんうん)この案件は基本的には私と同期してないんですね"

<Recognition results of the Conversation engine>
"採用向けのサイトうん企業様の撮影なんですけどうんうんうんこの案件は基本的には私はちょっとお受けしてないんですね"

② Speech reading out daily reports, news articles, etc. as if talking to a device

<Correct answer>
"自分にとって価値がなければどんなに際立った違いがあっても振り向いてはくれないのです"

<Recognition results of the Voice Input engine>
"自分にとって価値がなければどんなに際立った違いがあっても振り向いてはくれないのです"

<Recognition results of the Conversation engine>
"自分にとって価値がなければどんなに気を当たった違いがあっても振り向いてはくれないのです"

Discussion

・In case ①, the speech was quite casual and the sound quality was poor, so accuracy was relatively low. In case ②, the speech was clearer than in case ① and the audio was of high quality, so recognition was highly accurate.
・The Conversation engine is the versatile choice, recognizing speech with high accuracy in both ① and ②. However, if you only ever speak in a specific usage scenario like ②, the Voice Input engine is the better choice.
・The number of deletion errors (characters that are spoken in the audio file but missing from the recognition results) for Voice Input_General-Purpose in case ① is extremely high (1,316, versus 356 for Conversation_General-Purpose). We think this is because the engine did not treat those utterances as human speech. Voice Input engines may fail to recognize ambiguous, casual speech as human speech.

Summary

In this article, we introduced the features of Voice Input engines and compared them with Conversation engines.
We hope this shows that choosing the engine to match the application leads to more accurate recognition.
Only the general-purpose language model was tested this time, but we also offer domain-specific engines for finance, insurance, and other fields, in both "Voice Input" and "Conversation" variants, so please feel free to try them out.
(60 minutes free for each engine every month!)
Well, that's all for now.

About the author

  • Totonoi Samurai

    I am a sales employee in my third year after graduating from university.
    Recently, I often go to local saunas.
    It's a pleasure to enjoy local gourmet food after the sauna.

*1: Also called "fillers". These are meaningless words such as "えっと", "そのー", and "あのー". AmiVoice API has a function to automatically delete hesitations, reducing the effort required to correct recognition results later.
*2: Because call center voices are transmitted over telephone lines, they often have a sound quality equivalent to a sampling frequency of 8 kHz. The only engine that supports this sound quality is the Conversation_General-purpose Engine. If you would like to recognize 8 kHz voices using a non-general engine (such as in medical or financial fields), please use AmiVoice API Private.
