Comparing the speech recognition accuracy of AmiVoice's domain-specific engines (General-Purpose vs Electronic Medical Records)

Published: 2021.08.23 Last updated: 2025.03.04

s-andou Shogo Ando

Hello everyone.

Speech recognition is used in a variety of situations. It would be ideal if we could create an engine that could accurately recognize speech no matter what is said in any situation, but in reality, this is difficult to achieve. In particular, in special usage scenarios where technical terms are frequently used, it is easy for misrecognition to occur, and there are cases where the speech recognition technology cannot be used effectively.

That's why AmiVoice has taken the approach of creating a variety of speech recognition engines to support each usage scenario! AmiVoice has a large number of engine development engineers in-house, and there are an enormous number of speech recognition engines, some of which can be used on the AmiVoice Cloud Platform, a service for developers. You can see the specific lineup below.

acp.amivoice.com

This time, we will give an overview of these engine types and show how much their speech recognition accuracy actually differs, using speech dictated for electronic medical records as an example.

Currently, AmiVoice Cloud Platform offers three languages: Japanese, English, and Chinese, but this explanation will be based on Japanese.
Also, the explanation will be based on the lineup as of August 2021.

The difference between "Conversation" and "Voice Input"

First of all, the lineup on the above page has two types: "Conversation" and "Voice Input." These are the different speaking styles that the speech recognition engine supports. The differences are as follows:

  • Conversation
    Trained mainly on speech between people, such as meetings and phone calls, and able to handle mumbled speech to some extent.
  • Voice Input
    Trained mainly on speech that people direct at PCs and smartphones, such as voice commands and text dictation. Think of a clear voice like that of an announcer.

Choosing a speech recognition engine that suits your intended use will likely improve speech recognition accuracy. Depending on how you use it and the situation, there may not be much difference, so it's a good idea to try both and compare them if possible.

What is a Domain-specific engine?

In both the "Conversation" and "Voice Input" lists, the first entry ("Conversation_General-Purpose" and "Voice Input_General-Purpose", respectively) is a "General-Purpose" engine. All other engines are called "Domain-specific" engines.

The main differences are as follows:

  • General-purpose engine
    It is trained on a wide range of language data, focusing on common words and phrases. It is a speech recognition engine that can handle almost anything for a variety of purposes.
  • Domain-specific engine
    It is trained with a focus on the technical terms and phrases of its domain, and it is not trained on words and phrases that are unnecessary for that domain. It is a speech recognition engine specialized for a single domain.

How are Domain-specific engines better than General-purpose engines?

The prices of General-purpose engines and Domain-specific engines differ, with Domain-specific engines generally being priced slightly higher.

The question here is, "How much of a difference is there in performance between a General-purpose engine and a Domain-specific engine?"

To investigate this, we conducted the following experiment.

Experiment setup

Speech recognition accuracy was measured under the following conditions.

  • We created a script based on the electronic medical record data (which does not include personal information, customer information, or other important information) provided by our customers, such as hospitals and clinics, and used audio of our staff reading the script. Furthermore, the content and audio of this script were not used to train the speech recognition engine.
  • The test audio consists of 526 utterances, with approximately 1,990 seconds of speech in total.*1
  • There are 10 male speakers and 9 female speakers. The content of each speaker's speech is different.
  • Two types of speech recognition engines were used: "Voice Input_General-Purpose" and "Voice Input_Electronic Medical Record" from AmiVoice Cloud Platform.
  • Mismatches caused only by notation variants (different valid ways of writing the same word) were resolved by revising the reference transcripts.*2
  • Fillers (unnecessary words) are not included in the measurement of speech recognition accuracy.
  • There are parts in the audio where punctuation marks are spoken as "てん" and "まる", but these are not included in the measurement of speech recognition accuracy.
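
The last two conditions amount to normalizing both the reference transcript and the recognition result before scoring. The following is only a minimal sketch of that idea, not the procedure used in this experiment. It assumes that spoken punctuation appears in the text as "、" and "。" and that fillers can be matched against a fixed list; the filler list and the name `normalize_for_scoring` are hypothetical.

```python
# Hypothetical filler list for illustration only; a real evaluation would use
# whatever filler inventory or markup the engine actually produces.
FILLERS = ("えーと", "えー", "あのー")

# Assumption: spoken "てん" / "まる" appear in the transcripts as these marks.
PUNCTUATION = "、。"

_STRIP_PUNCTUATION = str.maketrans("", "", PUNCTUATION)


def normalize_for_scoring(text: str, fillers=FILLERS) -> str:
    """Remove fillers and punctuation marks so they do not affect scoring."""
    # Remove longer fillers first so that "えーと" does not leave a stray "と".
    for filler in sorted(fillers, key=len, reverse=True):
        text = text.replace(filler, "")
    # Drop the punctuation marks themselves.
    return text.translate(_STRIP_PUNCTUATION)


print(normalize_for_scoring("えー、腎エコーを施行。"))  # 腎エコーを施行
```

Plain substring removal like this can also delete part of a legitimate word, so a real evaluation would more likely work on token boundaries.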

The method for calculating speech recognition accuracy is explained in the article below.

Regarding the "recognition accuracy (recognition rate)" of speech recognition


AmiVoice Cloud Platform-Tech Blog
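
For readers who do not follow the link, a character-level accuracy of the kind commonly used for Japanese can be sketched as follows. The sketch assumes accuracy = (N - S - D - I) / N, where N is the number of characters in the reference and S, D, and I are the substitutions, deletions, and insertions in a minimum-edit alignment. This is a common definition; the linked article remains the authoritative description of the metric used here, and the function names below are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of substitutions, deletions and insertions (Levenshtein)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != hyp_char),   # match or substitution
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """(N - S - D - I) / N, with N the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# First misrecognition example listed later in this article:
# 4 of the 9 reference characters are substituted.
print(f"{char_accuracy('腎エコー・尿生化学', '腎エコー・行政改革'):.2%}")  # 55.56%
```

With this definition, the value can become negative when the recognition result contains many inserted characters.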

Experimental results

The results were as follows.

■ Speech recognition accuracy (overall)

Engine | Speech recognition accuracy
Voice Input_General-Purpose | 87.41%
Voice Input_Electronic Medical Record | 97.61%

Compared to the General-Purpose engine (87.41%), the domain-specific engine "Voice Input_Electronic Medical Record" (97.61%) was about 10 percentage points more accurate.
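
One way to read these two figures is in terms of error rate rather than accuracy. The short calculation below is a sketch of that arithmetic (the function name is hypothetical): the error rate falls from 12.59% to 2.39%, a relative reduction of roughly 81%.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline's errors that the new engine removes.

    Both arguments are accuracies in percent (0 to 100).
    """
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Overall figures from the table above.
reduction = relative_error_reduction(87.41, 97.61)
print(f"{reduction:.1%}")  # 81.0%
```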

■ Speech recognition accuracy (per speaker)

For reference, we have also listed the speech recognition accuracy by speaker.

Speaker | Voice Input_General-Purpose | Voice Input_Electronic Medical Record
Female 1 (50 utterances, approximately 150 seconds) | 82.00% | 95.41%
Female 2 (27 utterances, approximately 90 seconds) | 97.11% | 99.59%
Female 3 (24 utterances, approximately 80 seconds) | 92.99% | 97.22%
Female 4 (25 utterances, approximately 90 seconds) | 89.20% | 98.08%
Female 5 (25 utterances, approximately 80 seconds) | 82.94% | 98.85%
Female 6 (17 utterances, approximately 60 seconds) | 91.30% | 98.55%
Female 7 (25 utterances, approximately 80 seconds) | 93.16% | 98.41%
Female 8 (26 utterances, approximately 170 seconds) | 90.33% | 99.63%
Female 9 (26 utterances, approximately 110 seconds) | 88.21% | 98.13%
Male 1 (54 utterances, approximately 200 seconds) | 79.08% | 94.98%
Male 2 (25 utterances, approximately 90 seconds) | 91.28% | 98.96%
Male 3 (24 utterances, approximately 80 seconds) | 91.63% | 99.06%
Male 4 (24 utterances, approximately 130 seconds) | 78.74% | 95.38%
Male 5 (26 utterances, approximately 90 seconds) | 81.51% | 98.52%
Male 6 (25 utterances, approximately 90 seconds) | 92.49% | 98.66%
Male 7 (25 utterances, approximately 90 seconds) | 82.06% | 96.93%
Male 8 (26 utterances, approximately 100 seconds) | 90.09% | 96.96%
Male 9 (26 utterances, approximately 100 seconds) | 87.97% | 98.73%
Male 10 (26 utterances, approximately 110 seconds) | 87.41% | 97.47%
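
As a quick sanity check on this table, the sketch below re-enters the rows as data, confirms that the per-speaker counts add up to the totals stated in the experiment conditions (526 utterances, approximately 1990 seconds), and finds the speakers with the largest and smallest gain in percentage points. The variable names are illustrative only.

```python
# (speaker, utterances, approx. seconds, general-purpose %, medical-record %)
ROWS = [
    ("Female 1", 50, 150, 82.00, 95.41),
    ("Female 2", 27, 90, 97.11, 99.59),
    ("Female 3", 24, 80, 92.99, 97.22),
    ("Female 4", 25, 90, 89.20, 98.08),
    ("Female 5", 25, 80, 82.94, 98.85),
    ("Female 6", 17, 60, 91.30, 98.55),
    ("Female 7", 25, 80, 93.16, 98.41),
    ("Female 8", 26, 170, 90.33, 99.63),
    ("Female 9", 26, 110, 88.21, 98.13),
    ("Male 1", 54, 200, 79.08, 94.98),
    ("Male 2", 25, 90, 91.28, 98.96),
    ("Male 3", 24, 80, 91.63, 99.06),
    ("Male 4", 24, 130, 78.74, 95.38),
    ("Male 5", 26, 90, 81.51, 98.52),
    ("Male 6", 25, 90, 92.49, 98.66),
    ("Male 7", 25, 90, 82.06, 96.93),
    ("Male 8", 26, 100, 90.09, 96.96),
    ("Male 9", 26, 100, 87.97, 98.73),
    ("Male 10", 26, 110, 87.41, 97.47),
]

total_utterances = sum(row[1] for row in ROWS)
total_seconds = sum(row[2] for row in ROWS)

# Gain of the domain-specific engine over the general-purpose one, in points.
gains = {name: round(emr - general, 2) for name, _, _, general, emr in ROWS}
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

print(total_utterances, total_seconds)  # 526 1990
print(largest, gains[largest])          # Male 5 17.01
print(smallest, gains[smallest])        # Female 2 2.48
```

In this small sample, the largest gains go to speakers whose general-purpose accuracy was lowest, while Female 2, who already scored 97.11% with the General-purpose engine, gains the least.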

Here too, we can see that the Domain-specific engine is more accurate than the General-purpose engine for all speakers (however, it should be noted that the number of utterances per speaker and the audio duration are short).

The recognition accuracy of 97.61% is among the highest figures we have measured in-house. We think the reasons for this high figure are as follows:

  • The engine matches the content of the input speech: audio dictated for electronic medical records, with the content we expected, was processed by the Domain-specific engine for electronic medical records.
  • The audio was recorded for voice input, so pronunciation is clear (pronunciation tends to be less clear in person-to-person speech such as meetings).
  • The speakers were close to the microphone (when the microphone is far from the mouth, more noise tends to be picked up and the characteristics of the voice can change).
  • The recordings were made in a quiet indoor environment.

What specific differences are there in the recognition results?

We have picked out some areas where there are differences in speech recognition accuracy.

■ Comparison of misrecognitions

  • Correct answer: 腎エコー・尿生化学
    General-purpose engine: 腎エコー・行政改革
    Domain-specific engine: 腎エコー・尿生化学
  • Correct answer: パクリタキセル療法を週1回
    General-purpose engine: パクリIt wasキセル両方種類書い
    Domain-specific engine: パクリタキセル療法を週1回
  • Correct answer: 腫瘤内部へ流入する血流シグナルが検出され
    General-purpose engine: 主流なイベ流入する血流シグナルが検出され
    Domain-specific engine: 腫瘤内部へ流入する血流シグナルが検出され
  • Correct answer: テガフールウラシル配合
    General-purpose engine: 手がHuluらしくCooperate
    Domain-specific engine: テガフールウラシル配合

Overall, we can see that the Domain-specific engine handles technical terms well, while the General-purpose engine tends to replace them with similar-sounding everyday words, such as "尿生化学" (urine biochemistry) being recognized as "行政改革" (administrative reform).
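
To locate where two transcripts diverge, as in the comparison above, a character-level diff is often enough. The sketch below uses Python's standard `difflib.SequenceMatcher` on two of the listed pairs; the helper name `differing_spans` is hypothetical. Note that `SequenceMatcher` looks for long matching blocks and is not guaranteed to produce the minimum-edit alignment that an accuracy calculation would use.

```python
from difflib import SequenceMatcher


def differing_spans(reference: str, hypothesis: str) -> list:
    """Return (reference_span, hypothesis_span) pairs where the texts differ."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    return [
        (reference[i1:i2], hypothesis[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two of the reference / General-purpose result pairs listed above.
print(differing_spans("腎エコー・尿生化学", "腎エコー・行政改革"))
# [('尿生化学', '行政改革')]
print(differing_spans("腫瘤内部へ流入する血流シグナルが検出され",
                      "主流なイベ流入する血流シグナルが検出され"))
# [('腫瘤内部へ', '主流なイベ')]
```

In both cases the differing span is exactly the technical term, and the rest of the utterance is recognized identically.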

Conclusion

This time, we used a test set of speech dictated for electronic medical records to measure the performance difference between a General-purpose engine and a Domain-specific engine. The more specialized the content, the greater the benefit of a Domain-specific engine, so we hope you will consider using one.

Additionally, as of August 2021, Domain-specific engines are available for "Medical," "Pharmaceutical," "Insurance," "Finance," "Electronic Medical Records," and other areas, but if you have any opinions or requests such as "I would like this type of specialized engine," please feel free to share them in the comments. While we cannot promise that we will be able to accommodate all requests, we will treat your input as valuable feedback for developing new engines.

About the author

    Shogo Ando

    While researching speech recognition, I found a speech recognition company nearby and joined the company, where I continue to work to this day.

    My hobbies are traveling abroad, eating delicious food, and saunas.

*1: Our regular internal evaluations use a larger amount of audio, but because this was a simple experiment we used a small data set.

*2: These revisions were made through my own visual inspection, so some mismatches caused by notation variants may remain.
