AmiVoice API Update: End-to-End ASR–Ready “Keyword Biasing”
AmiVoice API now supports user dictionary registration[2] in its End-to-End ASR models[1].
In this article, we provide an overview of the feature and explain how to use it.
User dictionary registration has been well received, but until now it was only available for hybrid speech recognition. Implementing it for end-to-end speech recognition poses technical hurdles, and it is generally less effective there than in hybrid recognition. We will also explain the technical innovations AmiVoice has made to close this gap.
End-to-end speech recognition is a powerful option when dealing with general-purpose vocabularies.
TL;DR
- In End-to-End speech recognition, the feature that allows you to register specific words you want the system to recognize reliably is referred to as “keyword biasing.”
- Although the operating principle and detailed behavior differ, it serves the same purpose as "word registration" in hybrid speech recognition.
- You can use it by registering only the Written form.
- There is no concept of classes.
- The biasing level can be adjusted on a per‑word basis.
Table of contents
- TL;DR
- Table of contents
- Overview
- How does word registration differ between hybrid ASR and End‑to‑End ASR?
- How to use
- Written
- Alternative written
- Biasing level
- TIPS
- Set the reading for Alternative written
- Set the biasing level
- Register frequently misrecognized variants as Alternative writtens
- Do not assign desired output words as alternative writtens.
- Do not assign multiple Written forms to the same Alternative written.
- Do not assign different biasing levels to the same Written form.
- What should you do when classes cannot be used?
- Do not register words that are too short or have homonyms
- Avoid registering an excessive number of entries.
- So which approach performs better: End‑to‑End ASR or hybrid ASR?
- Reference
Overview
Due to its architecture, end-to-end speech recognition cannot use the word registration mechanism of conventional hybrid speech recognition as is.
Therefore, research and development into alternative functions using different methods is being actively carried out.
Advanced Media has also developed and is currently providing this technology. We refer to this feature as “keyword biasing” to distinguish it from word registration in hybrid ASR.
How does word registration differ between hybrid ASR and End‑to‑End ASR?
Viewed simply as "registering words so they appear more reliably in the results," the two are exactly the same. On MyPage, keyword biasing is also listed under user dictionary registration, in the same category as word registration. However, because the underlying mechanisms differ, so does the way you use them.
To put it very simply, E2E speech recognition is a black box, so there are fewer types of information that can be used to influence score calculation.
Details
Let's take a closer look at what this means (it gets a little technical).

In speech recognition, multiple candidate outputs are scored and compared to determine the final result, and both the hybrid and E2E types intervene in this scoring to revise upward the scores of candidates that contain registered words.
However, unlike the E2E type, the hybrid type computes scores through the collaboration of several modules. For example, the acoustic model and pronunciation dictionary, which estimate the pronunciation and spelling of words from speech, are separate from the language model, which estimates the plausibility of the word order. As a result, there is a phase in processing where reading information is handled explicitly, and the readings registered by the user can influence the score calculation. In addition, if the language model is trained with classes[3], as AmiVoice's is, the scores can take into account the classes the user has registered.
On the other hand, E2E models are like all-in-one appliances that return text strings directly from voice input. They are not designed to handle information such as readings or classes explicitly, so that information cannot be used in the score calculation.
This lack of opportunity to intervene can be a hurdle.
In the E2E type, a simple implementation that directly adds a bias to the scores of candidates containing the specified string is common, and AmiVoice is no exception. As a result, it is easy for a carefully registered word to never even appear among the candidates, rendering the biasing ineffective. In particular, since there is no abstraction through classes, the exact word must surface as a candidate. As explained later, a new setting has been introduced to address this issue.
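A rough sketch (for illustration only, not AmiVoice's actual implementation) of the simple approach described above: every candidate that contains a registered string gets a fixed boost added to its score before the best candidate is chosen.

```python
# Naive keyword biasing, sketched for illustration only: candidates that
# contain a registered string get a fixed additive score boost.

def bias_scores(candidates, keywords, boost=2.0):
    """candidates: list of (text, score). Returns rescored list, best first."""
    rescored = []
    for text, score in candidates:
        if any(kw in text for kw in keywords):
            score += boost  # upward revision for matching candidates
        rescored.append((text, score))
    return sorted(rescored, key=lambda c: c[1], reverse=True)

# If the registered word never appears among the candidates, the boost has
# nothing to act on, which is exactly the weakness described above.
candidates = [("we tested amiboizu today", -4.0),
              ("we tested AmiVoice today", -5.5)]
print(bias_scores(candidates, ["AmiVoice"])[0][0])  # the boosted candidate wins
```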
Additionally, research and development is underway on an improved version that addresses these weaknesses, and this will be available soon.
This explanation is rather abbreviated. How word registration works and the differences between hybrid and E2E systems are covered in separate articles, so if you're interested, please take a look.
How to use
You can assign three items: “Written”, “Alternative written”, and “Biasing level”.
The only required item is the Written: simply set the Written form of the word you want to appear and register it. For Japanese, we also recommend registering the reading in katakana as an Alternative written.
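As a sketch, one registration can be pictured as a record with three fields (the field names here are illustrative, not the API's actual parameter names):

```python
# Illustrative shape of one user-dictionary entry; the field names are
# hypothetical and do not match the actual API parameters.
entry = {
    "written": "AmiVoice",                 # required: the form shown in results
    "alternative_writtens": ["アミボイス"],   # optional: e.g. the katakana reading
    "biasing_level": 0.5,                  # optional: 0 to 1; blank defaults to 0.5
}
print(entry["written"])
```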

Written
Set the word representation you want to appear in the recognition results. This is the same as registering words for hybrid speech recognition.
Alternative written
Here you can specify a string whose pronunciation is close to the word you want to recognize, and have it output in the Written form. This is an optional item.
This is especially effective for foreign words and proper nouns. In the case of Japanese, it is effective to specify the reading kana as described below.
Note that, as described in the notes below, any string specified as an Alternative written will also be biased (boosted). If this produces unintended behavior, consider adjusting the biasing level accordingly.
Supplemental
This is probably the parameter that most people are confused about.
If the string specified here appears among the recognition candidates, the following two things happen, which is what makes the behavior above possible.
- The scores of candidates containing this string are increased, just as with the Written
- The string is replaced with the Written form
As mentioned in the explanation of how it works, keyword biasing has no effect if the word never appears as a recognition candidate in the first place. It is easiest to think of Alternative writtens as a way to help with exactly that case.
Even if the string registered as the Written does not appear among the candidates, a similar string can be picked up, replaced with the Written, and output. With the Written alone you can only match one exact string; Alternative writtens increase the number of strings you can catch.
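A minimal sketch of those two steps (boost, then replace), again using hypothetical names rather than AmiVoice internals:

```python
# Sketch of the two-step behavior of an Alternative written:
# 1) candidates containing it are boosted like the Written itself,
# 2) the string is rewritten to the Written in the final output.

def recognize(candidates, written, alternative, boost=2.0):
    best_text, best_score = None, float("-inf")
    for text, score in candidates:
        if written in text or alternative in text:
            score += boost  # step 1: same boost as for the Written
        if score > best_score:
            best_text, best_score = text, score
    return best_text.replace(alternative, written)  # step 2: replacement

candidates = [("it sounded like アミボイス", -4.0),
              ("it sounded like a mibo is", -3.5)]
print(recognize(candidates, "AmiVoice", "アミボイス"))  # it sounded like AmiVoice
```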
Biasing level
Specify a numerical value that controls the strength of the bias applied to the speech recognition score. You can set a value between 0 and 1, and the higher the value, the more likely the word will appear in the recognition results. This setting is for each Written, and applies to any associated Alternative writtens. This is an optional item.
If you are unsure, you can leave it blank. If left blank, it will be treated as 0.5.
A value of 0 means no biasing.
A value of 1 is the maximum, but it does not guarantee 100% recognition. Instead, it applies a reasonably strong bias within a range that avoids unnatural recognition results.
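That rule can be sketched as follows (the clamping of out-of-range values is an assumption for illustration; the API may simply reject them):

```python
def effective_level(level=None):
    """Blank (None) is treated as 0.5; valid values lie between 0 and 1.
    Clamping is an illustrative assumption, not documented API behavior."""
    if level is None:
        return 0.5
    return max(0.0, min(1.0, level))

print(effective_level())     # default when left blank
print(effective_level(0.0))  # 0 means no biasing at all
```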
TIPS
Set the reading for Alternative written
In the case of Japanese, specifying the reading kana here will make it easier to recognize uncommon words and coined words. Katakana is more effective, so we recommend using katakana.
(Originally, this was an item for setting "character strings that are close in pronunciation to the character string you want to recognize," rather than reading kana, but katakana written form also falls under this category, so it can be used in this way.)
Example:
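For instance (a hypothetical entry reusing this article's product name; any uncommon or coined word works the same way):

```python
# Hypothetical entry: register the katakana reading as an Alternative written
# so the uncommon word is easier to catch among the candidates.
entry = {
    "written": "AmiVoice",              # the form you want in the output
    "alternative_written": "アミボイス",   # the reading, in katakana
}
print(entry["alternative_written"])
```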

Set the biasing level
If the registered word appears in unintended places, try lowering the level; if the registered word rarely appears, try raising it.
Register frequently misrecognized variants as Alternative writtens
If you register the reading as an Alternative written and adjust the biasing level but the word still does not appear in the results, it may be that neither the Written nor the reading is being generated as a candidate, so keyword biasing has nothing to act on. In particular, if the same misrecognition occurs repeatedly, setting that misrecognized string as an Alternative written instead can make the keyword biasing take effect.
When using this method, you can reduce side effects by setting the biasing level to a low value.
Example:
For example, suppose you register the word "AmiVoice," but it is not recognized and instead comes out as the misrecognition "Amiboizu." In this case, register "Amiboizu" as an Alternative written.
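The before/after behavior can be sketched like this (field names hypothetical; the low biasing level limits side effects, as noted above):

```python
# Hypothetical registration for a recurring misrecognition: the misrecognized
# string is caught as an Alternative written and rewritten to the Written.
entry = {"written": "AmiVoice", "alternative_written": "Amiboizu",
         "biasing_level": 0.3}  # kept low to limit side effects

before = "today we introduced Amiboizu"  # recurring misrecognition
after = before.replace(entry["alternative_written"], entry["written"])
print(after)  # today we introduced AmiVoice
```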
Do not assign desired output words as alternative writtens.
Since Alternative writtens are replaced by Writtens, strings specified as Alternative writtens will generally not appear in the recognition results (the replacement occurs regardless of the biasing level). Therefore, do not specify strings that you might want to appear as-is depending on the context.
Bad examples:
For example, if "モース硬度" is repeatedly misrecognized as "マウスコード" and you set "マウスコード" as the Alternative written, then even when the speaker actually says "マウスコード," it will be output as "モース硬度."

Conversely, you can also use this to suppress words you do not want to appear. In that case you only want the replacement, so it is a good idea to set the biasing level to a low value so that the unwanted word is not over-boosted. Note, however, that this does not guarantee the word will never appear.
Example:
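For example (hypothetical values), if "マウスコード" should never appear in your output, it can be registered as an Alternative written of the form you prefer:

```python
# Hypothetical suppression entry: the replacement to the Written happens
# regardless of the biasing level, so a low level avoids boosting the
# unwanted string itself.
entry = {"written": "モース硬度", "alternative_written": "マウスコード",
         "biasing_level": 0.1}

result = "マウスコードは7です".replace(entry["alternative_written"],
                                entry["written"])
print(result)  # マウスコード has been rewritten to モース硬度
```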

Do not assign multiple Written forms to the same Alternative written.
If such registrations exist, it is ambiguous which Written the Alternative written should be converted to. In practice only one of the Writtens will take effect, but this makes the behavior hard to predict, so it is best avoided.
Bad examples:
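A hypothetical illustration of the problem: the same Alternative written pointing at two different Writtens leaves the replacement target ambiguous.

```python
# Hypothetical bad registration: one Alternative written maps to two Writtens,
# so it is unclear which replacement should apply; only one will take effect.
entries = [
    {"written": "AmiVoice",  "alternative_written": "アミボイス"},
    {"written": "Ami Voice", "alternative_written": "アミボイス"},  # conflict
]
targets = {e["written"] for e in entries if e["alternative_written"] == "アミボイス"}
print(len(targets))  # 2 competing Writtens for the same Alternative written
```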

Do not assign different biasing levels to the same Written form.
The data structure treats the Written as the basic unit of registration, with one biasing level and multiple Alternative writtens attached to it. For API and UI reasons, entries are registered one Alternative written at a time even when the Written is the same, but if different biasing levels are specified for each Alternative written, only one of the values will be used.
Bad examples:
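A hypothetical illustration (the word variants are invented for the example): two Alternative writtens of one Written registered at different levels.

```python
# Hypothetical bad registration: the same Written appears with two different
# biasing levels; in practice only one of the values will be used.
entries = [
    {"written": "AmiVoice", "alternative_written": "アミボイス",  "biasing_level": 0.8},
    {"written": "AmiVoice", "alternative_written": "アミヴォイス", "biasing_level": 0.3},
]
levels = {e["biasing_level"] for e in entries}
print(sorted(levels))  # two conflicting levels registered for one Written
```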

Good example:
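A hypothetical illustration of a consistent registration: every Alternative written attached to one Written carries the same biasing level.

```python
# Hypothetical good registration: all Alternative writtens of the same
# Written share a single biasing level.
entries = [
    {"written": "AmiVoice", "alternative_written": "アミボイス",  "biasing_level": 0.5},
    {"written": "AmiVoice", "alternative_written": "アミヴォイス", "biasing_level": 0.5},
]
levels = {e["biasing_level"] for e in entries}
print(len(levels))  # exactly one level per Written
```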

What should you do when classes cannot be used?
There is no corresponding feature. In other words, the idea is that you do not need to do anything: leave context and other judgments entirely to the speech recognition decoder.
Do not register words that are too short or have homonyms
This is the same as with the hybrid model.
For example, if you register the Written "橋" with the Alternative written "はし," then "橋" may also be output where "端" or "箸" was intended, or even inside "走る" (はしる).
Avoid registering an excessive number of entries.
This is similar to the hybrid approach: aim for 1,000 words or fewer to start with, and a few thousand at most.
So which approach performs better: End-to-End ASR or hybrid ASR?
If your use case involves many technical terms or proper nouns, or if you want to use classes, consider the hybrid engine.
Reference
- Keyword biasing official documentation
- Differences and features between hybrid speech recognition and end-to-end speech recognition
- Tips for registering words
- For details on the differences in specifications between the hybrid and end-to-end types, please click here.
1. https://docs.amivoice.com/en/amivoice-api/manual/engines/#difference-between-end-to-end-and-hybrid ↩︎
2. With this release, we have changed the general term for these features to "User Dictionary Registration": the hybrid "Word Registration" and the E2E "Keyword Biasing" are collectively referred to as "User Dictionary Registration." ↩︎
3. This refers to abstraction based on word types, such as people's names and place names. https://acp.amivoice.com/blog/2022-01-13-101135/#%E5%8D%98%E8%AA%9E%E7%99%BB%E9%8C%B2%E3%81%AE%E3%82%AF%E3%83%A9%E3%82%B9%E3%81%A8%E3%81%AF ↩︎
Person who wrote this article
Mikiyasu Kobayashi
I am developing an application.
My interest in animal vocal communication led me to choose a speech recognition company.
I like observing nature, traveling, puzzles, games, etc. I feel like ChatGPT has become the person I chat with the most in my life.
