A quick explanation of how speech recognition works!


Hello everyone!
This blog provides information about our company's speech recognition technology, but until now, there has actually been no article about the crucial "mechanism of speech recognition".
To be honest, if I were to try to properly explain how speech recognition works, it would take up an entire book.
So in this article, I would like to explain the mechanism of speech recognition in a rough and easy-to-understand manner.
Table of contents
- What is "speech recognition"?
- Acoustic Analysis
- Recognition Decoder
- How the hybrid recognition decoder works
- For more accurate speech recognition: Choosing the right engine
- Conclusion
- Acknowledgments
What is "speech recognition"?
"Speech recognition" generally refers to the technology that converts speech into text. The process of converting speech into text is broadly divided into two steps, as shown in the figure below.
- Extracting features from audio = Acoustic Analysis
- Obtaining the recognition result text from those features = Recognition Decoder

Now let's take a look at each of these processes.
Acoustic Analysis
For example, even if it is the same sound "a", the waveform of the sound will change depending on the gender and age of the speaker, the microphone used for recording, etc. The image below shows the waveforms of me saying "a", recorded with a laptop's built-in microphone and with a headset microphone.*1

If you look closely, you can see that there are differences in the waveforms even for the same "a". For this reason, instead of inputting the speech waveform data directly into the recognition decoder, the sound characteristics are quantified through acoustic analysis, and the resulting numerical values (features) are input to the recognition decoder.
Specifically, the acoustic analysis is carried out in the following steps:
- Cut the audio waveform into short sections (for example, 10 milliseconds, which is 1/100th of a second)
- For each extracted section (called a frame), measure and quantify the strength of each frequency
The result of calculating how strong each frequency of the sound is, frame by frame, is called a spectrogram. Please refer to the figure below.

I visualized the spectrograms of the two types of "a" sounds mentioned earlier and the "i" sound recorded with a headset microphone. The horizontal axis of the spectrogram represents time, and the vertical axis represents frequency*2. The brighter the color, the stronger the sound at that frequency. Since the audio this time was recorded by elongating the sounds "あー" and "いー", there is almost no change in the spectrogram over time.
Earlier we mentioned that "even the same sound 'あ' has different waveforms", but when we create a spectrogram we can see that the positions of the bright areas are similar. We can also see that the positions of the bright areas differ between "あ" and "い". By analyzing the strength of each frequency, we can see the characteristics of the sounds "あ" and "い" themselves, regardless of the speaker, microphone, and so on.
There are several ways to obtain features (numerical values) from this spectrogram, but since it would be complicated, we won't go into detail in this article. If you're interested, try searching for keywords such as "speech recognition features."
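To make the two steps of acoustic analysis described above a little more concrete, here is a minimal Python sketch. It is only an illustration under several assumptions, not the processing any real engine uses: the input is a synthetic 1000 Hz sine wave rather than recorded speech, the frames do not overlap, the frequency strengths come from a naive discrete Fourier transform, and the names `split_into_frames` and `magnitude_spectrum` are made up for this example.

```python
import cmath
import math

def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut a waveform into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def magnitude_spectrum(frame):
    """Strength of each frequency bin in one frame (naive discrete Fourier transform)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

# Assumption: synthetic input, 50 ms of a 1000 Hz sine wave sampled at 8000 Hz.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(400)]

frames = split_into_frames(signal, SAMPLE_RATE)         # 10 ms frames of 80 samples each
spectrogram = [magnitude_spectrum(f) for f in frames]   # one spectrum per frame

# Bin k corresponds to k * SAMPLE_RATE / frame_length Hz.
strongest_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
strongest_hz = strongest_bin * SAMPLE_RATE / len(frames[0])
print(len(frames), strongest_hz)
```

Each row of `spectrogram` corresponds to one vertical slice of the spectrogram figure. Practical systems typically use overlapping frames and a window function, which this sketch leaves out for brevity.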
Recognition Decoder
The features converted from speech are input into a recognition decoder. Recognition decoders can be broadly divided into two types: "DNN-HMM Hybrid type (hereafter, Hybrid type)" and "End-to-End type". Before explaining each type, we will explain DNNs, which are used in both types.
What is DNN?
DNN (Deep Neural Network) is, as the name suggests, a "deep" "neural network." Neural networks originated as computer models of biological neural networks.

Each arrow in the diagram above represents a "connection", and the strength of each connection changes as the network learns from data. The vertical columns of red circles in the model on the left are called "layers", and a network that stacks many of these layers (making it deeper) to perform complex processing, as in the model on the right, is called a DNN.
In speech recognition, this DNN is used as a "Classifier". In both Hybrid and End-to-End systems, the input is the acoustic features of speech obtained through acoustic analysis. The DNN determines which sound the features resemble (in phoneme units for Hybrid systems and in character or word units for End-to-End systems) and outputs the probability.
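The idea of a network that takes features and outputs a probability for each sound can be sketched roughly as follows. This is a toy example under stated assumptions: the weights are hand-picked rather than learned, there is only one hidden layer (a real DNN stacks many more), the input is a made-up two-dimensional feature vector, and only three phonemes are considered.

```python
import math

def relu(values):
    """Activation function: negative values become 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Assumption: hypothetical hand-picked weights; a real DNN learns them from data.
PHONEMES = ["a", "i", "m"]
HIDDEN_W = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
HIDDEN_B = [0.0, 0.0, -0.5]
OUTPUT_W = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
OUTPUT_B = [0.0, 0.0, 0.0]

def classify(features):
    """Return a probability for each phoneme given one feature vector."""
    hidden = relu(dense(features, HIDDEN_W, HIDDEN_B))
    probs = softmax(dense(hidden, OUTPUT_W, OUTPUT_B))
    return dict(zip(PHONEMES, probs))

probs = classify([0.9, 0.1])          # made-up feature vector
best = max(probs, key=probs.get)      # phoneme with the highest probability
print(best, probs)
```

The connection strengths mentioned above correspond to the numbers in `HIDDEN_W` and `OUTPUT_W`; training adjusts them so that the output probabilities match the correct sounds.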
Hybrid type
This is a recognition decoder that combines the aforementioned DNN with an HMM (Hidden Markov Model). It consists of three parts: "Pronunciation dictionary", "Acoustic model", and "Language model", each of which has the following roles. We will explain in detail how each component functions, including an explanation of HMMs, in the later section "How the hybrid recognition decoder works".
- Pronunciation dictionary: Defines the phoneme sequences that represent each word
- Acoustic model: Calculates an acoustic score that indicates how likely it is that the input features correspond to the speech of a given word
- Language model: Calculates a language score that indicates how linguistically natural a sequence of words is
The process of searching for the "word sequence" with the highest score based on the acoustic score and language score to determine the recognition result is called decoding.

End-to-End Type
This is a new type of recognition decoder that has been actively researched in recent years. As shown in the image below, when speech features are input into a neural network, it outputs characters or word-pieces (sequences of characters that frequently appear in sentences) directly, rather than phonemes. There are far more kinds of characters and words than phonemes, which makes identifying them correctly a harder task. In recent years, advances in neural network research and computing technology have made it possible to construct and train neural networks with various innovations, and these are beginning to achieve results in speech recognition as well.

Compared to the Hybrid type diagram, the recognition decoder only contains a neural network. One of the advantages is that the structure is simple because the neural network does everything up to outputting the recognition result text, and there is no need to prepare a pronunciation dictionary, acoustic model, or language model separately.*3
The Hybrid type has the advantage of making it easier to customize the pronunciation dictionary and language model, so AmiVoice currently mostly uses the Hybrid type. The End-to-End type can also be customized, but it takes more time and effort than the Hybrid type. If these issues can be addressed in the future, the End-to-End type may become the mainstream for AmiVoice. When that happens, I hope to be able to explain it in detail in this blog!
How the hybrid recognition decoder works
Earlier, we explained that a hybrid recognition decoder is composed of three parts: "Pronunciation dictionary", "Acoustic model", and "Language model". From here, we will take a detailed look at how each of these parts functions.
Pronunciation dictionary
First, I will explain the "Pronunciation dictionary". The pronunciation dictionary links words such as "AMI" (an abbreviation of our company name), "秋" and "紙" with their pronunciations and phonemic transcriptions such as "a-m-i", "a-k-i", and "k-a-m-i". The pronunciation dictionary enables words to be represented as phoneme sequences (strings of phonemes). Conversely, this means that words not written in the pronunciation dictionary cannot be represented as phoneme sequences, and therefore will not appear in the recognition results.
| Words | Pronunciation | Phonemic transcription |
|---|---|---|
| AMI | あみ | a-m-i |
| 秋 | あき | a-k-i |
| 紙 | かみ | k-a-m-i |
| … | … | … |
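In code, the table above behaves like a simple lookup table. The following is a minimal sketch, with `PRONUNCIATION_DICT` and `to_phonemes` being names invented for this illustration; it also shows why a word that is not registered can never appear in the recognition result.

```python
# The three entries come from the table above; the structure itself is an assumption.
PRONUNCIATION_DICT = {
    "AMI": ["a", "m", "i"],
    "秋": ["a", "k", "i"],
    "紙": ["k", "a", "m", "i"],
}

def to_phonemes(word):
    """Return the phoneme sequence for a word, or None if the word is not registered."""
    return PRONUNCIATION_DICT.get(word)

print(to_phonemes("AMI"))  # ['a', 'm', 'i']
print(to_phonemes("網"))   # None: not in the dictionary, so it cannot be recognized
```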
Acoustic model
The acoustic model calculates an acoustic score that represents how likely it is that the features obtained through acoustic analysis correspond to the pronunciation of a given word. As briefly mentioned in the "Hybrid type" section, DNN and HMM are used in combination here. Let's take a look at each of them.
What is HMM?
In speech recognition, HMMs are used as models to represent the "time series of phoneme changes within a word." See the diagram below.

Let's say you say the word "AMI" (pronounced "あみ" and written in phonemes as "a-m-i"). In the explanation of acoustic analysis, we talked about "cutting the speech waveform into small sections". One frame is a very short time of about 10 milliseconds (1/100th of a second), so if you say "あみ", the frames might progress over time as follows, for example:
10 frames for "a" → 5 frames for "m" → 15 frames for "i"
When this is expressed using an HMM, it becomes:
- In the first frame, move from the left black circle to "a" (Black Arrow)
- Stay at "a" from the 2nd frame to the 10th frame (Blue arrow)
- Move from "a" to "m" in the 11th frame (Black Arrow)
- …
By expressing it this way, the effect of speech speed can be absorbed.
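Here is a small sketch of why this representation absorbs differences in speaking speed. It assumes one HMM state per phoneme and ignores transition probabilities, and the helper names `expand` and `collapse` are made up for this example. Staying in a state (the blue arrow) simply repeats that state for another frame, so a slow utterance and a fast one pass through the same sequence of states.

```python
from itertools import groupby

def expand(durations):
    """Per-frame state sequence: enter each state once, then stay for its remaining frames."""
    return [phoneme for phoneme, n_frames in durations for _ in range(n_frames)]

def collapse(frame_states):
    """Merge consecutive repeats (self-loops) to recover the phoneme sequence."""
    return [state for state, _ in groupby(frame_states)]

# The first utterance uses the frame counts from the example above.
normal = expand([("a", 10), ("m", 5), ("i", 15)])   # 30 frames
# Assumption: the same word spoken at half the speed.
slow = expand([("a", 20), ("m", 10), ("i", 30)])    # 60 frames

print(collapse(normal))  # ['a', 'm', 'i']
print(collapse(slow))    # ['a', 'm', 'i']
```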
DNN in Acoustic Models
In the acoustic model, the feature vector*4 of one frame of speech obtained through acoustic analysis is input into the DNN. The DNN determines which phoneme the feature vector of that frame resembles and outputs the probability.

Combining DNNs and HMMs
We will explain how DNN and HMM are combined, using the utterance of the word "あみ" (a-m-i) as an example.
For example, if the DNN output changes from a "state with high probability of 'a'" to a "state with high probability of 'm'" in an intermediate frame, the HMM can also be mapped to transition from the state "a" to the state "m". This mapping is performed up to the frame where the state "i" is estimated to end. Then, using the "probability indicating which phoneme each frame resembles" that has been output by the DNN up to that point, it is possible to calculate the probability that the word "AMI" was uttered. The value calculated based on this probability is called the acoustic score of that word.
However, the acoustic model itself does not know the correct answer as to what the speaker is saying. There are other words that could be candidates, such as "あき" (a-k-i) and "かみ" (k-a-m-i). The acoustic scores of these words are also calculated and they are set as candidates for the speech recognition result.
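The mapping between frames and HMM states described above can be sketched with a small dynamic-programming search (commonly known as the Viterbi algorithm). This is a simplified illustration under several assumptions: the per-frame probabilities in `FRAME_PROBS` are invented rather than real DNN outputs, there are only six frames, each phoneme is a single state, and transition probabilities are ignored. The acoustic score here is the log-probability of the best alignment.

```python
import math

def acoustic_score(frame_probs, phonemes):
    """Best log-probability of aligning the frames to the phonemes in order.

    At each frame the path either stays in the current phoneme or moves to the next.
    """
    neg_inf = float("-inf")
    # best[j]: best score of any alignment that is in phoneme j at the current frame
    best = [neg_inf] * len(phonemes)
    best[0] = math.log(frame_probs[0][phonemes[0]])
    for probs in frame_probs[1:]:
        new_best = []
        for j, phoneme in enumerate(phonemes):
            stay = best[j]
            move = best[j - 1] if j > 0 else neg_inf
            new_best.append(max(stay, move) + math.log(probs[phoneme]))
        best = new_best
    return best[-1]  # the alignment must finish in the last phoneme

# Assumption: made-up DNN outputs for six frames (each row sums to 1).
FRAME_PROBS = [
    {"a": 0.7, "k": 0.1, "m": 0.1, "i": 0.1},
    {"a": 0.7, "k": 0.1, "m": 0.1, "i": 0.1},
    {"a": 0.2, "k": 0.1, "m": 0.6, "i": 0.1},
    {"a": 0.1, "k": 0.2, "m": 0.6, "i": 0.1},
    {"a": 0.1, "k": 0.05, "m": 0.05, "i": 0.8},
    {"a": 0.1, "k": 0.05, "m": 0.05, "i": 0.8},
]

CANDIDATES = {
    "AMI": ["a", "m", "i"],
    "秋": ["a", "k", "i"],
    "紙": ["k", "a", "m", "i"],
}

scores = {word: acoustic_score(FRAME_PROBS, ph) for word, ph in CANDIDATES.items()}
best_word = max(scores, key=scores.get)
print(best_word, scores)
```

With these invented numbers, "AMI" receives the highest acoustic score, while "秋" and "紙" remain as lower-scoring candidates.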
Language models
In the pronunciation dictionary shown in the example in the "Pronunciation Dictionary" section above, the only word with the phonetic transcription "a-m-i" is "AMI". However, there are generally several words with the phonetic transcription "a-m-i", such as "網" and "編み". How can we distinguish between these?
In such cases, "Context" is what humans naturally use to understand the meaning of spoken language. For example, if "あみ" appears in the sentence "あみのおんせいにんしき", it is likely to be "AMI", and if "あみ" appears in the sentence "あみでさかなをつかまえる", it is likely to be "網". In a Hybrid recognition decoder, it is the language model that plays the role of determining this "Context".
A (Japanese) language model is a model that determines how natural a sequence of words is as Japanese and assigns a language score. The more natural the sequence is as Japanese, the higher the language score becomes.
The "score" of this language model is calculated from the probability that a sequence of words appears.

The above graph shows the probability of the word that comes after the word sequence "メロス-は" in "走れメロス". If a language model were created using only the text of "走れメロス", the word sequence "メロス-は-激怒" would have a higher language score than the word sequence "メロス-は-単純".
Of course, the language model used in actual speech recognition is not created solely from the text of "走れメロス". In order to cover as many Japanese expressions as possible, text from various genres, such as parliamentary minutes and news articles, is collected and statistically processed to create the language model. The amount of text used is enormous, with file sizes ranging from several gigabytes to tens of gigabytes, roughly equivalent to hundreds of thousands to millions of copies of "走れメロス" (approximately 10,000 characters).
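As a rough illustration of "statistically processing" text to obtain the probability of the next word, here is a minimal count-based sketch. The article does not specify what kind of model is used, so the three-word (trigram) counting below is an assumption, and the three sentences in `TOY_CORPUS` are toy data written for this example, not the actual text or statistics of "走れメロス".

```python
from collections import Counter

def train_trigram(sentences):
    """Count how often each two-word context is followed by each next word."""
    context_counts = Counter()
    trigram_counts = Counter()
    for words in sentences:
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            context_counts[(w1, w2)] += 1
            trigram_counts[(w1, w2, w3)] += 1
    return context_counts, trigram_counts

def next_word_prob(model, w1, w2, w3):
    """Probability that w3 follows the word sequence w1-w2 (no smoothing)."""
    context_counts, trigram_counts = model
    if context_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]

# Assumption: toy sentences that are already split into words.
TOY_CORPUS = [
    ["メロス", "は", "激怒", "し", "た"],
    ["メロス", "は", "激怒", "し", "て", "い", "た"],
    ["メロス", "は", "単純", "な", "男", "だ"],
]

model = train_trigram(TOY_CORPUS)
p_gekido = next_word_prob(model, "メロス", "は", "激怒")
p_tanjun = next_word_prob(model, "メロス", "は", "単純")
print(p_gekido, p_tanjun)
```

In this toy data "メロス-は-激怒" gets the higher probability, and therefore the higher language score. Practical models also need a way to handle word sequences that never appeared in the training text, which this sketch simply scores as 0.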
Decoding and Hypothesis
The process of searching for the "word sequence" with the highest score based on the acoustic score and language score obtained from the "Pronunciation dictionary," "Acoustic model," and "Language model," and using it as the recognition result is called decoding. This section explains an overview of the decoding process.
Hypotheses are all the possible Japanese sentences that could be the content of the speech. These become the candidates for the speech recognition result.
The current season may not be summer, but let's say, for example, that I said, "暑中お見舞い申し上げます". Naturally, "暑中お見舞い申し上げます" is one hypothesis, but at the same time, "こんばんは" or "板垣死すとも自由は死せず" could also be hypotheses. More plausible hypotheses might include "焼酎を振る舞い申し上げます" or "書中を見舞いも牛上げます". If we assume there are 100,000 words in Japanese, even limiting ourselves to 10-word sentences, there are 100,000^10 = 10^50 possible hypotheses.
In ideal speech recognition, the acoustic score and language score for each word are added together for all of these hypotheses to calculate the score of each hypothesis, that is, the probability of matching the utterance "暑中お見舞い申し上げます". Then, the hypothesis with the highest score is taken as the recognition result.
The scores for the hypotheses above should look like the table below.
| Hypothesis | Acoustic score | Language Score | The reason |
|---|---|---|---|
| "焼酎を振る舞い申し上げます" | Low | High | Although it does not match the audio, it is a natural sentence in Japanese. |
| "書中を見舞いも牛上げます" | High | Low | The audio matches, but the sentence is not natural in Japanese. |
| "暑中お見舞い申し上げます" | High | High | The phonetics match and the sentence sounds natural in Japanese. |
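The table can be turned into numbers as in the sketch below. All of the scores are invented log-probability-like values chosen only so that "High" and "Low" match the table; they are not measurements, and simply adding the two scores with equal weight is an assumption for illustration.

```python
# Assumption: invented scores (closer to 0 means better), matching High/Low in the table.
HYPOTHESES = {
    "焼酎を振る舞い申し上げます": {"acoustic": -40.0, "language": -8.0},
    "書中を見舞いも牛上げます": {"acoustic": -12.0, "language": -45.0},
    "暑中お見舞い申し上げます": {"acoustic": -10.0, "language": -9.0},
}

def total_score(scores):
    """Score of a hypothesis: acoustic score plus language score."""
    return scores["acoustic"] + scores["language"]

totals = {text: total_score(s) for text, s in HYPOTHESES.items()}
recognition_result = max(totals, key=totals.get)
print(recognition_result, totals)
```

Only the hypothesis that scores well on both sides ends up with the best total.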
Note that in reality, we do not score all 10^50 hypotheses. This is because a hypothesis whose score is already (relatively) very poor partway through is highly likely to lose to other hypotheses in the overall score, even if we calculate its score all the way to the end. Eliminating hopeless hypotheses such as "板垣…" at an early stage in this way is called pruning. In decoding, performing this pruning appropriately makes it possible to produce recognition results in a realistic processing time.
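One common way to implement this kind of pruning is a beam search, sketched below. This is a toy model under stated assumptions, not a description of AmiVoice's decoder: the sentence is split into three positions, each candidate word has an invented score that does not depend on the preceding words, and `beam_width` (the number of partial hypotheses kept at each step) is a parameter introduced for this example.

```python
def beam_search(steps, beam_width=None):
    """Extend hypotheses one word at a time, keeping only the best few.

    steps: one {word: score} dict per position in the sentence.
    beam_width: how many partial hypotheses survive each step (None = no pruning).
    Returns the best hypothesis and the number of partial hypotheses scored.
    """
    beam = [((), 0.0)]
    scored = 0
    for candidates in steps:
        expanded = [(words + (word,), score + word_score)
                    for words, score in beam
                    for word, word_score in candidates.items()]
        scored += len(expanded)
        # Pruning: sort by score and drop everything outside the beam.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0], scored

# Assumption: invented per-word scores (closer to 0 means better).
STEPS = [
    {"暑中": -1.0, "焼酎": -2.0, "板垣": -9.0},
    {"お見舞い": -1.0, "を振る舞い": -3.0, "死すとも": -9.0},
    {"申し上げます": -1.0, "も牛上げます": -6.0, "自由は死せず": -9.0},
]

full_best, full_count = beam_search(STEPS)                    # score everything
pruned_best, pruned_count = beam_search(STEPS, beam_width=2)  # prune at every step
print(full_best, full_count)
print(pruned_best, pruned_count)
```

Both runs return the same best hypothesis, but the pruned run scores far fewer partial hypotheses because "板垣…" is dropped after the first word. If the beam is too narrow, the correct hypothesis itself can be pruned, which is why the pruning has to be done appropriately.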
For more accurate speech recognition: Choosing the right engine
To obtain highly accurate speech recognition results, it is important to use acoustic models and language models*5 that fit the speech and content of the utterance as closely as possible. For example, recording meeting audio with a microphone and recording call center conversations with a headset differ greatly in sound quality and recording environment. Additionally, the content of speech in a doctor's examination room and the content of speech by a legislator in a parliamentary session differ significantly in both the types of words that appear and their frequency. For this reason, AmiVoice creates multiple acoustic models and language models and provides them in combination.
In AmiVoice, this combination of acoustic model and language model is called an engine*6. Many AmiVoice products and APIs allow you to select an engine. For example, AmiVoice API provides engines such as "Conversation_General-purpose" and "Medical_Voice Input."
We also offer engines with customized acoustic and language models to meet your needs. Using the right engine is key to maximizing the power of AmiVoice, so be sure to try out the engine that suits you best!
Conclusion
In this article, I have roughly explained how speech recognition works. Some people may be surprised at how short it is (I, the author, was surprised as well). Converting speech into text is something that humans take for granted, but it can only be achieved on a computer by combining a variety of technologies. I hope that I have been able to convey at least a little of the depth and excitement of this technology.
If you are a developer who has become interested in speech recognition technology or the AmiVoice API after reading this article, please try https://acp.amivoice.com/amivoice_api/. You can use all engines for free for up to 60 minutes of audio per month.
Thank you for reading this far!
Acknowledgments
I would like to express my gratitude to my senior colleague, Yoshitomo Kiso, for his advice in writing this article.
Person who wrote this article

Takashi Okura
I joined Advanced Media as a new graduate.
My current job mainly involves research and development to improve the accuracy of speech recognition.
My hobbies include traveling (mainly trains), reading (mainly novels), and board games.
*1: For recording audio and displaying waveforms and spectrograms, I used an open-source tool called Audacity ( https://www.audacityteam.org/ ).
*2: As you can see when you look closely at the numbers on the vertical axis, the intervals are wide in the low-frequency (low-pitched sound) sections and narrow in the high-frequency (high-pitched sound) sections. This is based on the characteristic of human hearing that is sensitive to frequency differences in low sounds while being insensitive to frequency differences in high sounds, and is called the mel scale. When converting speech waveforms into features, the mel scale is often taken into consideration, so the mel scale was adopted in the visualization figures.
*3: It is possible to combine language models, and some research has shown that this can improve recognition accuracy.
*4: In order to use information from before and after the relevant frame, feature values from several to several dozen frames may be input together.
*5: Because pronunciation dictionaries are often determined by language models, the two are sometimes collectively referred to as "language models."
*6: Depending on our products, different terms may be used, such as "engine mode" or "master dictionary."
