Differences between hybrid and end-to-end speech recognition, and the characteristics of each

Hayato Shibata
Advances in deep learning have dramatically improved the accuracy of speech recognition, and new speech recognition systems built entirely on neural networks have emerged. This article explains the advantages and disadvantages of new and traditional methods.
- The traditional "hybrid type" and the recently emerging "end-to-end"
- End-to-end is a simple speech recognition system
- End-to-End Weakness (Adaptation is difficult)
- Differences between hybrid and end-to-end development methods
- Is AmiVoice hybrid or end-to-end?
The traditional "hybrid type" and the recently emerging "end-to-end"
For a long time, the mainstream approach to speech recognition combined an acoustic model, a language model, and a pronunciation dictionary. Because the acoustic model is a hybrid of a DNN (Deep Neural Network) and an HMM (Hidden Markov Model), this article refers to such a system as hybrid speech recognition. The acoustic model calculates an acoustic score for each time frame of the speech, the pronunciation dictionary maps phonemes to words, and the language model calculates a linguistic score for word sequences. The acoustic and linguistic scores are combined with weights, and the word sequence with the highest combined score is output as the recognition result.
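The weighted combination of acoustic and linguistic scores can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the candidate sentences, the log-domain scores, and the `lm_weight` parameter name are hypothetical, and a real decoder searches a far larger space than two candidates.

```python
# Hypothetical candidate word sequences with made-up log-domain scores.
# In a real system the acoustic score comes from the acoustic model (via the
# pronunciation dictionary) and the language score from the language model.
CANDIDATES = {
    "recognize speech": {"acoustic": -12.0, "language": -4.0},
    "wreck a nice beach": {"acoustic": -11.5, "language": -9.0},
}


def combined_score(acoustic, language, lm_weight=1.0):
    """Weighted sum of the acoustic and language scores (log domain)."""
    return acoustic + lm_weight * language


def best_hypothesis(candidates, lm_weight=1.0):
    """Return the word sequence with the highest combined score."""
    return max(
        candidates,
        key=lambda text: combined_score(
            candidates[text]["acoustic"], candidates[text]["language"], lm_weight
        ),
    )


if __name__ == "__main__":
    # With the language model switched off, the acoustically closer but
    # linguistically unlikely candidate wins.
    print(best_hypothesis(CANDIDATES, lm_weight=0.0))
    # With the language model included, the likelier sentence wins.
    print(best_hypothesis(CANDIDATES, lm_weight=1.0))
```

In this toy example the language model is what separates two acoustically similar candidates, which is why the weight between the two scores matters.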

In contrast, end-to-end speech recognition, which recognizes speech with a single neural network, has recently been gaining popularity. It takes speech as input and outputs scores for characters or subwords, and the character string (subword sequence) with the highest score is output as the recognition result.
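As a rough sketch of how per-frame character scores might be turned into a string, the example below uses greedy decoding in the style of CTC. This is an assumption: the article does not name a specific end-to-end architecture, and the blank symbol, the symbol table, and the scores are all hypothetical. Greedy decoding is also only the simplest approximation of finding the highest-scoring sequence.

```python
BLANK = "_"  # hypothetical "no output" symbol used by CTC-style models


def greedy_decode(frame_scores, symbols):
    """Pick the best symbol per frame, merge repeats, and drop blanks."""
    best_per_frame = [
        symbols[max(range(len(row)), key=row.__getitem__)] for row in frame_scores
    ]
    output = []
    previous = None
    for symbol in best_per_frame:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if symbol != previous and symbol != BLANK:
            output.append(symbol)
        previous = symbol
    return "".join(output)


if __name__ == "__main__":
    symbols = [BLANK, "a", "b"]
    # Made-up scores for five frames; each row has one score per symbol.
    frame_scores = [
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
    ]
    print(greedy_decode(frame_scores, symbols))
```

Note that no pronunciation dictionary or separate language model appears anywhere in this sketch; the network's scores are mapped directly to characters.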

For more details on how speech recognition systems work, please refer to our previous article on the AmiVoice Cloud Platform Tech Blog.
End-to-end is a simple speech recognition system
The hybrid type has a very complex mechanism because decoding combines several separate modules. Training also involves many steps: for the acoustic model alone, these include phoneme clustering, creating phoneme alignments, and training the neural network. The end-to-end system, by contrast, is very simple because it is a single neural network. Training does not require a pronunciation dictionary, and a speech recognition system can be built simply by training the neural network on audio and its transcriptions. This makes end-to-end systems easier to develop.
End-to-End Weakness (Adaptation is difficult)
Both hybrid and end-to-end methods convert speech to text, so either can be used for general-purpose recognition. For a specific task, however, they often fail to recognize proper nouns or technical terms, or do not produce the intended results, so adaptation is required. In fact, many speech recognition services offer adaptation and word registration functions, which are essential in practice.
With hybrid speech recognition, only words in the pronunciation dictionary can appear in the recognition results, so a word that is missing from the dictionary must be added. Adding a word is easy: just register the word and its pronunciation in the pronunciation dictionary. Beyond adding words, it is important for hybrid speech recognition to use a language model suited to the task, so collecting text data and adapting the language model can be expected to improve recognition accuracy. Text data is easy to collect, and language model adaptation takes a relatively short time, which makes the hybrid type easy to adapt.
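Word registration in a hybrid system can be pictured with a small sketch. The dictionary layout (word mapped to a list of phoneme tuples), the function names, and the phoneme symbols are hypothetical and are not the format of any particular engine or of the AmiVoice API.

```python
def out_of_vocabulary(words, lexicon):
    """Return the words that cannot be recognized because they have no entry."""
    return [word for word in words if word not in lexicon]


def register_word(lexicon, word, phonemes):
    """Add a pronunciation for a word; a word may have several pronunciations."""
    pronunciations = lexicon.setdefault(word, [])
    if phonemes not in pronunciations:
        pronunciations.append(phonemes)


if __name__ == "__main__":
    # Toy pronunciation dictionary with made-up phoneme symbols.
    lexicon = {"speech": [("s", "p", "iy", "ch")]}
    task_words = ["amivoice", "speech"]

    print(out_of_vocabulary(task_words, lexicon))  # words that must be added
    register_word(lexicon, "amivoice", ("a", "m", "i", "b", "o", "i", "s", "u"))
    print(out_of_vocabulary(task_words, lexicon))  # nothing left to add
```

The point of the sketch is that registration touches only a lookup table: no audio is needed and nothing is retrained.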

End-to-end systems, on the other hand, have no pronunciation dictionary, so adding words is not easy. Because the system is a single neural network, it basically has to be retrained on audio that contains the words to be recognized, together with transcriptions of that audio. Compared with hybrid systems, which can adapt using text alone, the need for audio makes data collection much harder. Training a neural network also takes much longer than adapting a language model. These issues are being actively researched, but the resulting techniques have not yet been put to practical use, which makes end-to-end a difficult method to adapt.
Differences between hybrid and end-to-end development methods
Data Preparation
End-to-end training requires only audio and its transcription, so the only preparation needed for the training data is manual transcription. Hybrid training, on the other hand, requires not only transcribing the audio but also creating a pronunciation dictionary. This involves first defining a phoneme system and then assigning pronunciations to every word according to that system. Anyone who can hear the audio can transcribe it, but creating a pronunciation dictionary also requires linguistic knowledge, which makes it extremely time-consuming.
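The two dictionary-building steps described above (define a phoneme system, then assign pronunciations that follow it) suggest a simple consistency check, sketched below. The phoneme inventory, the dictionary entries, and the function name are hypothetical toy values, not a real phoneme system.

```python
# Step 1: a hypothetical, deliberately tiny phoneme inventory.
PHONEME_SET = {"a", "i", "u", "e", "o", "k", "s", "t", "n", "m", "b"}


def invalid_entries(lexicon, phoneme_set):
    """Map each word to the symbols in its pronunciation that are not in the set."""
    problems = {}
    for word, pronunciation in lexicon.items():
        unknown = sorted({p for p in pronunciation if p not in phoneme_set})
        if unknown:
            problems[word] = unknown
    return problems


if __name__ == "__main__":
    # Step 2: pronunciations assigned by hand; "sh" is not in the inventory.
    lexicon = {
        "neko": ["n", "e", "k", "o"],
        "sushi": ["s", "u", "sh", "i"],
    }
    print(invalid_entries(lexicon, PHONEME_SET))
```

A check like this catches only notation errors. Deciding what the correct pronunciation of each word is remains the part that needs linguistic knowledge.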
Amount of data required
In general, end-to-end requires more training data: hybrid types need hundreds of hours, whereas end-to-end needs thousands of hours. While hybrid types can be built for task-specific (domain-specific) speech recognition, end-to-end types are often developed as general-purpose speech recognition because it is difficult to collect large amounts of speech for a specific task.
Parameter adjustment
Hybrid models recognize speech using a combination of an acoustic model, a language model, and a pronunciation dictionary, so the optimal combination has to be found. This means manually adjusting many parameters, which takes considerable know-how. End-to-end models are a single neural network, so fewer parameters need manual adjustment and they are easier to optimize.
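To give a feel for what adjusting a hybrid system involves, here is a minimal sketch that tunes just one parameter, a language model weight, on a tiny development set. The data, the scores, and the parameter name `lm_weight` are hypothetical; a real system has many more interacting parameters and is evaluated on far more data.

```python
def decode(candidates, lm_weight):
    """Return the text of the candidate with the highest combined score."""
    best = max(candidates, key=lambda c: c["acoustic"] + lm_weight * c["language"])
    return best["text"]


def tune_lm_weight(dev_set, weights):
    """Return the weight with the fewest misrecognized utterances (first on ties)."""

    def error_count(weight):
        return sum(
            decode(utt["candidates"], weight) != utt["reference"] for utt in dev_set
        )

    return min(weights, key=error_count)


# Made-up development set: one utterance needs the language model to win,
# the other is a rare phrase that too much language model weight would break.
DEV_SET = [
    {
        "reference": "recognize speech",
        "candidates": [
            {"text": "recognize speech", "acoustic": -12.0, "language": -4.0},
            {"text": "wreck a nice beach", "acoustic": -11.5, "language": -9.0},
        ],
    },
    {
        "reference": "amivoice api",
        "candidates": [
            {"text": "amivoice api", "acoustic": -10.0, "language": -8.0},
            {"text": "a my voice api", "acoustic": -13.0, "language": -5.5},
        ],
    },
]

if __name__ == "__main__":
    print(tune_lm_weight(DEV_SET, [0.0, 0.5, 1.0, 2.0]))
```

Even in this toy case the weight cannot simply be made as large or as small as possible: both extremes misrecognize one of the two utterances, so the value has to be balanced against the data.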
| | Hybrid type | End-to-End |
|---|---|---|
| Components | Acoustic model, language model, pronunciation dictionary | Single model |
| Use | General-purpose, task-specific | General-purpose |
| Adaptation | Easy | Difficult |
| Training data | Several hundred hours or more | Thousands of hours or more |
| Development of foreign language speech recognition | Difficult | Easy |
| Parameter adjustment | Difficult | Easy |
Is AmiVoice hybrid or end-to-end?
Hybrid and end-to-end speech recognition each have advantages and disadvantages, but ease of adaptation is important when commercializing a product. In particular, speech recognition built for individual customers must produce the results those customers intend. For this reason, Advanced Media's products still primarily use hybrid speech recognition, which is easy to adapt.
We have accumulated knowledge about Japanese in particular over many years, and the hybrid model, which allows detailed parameter adjustment, makes it easier to reflect that knowledge. On the other hand, we are actively working on using end-to-end speech recognition for languages about which we have little knowledge.
Advanced Media's research and development covers both hybrid and end-to-end technologies. By incorporating the model structures and training methods used in end-to-end systems, we have been able to improve the recognition accuracy of hybrid systems, so progress is being made not only in new methods but also in conventional speech recognition systems.
Advanced Media will continue to choose the appropriate technology for each situation and to provide the speech recognition engine it believes is most suitable.
With the AmiVoice API, you can use the latest engine for free for 60 minutes every month. Please give it a try.
About the author
Hayato Shibata
I currently work on research and development in speech recognition.
