Differences and features between hybrid speech recognition and end-to-end speech recognition

Published: 2023.08.30 Last updated: 2025.03.04

Hayato Shibata

Advances in deep learning have dramatically improved the accuracy of speech recognition, and new speech recognition systems built entirely on neural networks have emerged. This article explains the advantages and disadvantages of the new and the traditional approaches.

The traditional "hybrid type" and the recently emerging "end-to-end"

For a long time, the mainstream approach to speech recognition combined three components: an acoustic model, a language model, and a pronunciation dictionary. Because the acoustic model is itself a hybrid of a DNN (Deep Neural Network) and an HMM (Hidden Markov Model), this article calls such a system hybrid speech recognition. The acoustic model calculates an acoustic score for each time frame of the speech, the pronunciation dictionary maps phoneme sequences to words, and the language model calculates a linguistic score for word sequences. The acoustic and linguistic scores are combined with weights, and the word sequence with the highest combined score is output as the recognition result.

Figure: Hybrid speech recognition
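The weighted combination of scores described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Hypothesis class, the function names, the lm_weight parameter, and all score values are hypothetical and are not taken from AmiVoice. Scores are treated as log-probabilities, so higher (less negative) is better.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One candidate word sequence with its two scores (hypothetical structure)."""
    words: List[str]
    acoustic_logprob: float  # log-score assigned by the acoustic model
    language_logprob: float  # log-score assigned by the language model


def combined_score(hyp: Hypothesis, lm_weight: float) -> float:
    # Weighted sum of the acoustic and linguistic scores.
    return hyp.acoustic_logprob + lm_weight * hyp.language_logprob


def best_hypothesis(hyps: List[Hypothesis], lm_weight: float) -> Hypothesis:
    # The candidate with the highest combined score is the recognition result.
    return max(hyps, key=lambda h: combined_score(h, lm_weight))


# Made-up scores: the second candidate fits the audio slightly better,
# but the language model strongly prefers the first.
candidates = [
    Hypothesis(["recognize", "speech"], -120.0, -4.0),
    Hypothesis(["wreck", "a", "nice", "beach"], -118.5, -9.0),
]

print(best_hypothesis(candidates, lm_weight=1.0).words)
print(best_hypothesis(candidates, lm_weight=0.0).words)
```

With the language model weight set to 1.0 the first candidate wins; with the weight set to 0.0 only the acoustic score counts and the second candidate wins. A real decoder searches over a very large space of candidates rather than a short list, but the scoring principle is the same.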

In contrast, end-to-end speech recognition, which recognizes speech with a single neural network, has recently been gaining popularity. An end-to-end system takes speech as input and outputs scores for characters or subwords; the character string (or subword sequence) with the highest score is output as the recognition result.

Figure: End-to-end speech recognition
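As a rough illustration of turning character scores into text, the sketch below picks the highest-scoring symbol at each time step and then merges repeats and drops a blank symbol. The blank symbol and the merging rule follow the CTC convention, which is an assumption here: the article does not say which end-to-end architecture is used. The vocabulary and scores are made up, and picking the best symbol per frame is only a simple approximation of finding the highest-scoring sequence.

```python
BLANK = "_"  # hypothetical "no output" symbol, as used in CTC-style models


def greedy_decode(frame_scores, vocab):
    """Pick the best symbol per frame, then merge repeats and remove blanks."""
    best_per_frame = [
        vocab[max(range(len(vocab)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    output = []
    previous = None
    for symbol in best_per_frame:
        # Emit a symbol only when it changes and is not the blank.
        if symbol != previous and symbol != BLANK:
            output.append(symbol)
        previous = symbol
    return "".join(output)


vocab = [BLANK, "a", "m", "i"]

# One row of scores per time frame, one column per vocabulary entry (made up).
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # "a"
    [0.20, 0.60, 0.10, 0.10],  # "a" again, merged with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "m"
    [0.10, 0.10, 0.10, 0.70],  # "i"
    [0.30, 0.10, 0.10, 0.50],  # "i" again, merged with the previous frame
]

print(greedy_decode(frame_scores, vocab))  # prints "ami"
```

Note that nothing in this process consults a pronunciation dictionary or a separate language model: the network's own scores determine the output.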

For more details on how speech recognition systems work, please refer to our previous article, "A quick explanation of how speech recognition works!"

End-to-end is a simple speech recognition system

The hybrid type is a complex mechanism because it decodes by combining separate modules. Training involves many steps even for the acoustic model alone, such as phoneme clustering, creating phoneme alignments, and training the neural network. The end-to-end system, by contrast, is very simple because it is a single neural network. Training requires no pronunciation dictionary, and a speech recognition system can be built simply by training the neural network on audio and its transcriptions. This makes end-to-end systems easier to develop.

The weakness of end-to-end: adaptation is difficult

Both hybrid and end-to-end methods convert speech to text, so either can be used for general-purpose recognition without problems. For a specific task, however, a general-purpose system often fails to recognize proper nouns or technical terms, or does not produce the intended results, so it must be adapted. This is why many speech recognition services provide adaptation and word registration functions: they are essential in practice.

With hybrid speech recognition, only words included in the pronunciation dictionary can appear in the recognition results, so a word that is missing from the dictionary must be added. Adding a word is easy: just register the word and its pronunciation in the pronunciation dictionary. Beyond adding words, using a language model suited to the task is also important, so collecting text data and adapting the language model can be expected to improve recognition accuracy. Text data is easy to collect, and language model adaptation takes a relatively short time, which makes the hybrid type easy to adapt.

Figure: Example of hybrid speech recognition adaptation
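To make the word registration step concrete, here is a minimal sketch of a pronunciation dictionary as a mapping from words to phoneme sequences. The class, its methods, and the phoneme symbols are all hypothetical; they do not represent the AmiVoice word registration API or its phoneme system.

```python
class PronunciationDictionary:
    """Maps each word to one or more pronunciations (lists of phoneme symbols)."""

    def __init__(self):
        self._entries = {}

    def register(self, word, phonemes):
        # A word may have several pronunciations; avoid storing duplicates.
        pronunciations = self._entries.setdefault(word, [])
        if list(phonemes) not in pronunciations:
            pronunciations.append(list(phonemes))

    def pronunciations(self, word):
        return self._entries.get(word, [])

    def missing_words(self, words):
        # Words not in the dictionary can never appear in hybrid results.
        return [w for w in words if w not in self._entries]


lexicon = PronunciationDictionary()
lexicon.register("speech", ["s", "p", "iy", "ch"])

task_vocabulary = ["speech", "AmiVoice"]
print(lexicon.missing_words(task_vocabulary))  # prints ['AmiVoice']

# Registering the word and its pronunciation makes it recognizable.
lexicon.register("AmiVoice", ["a", "m", "i", "b", "o", "i", "s", "u"])
print(lexicon.missing_words(task_vocabulary))  # prints []
```

The point of the sketch is that adaptation here is a data update, not a retraining step: no audio is needed to add the new word.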

End-to-end systems, on the other hand, have no pronunciation dictionary, so adding words is not easy. Because the system is a single neural network, it basically has to be retrained on audio that contains the words to be recognized, together with the transcriptions. Compared with hybrid systems, which can be adapted using text alone, the need for audio makes data collection much harder. Training a neural network also takes much longer than adapting a language model. These issues are being actively researched, but the resulting techniques have not yet reached practical use, which makes end-to-end a difficult method to adapt.

Differences in how hybrid and end-to-end systems are built

Data Preparation

End-to-end training requires only audio and its transcription, so the only data preparation needed is manual transcription. Hybrid training requires not only transcribing the audio but also creating a pronunciation dictionary. Creating a pronunciation dictionary means first defining a phoneme system and then assigning a pronunciation to every word according to that system. Anyone who can hear the audio can transcribe it, but creating a pronunciation dictionary also requires linguistic knowledge and is extremely time-consuming.

Amount of data required

Generally, end-to-end training requires more data: hybrid systems need hundreds of hours of data, whereas end-to-end systems need thousands of hours. Hybrid systems can be built for a specific task (domain), while end-to-end systems are often developed as general-purpose recognizers because it is difficult to collect large amounts of speech for a specific task.

Parameter adjustment

Hybrid systems recognize speech by combining an acoustic model, a language model, and a pronunciation dictionary, so the optimal combination has to be found. This means manually adjusting many parameters, which takes considerable know-how. End-to-end systems are a single neural network, so fewer parameters need manual adjustment and the system is easier to optimize.
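One example of such a manually adjusted parameter is the weight given to the language model score. A common way to choose it is to try several values on held-out data and keep the one that produces the fewest word errors. The sketch below assumes each utterance already has a short list of candidates with acoustic and language log-scores; the function names, the candidate lists, and the scores are all hypothetical, and real systems tune many more parameters than this one.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pick_best(candidates, lm_weight):
    # Each candidate is (words, acoustic_logprob, language_logprob).
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]


def tune_lm_weight(dev_set, weights):
    """Return (weight, total_errors) for the weight with the fewest errors."""
    best = None
    for weight in weights:
        errors = sum(
            word_errors(reference, pick_best(candidates, weight))
            for reference, candidates in dev_set
        )
        if best is None or errors < best[1]:
            best = (weight, errors)
    return best


# Tiny made-up development set: (reference words, scored candidates).
dev_set = [
    (["recognize", "speech"], [
        (["recognize", "speech"], -120.0, -4.0),
        (["wreck", "a", "nice", "beach"], -118.5, -9.0),
    ]),
    (["call", "amivoice"], [
        (["call", "amivoice"], -80.0, -12.0),
        (["call", "a", "voice"], -86.0, -6.0),
    ]),
]

print(tune_lm_weight(dev_set, [0.0, 0.5, 2.0]))  # prints (0.5, 0)
```

In this toy data, a weight of 0.0 ignores the language model and gets the first utterance wrong, while a weight of 2.0 trusts the language model too much and gets the second one wrong. Finding the balance between the components is the kind of adjustment that takes know-how in a hybrid system.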

Item | Hybrid | End-to-end
Components | Acoustic model, language model, pronunciation dictionary | Single model (neural network)
Use | General purpose, task-specific | General purpose
Adaptation | Easy | Difficult
Training data | Several hundred hours or more | Several thousand hours or more
Development of foreign-language speech recognition | Difficult | Easy
Parameter adjustment | Difficult | Easy

Is AmiVoice hybrid or end-to-end?

Hybrid and end-to-end speech recognition each have advantages and disadvantages, but ease of adaptation matters most when building a commercial product. In particular, speech recognition delivered to individual customers must produce the results each customer intends. For this reason, Advanced Media's products still primarily use hybrid speech recognition, which is easy to adapt.

We have accumulated knowledge about Japanese in particular over many years, and the hybrid model, which allows detailed parameter adjustment, makes it easier to reflect that knowledge. On the other hand, we are actively working on using end-to-end speech recognition for languages where we have little knowledge.

Advanced Media's research and development covers both hybrid and end-to-end technologies. By incorporating the model structures and training methods used in end-to-end systems, we have been able to improve the recognition accuracy of hybrid systems, so progress is being made not only in the new methods but also in conventional speech recognition.

Advanced Media will continue to use the appropriate technology for each situation and provide the speech recognition engine it believes is most suitable. With the AmiVoice API, you can use the latest engine for free for 60 minutes every month. Please give it a try.

About the author

  • Hayato Shibata

    I am currently researching and developing speech recognition.