
How to adjust response time in real-time applications

Published: 2026.05.26 Last updated: 2026.05.25

In November 2025, multiple new features related to speech recognition processing were released for the AmiVoice API, and request parameters to use these features were added.


In a previous article, we introduced "recognitionTimeout" and "noInputTimeout," two of the new parameters, which are intended for voicebots. This time, we will introduce other parameters that are useful in use cases such as real-time applications.
For information on how to specify parameters in speech recognition requests, please refer to the manual.

Time taken for speech recognition

In AmiVoice's speech recognition engine, generally speaking, the more time spent on speech recognition, the higher the recognition accuracy, and the less time spent, the lower the accuracy (although maximizing the recognition time does not yield 100% recognition accuracy). The default settings of the AmiVoice API are configured to balance this trade-off between "recognition processing time" and "recognition accuracy" in most cases, so recognizing a given speech segment usually takes about 0.5 to 1.5 times the length of that segment.
Furthermore, the content of the audio data also affects the time required for recognition processing. Roughly speaking, audio that is easy for humans to hear will be processed relatively quickly, while audio that is difficult for humans to hear, such as audio with low volume or a lot of background noise, tends to take longer to process.
Furthermore, if you try to use a speech recognition engine designed for Japanese to recognize English speech, for example, the engine will try to interpret the spoken words as Japanese, which can result in a very long recognition time. Moreover, in this case, the recognition results are almost useless.
Thus, the time required for speech recognition processing depends on factors such as the desired level of recognition accuracy and the content of the audio data. (In the case of the AmiVoice API, server congestion also plays a role.)

So, what happens if you make the recognition time shorter than the default? Actually, it doesn't necessarily mean the recognition results will become unusable due to a significant drop in accuracy. In fact, sometimes shortening the recognition time slightly does not greatly affect the overall meaning of the recognition result.
Now, let's consider the case where speech recognition is used in guidance robots in commercial facilities and the like.
If a customer asks a question to a robot by speaking, and the robot uses a generative AI to return an answer, the system could follow these steps: "Customer speaks" → "Speech recognition of the speech is performed" → "The speech recognition result is passed to the generative AI to create an answer" → "Answer is sent to the customer."
When interacting with a robot, customers experience waiting time between uttering a phrase and receiving a response. Therefore, to achieve more real-time interaction, it's desirable to minimize the time spent on speech recognition if possible. Furthermore, if post-processing is performed using generative AI, it may suffice to grasp the main idea rather than recognizing every single word with high accuracy.
In situations like this, the features we're introducing today are useful for controlling, to some extent, the length of time spent on speech recognition. There are five parameters: "maxDecodingTime," "maxResponseTime," "maxDecodingRate," "targetResponseTime," and "targetDecodingRate," each of which controls the length of the recognition time in a different way.

Real-time recognition and batch processing: How to choose between "ResponseTime" and "DecodingRate"

Before introducing the five functions, let's first discuss the differences between "maxResponseTime" and "maxDecodingRate," and between "targetResponseTime" and "targetDecodingRate."
Simply put, the 'ResponseTime' series is for systems that perform speech recognition via streaming while recording in real time, and the 'DecodingRate' series is for systems that perform speech recognition on pre-recorded audio data in batch processing. The reverse combination will not work. This is because the relationship between the time required for speech recognition and the length of the audio data differs between real-time speech recognition and batch processing.
The "ResponseTime" parameters control the time spent on recognition processing in terms of "length of audio + arbitrary length (ResponseTime)". In systems that perform real-time recording and speech recognition, speech recognition processing is performed while sending audio to the API in small increments. Therefore, from the start of speech recognition to a certain point, the "time spent on speech recognition processing up to that point" will almost certainly be longer than the "length of audio sent to the API up to that point". Consequently, how much longer the recognition processing time is compared to the length of the audio is important.
On the other hand, the "DecodingRate" settings control the process based on the ratio of "time spent on recognition processing / length of audio" (DecodingRate). In batch processing, audio data is sent all at once. Therefore, if the audio is easy to recognize, the time required for recognition may be shorter than the length of the audio data, making the ratio to the length of the audio important.
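The split between the two series can be written down as a small client-side check. This is only a sketch: the `mode` labels and the `check_parameters` helper are hypothetical names for illustration and are not part of the AmiVoice API. The mode assignments follow this article, including "maxDecodingTime", which is valid in both modes.

```python
# Which parameters apply to which processing mode, as described in this article.
REALTIME_ONLY = {"maxResponseTime", "targetResponseTime"}
BATCH_ONLY = {"maxDecodingRate", "targetDecodingRate"}
BOTH_MODES = {"maxDecodingTime"}


def check_parameters(mode: str, params: dict) -> list[str]:
    """Return the parameter names in `params` that do not apply to `mode`.

    `mode` is a hypothetical label used only in this sketch:
    "realtime" for streaming while recording, "batch" for pre-recorded audio.
    """
    if mode == "realtime":
        allowed = REALTIME_ONLY | BOTH_MODES
    elif mode == "batch":
        allowed = BATCH_ONLY | BOTH_MODES
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    known = REALTIME_ONLY | BATCH_ONLY | BOTH_MODES
    # Only flag the five timing parameters; other request parameters are ignored.
    return sorted(name for name in params if name in known and name not in allowed)


print(check_parameters("realtime", {"targetResponseTime": 500}))  # []
print(check_parameters("batch", {"targetResponseTime": 500}))     # ['targetResponseTime']
```

A check like this could run before a request is built, so that a "ResponseTime" parameter is not silently sent with a batch job where it would have no effect.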

With this in mind, we will now introduce each function and parameter.

Maximum recognition processing time: maxDecodingTime

You can set the maximum time to spend on speech recognition processing. The unit is milliseconds. This time is counted for each speech segment. This setting is valid for both real-time and batch processing.
As shown in the following formula, if the time spent on speech recognition for a given speech segment exceeds the value set in "maxDecodingTime", the speech recognition process for that segment will be forcibly terminated at that point.

$$\text{Total recognition processing time taken so far} > \text{maxDecodingTime}$$

"maxDecodingTime" can be set to be shorter than the length of the speech segment you want to recognize, allowing you to create a very fast response time.
On the other hand, if "maxDecodingTime" is too short relative to the length of the utterance, the recognition process may only be completed up to a certain point in the utterance. For example, if "maxDecodingTime" is too short for the utterance "Advanced Media," the recognition result may only be obtained up to "Advanced."
Please note that in the case of real-time recognition, if "maxDecodingTime" is shorter than the length of the speech segment, recognition will only be possible up to a certain point in the speech segment.
Also, please note that if a single audio data file contains multiple speech segments of significantly different lengths, it is possible that "maxDecodingTime" may be too long for some speech segments but too short for others.
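The cutoff rule and the caveat about mixed segment lengths can be illustrated with a short sketch. The segment durations and the fixed ratio of 0.8 between processing time and audio length are made-up values for illustration. Actual processing time depends on the audio and on server load.

```python
def exceeds_max_decoding_time(decoding_ms: float, max_decoding_time_ms: float) -> bool:
    """Cutoff rule: processing time taken so far > maxDecodingTime."""
    return decoding_ms > max_decoding_time_ms


# Hypothetical speech segments (durations in milliseconds) in one audio file.
segment_durations_ms = [1_000, 5_000, 20_000]
ASSUMED_RT = 0.8  # illustrative assumption: processing takes 0.8x the segment length
max_decoding_time_ms = 3_000

for duration_ms in segment_durations_ms:
    estimated_decoding_ms = duration_ms * ASSUMED_RT
    cut = exceeds_max_decoding_time(estimated_decoding_ms, max_decoding_time_ms)
    print(f"{duration_ms} ms segment: ~{estimated_decoding_ms:.0f} ms needed, cut off: {cut}")
```

Under these assumptions the 1-second segment finishes well within the limit, while the 5-second and 20-second segments would be terminated partway through. This is the situation in which a single fixed "maxDecodingTime" is too long for some segments and too short for others.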

Maximum response time: maxResponseTime

You can set how much additional recognition time, in milliseconds, is allowed beyond the duration of the target audio. This setting applies only to real-time speech recognition.
As shown in the following formula, if the time spent on speech recognition for each utterance exceeds the sum of the length of the audio sent to the API and "maxResponseTime", the speech recognition process for that utterance will be forcibly interrupted at that point.

$$\text{Total recognition processing time} > \text{Input audio duration} + \text{maxResponseTime}$$

With this feature, it is not possible to set the time spent on speech recognition to be less than the length of the speech. Furthermore, the function to forcibly interrupt the speech recognition process will only be activated after the speech recognition server has finished receiving all the audio data for that speech segment.
This feature can be used in applications that record audio in real time and stream it to an API, allowing the time between the end of speech and the receipt of a response to be kept within a specified timeframe (assuming there is no server or network congestion and audio can be transmitted in near real-time).
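As a rough model of the rule described in this section, the sketch below computes the limit and applies the interruption check only once all audio for the segment has been received. The function names and the 7-second utterance are hypothetical. The real decision is made inside the speech recognition server.

```python
def response_time_limit_ms(audio_ms: float, max_response_time_ms: float) -> float:
    """Upper limit on processing time: input audio duration + maxResponseTime."""
    return audio_ms + max_response_time_ms


def is_interrupted(decoding_ms: float, audio_ms: float,
                   max_response_time_ms: float, all_audio_received: bool) -> bool:
    """Model of the interruption rule for one speech segment.

    The interruption only applies once the server has received all audio
    for the speech segment, so `all_audio_received` gates the check.
    """
    if not all_audio_received:
        return False
    return decoding_ms > response_time_limit_ms(audio_ms, max_response_time_ms)


# Hypothetical 7-second utterance with maxResponseTime=500 (milliseconds).
print(response_time_limit_ms(7_000, 500))        # 7500
print(is_interrupted(7_400, 7_000, 500, True))   # False
print(is_interrupted(7_600, 7_000, 500, True))   # True
print(is_interrupted(7_600, 7_000, 500, False))  # False
```

Because the limit is the audio duration plus a non-negative margin, it can never fall below the length of the speech, which matches the restriction noted above.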

Maximum RT: maxDecodingRate

RT is a numerical value that represents the processing speed of speech recognition, and is calculated using the following formula.

$$\text{RT} = \frac{\text{Time taken for recognition processing}}{\text{Time of input audio}}$$

The "maxDecodingRate" setting allows you to specify the maximum value for this RT.
In other words, as shown in the following formula, if the recognition processing time for a speech segment exceeds the audio duration multiplied by "maxDecodingRate", the speech recognition processing for that utterance will be forcibly interrupted at that point. Note that this is only effective for batch processing.

$$\text{Total recognition processing time} > \text{Input audio duration} \times \text{maxDecodingRate}$$

The function to forcibly interrupt speech recognition processing will only be activated after the speech recognition server has finished receiving all the audio data for that speech segment.
With this feature, you can make the time spent on speech recognition processing longer or shorter than the length of the audio data. Also, since the recognition processing time is controlled as a percentage of the audio length, it is easier to achieve a balanced control even when there is a mix of long and short speech segments.
However, please note that if you set the value too small, the recognition process may only be performed up to a certain point in the speech segment.
This feature can be used in applications that perform batch processing using pre-recorded audio data, where it is desirable to keep the speech recognition processing time for each utterance within a certain percentage of the audio length.
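The following sketch shows how the RT formula and the "maxDecodingRate" limit scale with segment length. The segment durations and the rate of 0.5 are arbitrary example values, and the helper names are hypothetical.

```python
def real_time_factor(decoding_ms: float, audio_ms: float) -> float:
    """RT = time taken for recognition processing / time of input audio."""
    if audio_ms <= 0:
        raise ValueError("audio duration must be positive")
    return decoding_ms / audio_ms


def decoding_rate_limit_ms(audio_ms: float, max_decoding_rate: float) -> float:
    """Upper limit on processing time: input audio duration x maxDecodingRate."""
    return audio_ms * max_decoding_rate


# Hypothetical segments; the limit scales with each segment's length.
for audio_ms in (1_000, 5_000, 20_000):
    print(audio_ms, decoding_rate_limit_ms(audio_ms, 0.5))

print(real_time_factor(4_000, 5_000))  # 0.8
```

With a rate of 0.5, the limits are 500 ms, 2,500 ms, and 10,000 ms respectively, so short and long segments each get a budget in proportion to their length rather than one fixed number of milliseconds.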

Target response time: targetResponseTime

This function sets a target for how much longer than the audio data the speech recognition processing may take, and dynamically adjusts the balance between processing speed and recognition accuracy during speech recognition to meet that target. This function is only effective for real-time speech recognition.
First, we calculate the "provisional RT" using the following formula.

$$\text{Provisional RT} = \frac{\text{Recognition processing time taken so far}}{\text{Time of audio input so far}}$$

This provisional RT will be adjusted to conform to the following formula.

$$\text{Provisional RT} \approx \frac{\text{Time of audio input so far} + \text{targetResponseTime}}{\text{Time of audio input so far}}$$

This function is not activated while the "time of audio input so far" is less than one second, or while the provisional RT is less than 1 (that is, while the recognition processing time taken so far is shorter than the duration of the input audio).
"maxResponseTime" is a function that sets an upper limit on the response time and terminates the process if that value is exceeded. Therefore, if the speech segment contains speech that takes a very long time to recognize (for example, if there is a lot of background noise), the upper limit on the response time may be reached midway through the recognition process of that speech segment, and the recognition result for that segment may not be obtained in its entirety. On the other hand, with "targetResponseTime", dynamic adjustments are made during the speech recognition process so that the response time is approximately the target value. Therefore, even if sections that take a long time to recognize are cut short, sections that can be recognized within the target response time are processed with little impact.
Therefore, "targetResponseTime" is a more versatile and user-friendly parameter. Furthermore, when performing speech recognition on the same audio with "maxResponseTime" and "targetResponseTime" set separately but with the same value, the one with "targetResponseTime" set tends to have better overall recognition accuracy.
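To see what the target formula implies as audio accumulates, the sketch below evaluates the value that the provisional RT is steered toward. The function name is hypothetical, and the value 500 assumes that "targetResponseTime" is given in the same unit as the audio duration (milliseconds here); check the manual for the actual unit.

```python
def target_provisional_rt(audio_ms: float, target_response_time_ms: float) -> float:
    """Value the provisional RT is steered toward:
    (audio input so far + targetResponseTime) / audio input so far.
    """
    if audio_ms <= 0:
        raise ValueError("audio duration must be positive")
    return (audio_ms + target_response_time_ms) / audio_ms


# Hypothetical targetResponseTime of 500, assumed to be in milliseconds.
for audio_ms in (1_000, 2_000, 10_000):
    print(audio_ms, target_provisional_rt(audio_ms, 500))
```

For 1, 2, and 10 seconds of input the target RT is 1.5, 1.25, and 1.05. The allowed overrun stays fixed in absolute terms, so the target ratio approaches 1 as the utterance gets longer.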

Target RT: targetDecodingRate

This function sets a target ratio for recognition processing time relative to the audio duration, and dynamically adjusts the balance between processing speed and recognition accuracy during speech recognition to meet that target value. This function is only effective for batch processing.
First, as with the target response time, we calculate the "provisional RT" using the following formula.

$$\text{Provisional RT} = \frac{\text{Recognition processing time taken so far}}{\text{Time of audio input so far}}$$

This provisional RT will be adjusted to conform to the following formula.

$$\text{Provisional RT} \approx \text{targetDecodingRate}$$

This function is not activated while the "time of audio input so far" is less than one second, or while the provisional RT is less than 1 (that is, while the recognition processing time taken so far is shorter than the duration of the input audio).
"maxDecodingRate" is a function that sets an upper limit for RT and terminates the process if that value is exceeded. Therefore, if the speech segment contains speech that takes a very long time to recognize (for example, if there is a lot of background noise), the RT may reach its upper limit during the recognition process of that speech segment, and the recognition result for that speech segment may not be obtained in its entirety. On the other hand, with "targetDecodingRate", dynamic adjustments are made during the speech recognition process so that the provisional RT is close to the target value. Therefore, even if accuracy is reduced to shorten the time for parts that take a long time to recognize, parts that can be recognized by the target RT are processed with little impact.
Therefore, "targetDecodingRate" is a more versatile and user-friendly parameter. Furthermore, when performing speech recognition on the same audio with "maxDecodingRate" and "targetDecodingRate" set separately to the same value, the one with "targetDecodingRate" set tends to have better overall recognition accuracy.
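The activation condition and the comparison against the target can be expressed as a toy check. This is only a model of the conditions stated in this section: `is_over_target` is an assumption about when the engine would need to favor speed, and the actual adjustment logic inside the engine is not described here.

```python
def provisional_rt(decoding_ms: float, audio_ms: float) -> float:
    """Provisional RT = processing time taken so far / audio input so far."""
    return decoding_ms / audio_ms


def adjustment_active(decoding_ms: float, audio_ms: float) -> bool:
    """Activation condition: at least one second of audio has been input
    and the provisional RT is 1 or greater."""
    return audio_ms >= 1_000 and provisional_rt(decoding_ms, audio_ms) >= 1.0


def is_over_target(decoding_ms: float, audio_ms: float,
                   target_decoding_rate: float) -> bool:
    """Toy check (assumption): adjustment is active and the provisional RT
    currently exceeds targetDecodingRate."""
    return (adjustment_active(decoding_ms, audio_ms)
            and provisional_rt(decoding_ms, audio_ms) > target_decoding_rate)


print(adjustment_active(500, 800))        # False: less than one second of audio
print(adjustment_active(900, 2_000))      # False: provisional RT is below 1
print(is_over_target(3_000, 2_000, 1.2))  # True: provisional RT 1.5 exceeds 1.2
```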

In what situations will it be used?

For example, suppose that while speech recognition normally takes 3 seconds, completing the process in 2.5 seconds would not significantly change the recognition result and would not greatly affect the behavior of the client system. In this case, as in the example of the guidance robot mentioned at the beginning, if you want to achieve more real-time interaction, completing speech recognition in 2.5 seconds could be an option.
Thus, in situations where achieving highly accurate transcription isn't necessarily important, and simply grasping the main idea of the speech is sufficient, or where the goal is to obtain recognition results quickly and move on to the next step in the system, the parameters introduced here come into play.

Furthermore, the parameters introduced here also enable the system to return results without taking too long if the input audio is such that speech recognition would take too long.
Audio that takes a long time to process for speech recognition is often audio that is inherently difficult to recognize accurately due to factors such as excessive background noise, very low volume of the utterance to be recognized, poor sound quality, or the language not supported by the speech recognition engine. Therefore, even if a long time is spent on speech recognition, the recognition results are often not very good.
Spending too much time recognizing this type of speech is largely pointless, so using parameters to abandon speech recognition early can prevent the system from becoming sluggish.

Bonus: Examples of audio that slows down speech recognition processing

What kind of audio slows down speech recognition processing? What happens when you use parameters with that kind of audio?
To close, here is an example.
We have prepared a version of the sample audio file (test.wav) used in the quick start section of the manual with the volume reduced by 20dB. This is an audio recording of the utterance: "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。(Advanced Media aims to realize natural communication between people and machines and create a prosperous future.)"
This audio is streamed to the API in small increments, mimicking a real-time recording client system, for speech recognition. If the volume is too low, speech recognition will not work correctly, but in this case, the default settings were just barely sufficient for speech recognition.
Let's try setting "maxResponseTime=10" here. This is 10 milliseconds, or 0.01 seconds, so the speech recognition process will only be performed for almost the same amount of time as the length of the audio data.
The results were as follows:
"アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していく。(Advanced Media will create a prosperous future by realizing natural communication between people and machines.)"
Comparing it to the correct utterance, it's clear that the final part, "ことを目指します(We aim to do ...)," was not processed by speech recognition.
On the other hand, when the original sample audio was used and the same "maxResponseTime=10" setting was applied, speech recognition was completed successfully. In other words, reducing the volume resulted in a longer time required for speech recognition.
The data in this example was at a volume that was just barely sufficient for correct speech recognition, but if the volume is lowered further, even the default settings will no longer allow for correct speech recognition. If the processing takes a long time despite this, it's a waste of time. In such cases, using the parameters introduced here can help complete the process without wasting time.

Person who wrote this article

  • Covered in tea

    A person drinking tea while contemplating the potential of using voice for communication even when their hands are full.

     