
AmiVoice API Update Explanation: New Parameters for Voicebots Reduce Response Wait Times

Published: 2026.03.27 Last updated: 2026.03.26

In November 2025, multiple new features related to speech recognition processing were released for the AmiVoice API, and request parameters for using them were added.
While the features and parameters are also described in the manual, since the new features are somewhat advanced, we believe there may be some people who are unclear about how to use them or what they are used for.
This article introduces "recognitionTimeout" and "noInputTimeout" among the new features and parameters released at this time, which are particularly useful in usage scenarios such as voicebots. For information on how to specify parameters in speech recognition requests, please refer to the manual.

To put it simply

The AmiVoice API typically performs speech recognition on the entire transmitted audio data (assuming no errors occur). Therefore, the longer the transmitted audio data, the longer the session will take to complete.
If you only need the recognition result of the first meaningful utterance, for example a customer's answer to a question, you can have the session end as soon as that result is obtained, which shortens the wait for a response. In such cases, what you use is "recognitionTimeout", the "recognition completion timeout".

Furthermore, if the system waits for voice input or other responses but receives no response for a while, you might want to stop waiting and move on to the next step. In such cases, you would use "noInputTimeout", which is a "speech start waiting timeout".

  • recognitionTimeout (Synchronous HTTP / Asynchronous HTTP / WebSocket): use this when you want to perform speech recognition only on the first valid utterance to increase response speed
  • noInputTimeout (WebSocket): use this when you want to set an upper limit on the time to wait for speech to begin
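As a rough sketch of how these parameters might be assembled into a request, the helper below builds a space-separated "key=value" parameter string. The exact request format, endpoint, and grammar names should be taken from the manual; `grammarFileNames=-a-general` and the string layout here are assumptions for illustration.

```python
def build_params(grammar: str = "-a-general",
                 recognition_timeout_ms: int = 0,
                 no_input_timeout_ms: int = 0) -> str:
    """Build a space-separated 'key=value' request-parameter string.

    A value of 0 leaves the corresponding timeout disabled (the
    documented default), so it is simply omitted from the string.
    """
    parts = [f"grammarFileNames={grammar}"]
    if recognition_timeout_ms:
        parts.append(f"recognitionTimeout={recognition_timeout_ms}")
    if no_input_timeout_ms:
        parts.append(f"noInputTimeout={no_input_timeout_ms}")
    return " ".join(parts)

print(build_params(recognition_timeout_ms=3000))
# grammarFileNames=-a-general recognitionTimeout=3000
```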

Recognition completion timeout: recognitionTimeout

For example, consider a voicebot that uses speech recognition for telephone customer support in a question-and-answer format. Only the customer's spoken answer to each question is needed, and only the recognition result of that answer is used in subsequent system processing. In such cases, speech recognition may not be needed for the audio data that follows the answer (such as silence, noise, or utterances unrelated to the answer).

Normally, speech recognition processing is performed on the entire transmitted audio data. Therefore, even after the speech recognition processing of the customer's response is complete, the remaining audio is still processed, which may cause a slight delay before the session for this speech recognition request is completed.

This is where the recognition completion timeout function comes in handy.
The feature can be enabled by specifying a non-zero number for the "recognitionTimeout" request parameter. This number represents the timeout period (in milliseconds) before the feature activates.
When this feature is enabled, speech recognition processing behaves in one of two main ways.

  • If speech recognition processing is successful for one speech segment before the timeout and recognition results (transcript data) are obtained, speech recognition processing will not be performed on the remaining audio data, and the session will be terminated.
  • If the speech recognition process is not completed within the time limit, the process is interrupted, the recognition results obtained up to that point are returned to the client, and the session is terminated.

This feature allows you to move on to the next step without waiting for the recognition process of any unnecessary audio data following the customer's response to finish, or, even if the recognition process takes a long time, to complete the process within a reasonable time and move on to the next step sooner.

"recognitionTimeout" is available for Synchronous HTTP, Asynchronous HTTP, and WebSocket interfaces. Its detailed behavior is as follows:

  • For Synchronous and Asynchronous HTTP interfaces, speech recognition processing is attempted for the duration specified in the 'recognitionTimeout' parameter (in milliseconds) after the speech recognition server receives the audio data. For WebSocket interfaces, speech recognition processing is attempted for the duration specified in the 'recognitionTimeout' parameter (in milliseconds) after the beginning of the first utterance is detected.
  • If speech recognition is completed and a recognition result is obtained for a given utterance during this time, the speech recognition process will not be performed any further, even if it is within the "recognitionTimeout" period.
  • If a speech segment is detected and speech recognition processing is performed, but no recognition result is obtained, such as when everything is recognized as filler words or noise and rejected, the speech recognition processing continues for the subsequent audio.
  • If a speech segment is detected and speech recognition processing begins, but the processing is not completed within the "recognitionTimeout" period, the speech recognition process will be interrupted, and the recognition results up to that point will be returned.
  • Even if a speech segment is detected and speech recognition processing is performed, if all of the processed speech segments are rejected within the "recognitionTimeout" period, no recognition result will be obtained.

For example, suppose a system that handles customer inquiries over the phone asks for the customer number, and the customer says the following:
Speech segment 1: えーっと (a filler, "umm")
Speech segment 2: ○×◇番 ("number ○×◇")
Speech segment 3: ……これであってるかな ("...I wonder if that's right")

If, at this point, the entire speech segment 1 is recognized as filler and rejected, no recognition result is obtained. Therefore, speech recognition processing of speech segment 2 continues. Next, if a recognition result for speech segment 2 is obtained before the timeout, speech recognition processing of speech segment 3 is not performed. If the recognition processing of speech segment 2 is not completed by the timeout, the process is interrupted, and the interim recognition results obtained up to that point are sent to the client.
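The worked example above can be sketched as a small simulation. This is an illustration of the documented behavior, not AmiVoice code: the `Segment` fields and timings are hypothetical inputs, and the real API also returns interim results of an interrupted segment, which this sketch simplifies away.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    finishes_at_ms: int    # when recognition of this segment would complete
    rejected: bool = False  # e.g. recognized entirely as filler or noise

def recognize_with_timeout(segments, recognition_timeout_ms):
    """Illustrate recognitionTimeout: returns (results, interrupted).

    - A rejected segment yields no result, so processing moves on.
    - The first segment that yields a result ends the session.
    - If the deadline passes mid-segment, processing is interrupted and
      only the results obtained so far are returned in this sketch.
    """
    results = []
    for seg in segments:
        if seg.finishes_at_ms > recognition_timeout_ms:
            return results, True      # interrupted by the timeout
        if seg.rejected:
            continue                  # no result; keep processing audio
        results.append(seg.text)
        return results, False         # first real result ends the session
    return results, False

segments = [
    Segment("えーっと", 800, rejected=True),    # filler: rejected
    Segment("○×◇番", 1500),                    # the customer's answer
    Segment("……これであってるかな", 2500),      # unrelated follow-up
]
print(recognize_with_timeout(segments, 3000))
# (['○×◇番'], False)
```

With a 3000 ms timeout, segment 1 is rejected, segment 2 yields the result, and segment 3 is never processed, matching the walkthrough above.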

(The original article illustrates this function with a timing diagram; the figure is not reproduced here.)

"recognitionTimeout" specifies a time in milliseconds. The default value is 0, which disables the recognition completion timeout feature.
Also, please note that the length of "recognitionTimeout" represents the time taken for the recognition process and is different from the length of the audio data.

If, when this function is enabled, the speech segment is interrupted in the middle of an utterance and you cannot obtain the complete recognition result you need, please try adjusting the parameters related to speech segment detection in the segmenterProperties of the request parameters, particularly the postTime.
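For instance, the request-parameter string might carry a longer postTime inside segmenterProperties. The quoting style and the value 800 here are assumptions for illustration; check the manual for the exact syntax and for sensible values.

```python
# Hypothetical example: verify the exact segmenterProperties syntax and
# suitable postTime values (in milliseconds) in the AmiVoice API manual.
params = ('grammarFileNames=-a-general '
          'segmenterProperties="postTime=800"')
print(params)
```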

Timeout while waiting for speech to begin: noInputTimeout

For example, in a question-and-answer voicebot, there might be cases where you want to skip a question and move on to the next one if the customer doesn't respond.

Normally, the session continues as long as audio keeps arriving, even if no speech segment is detected in it. In the case of the WebSocket interface, the connection is closed if no speech segment is detected for 600 seconds (see "Limitations" in the manual).
The speech start timeout function is useful if you want to end the process before waiting 600 seconds.
The feature is enabled by specifying a non-zero number for the request parameter "noInputTimeout". The number specified here will also determine the timeout period (in milliseconds) before this feature takes effect.

"noInputTimeout" is only available in the WebSocket interface, and its behavior is as follows:

  • If the speech recognition server does not detect a speech segment within the time period (in milliseconds) specified by the "noInputTimeout" parameter after receiving the audio, an error is returned, and the session ends without speech recognition processing.
  • If the beginning of the first utterance is detected within the "noInputTimeout" period, the session will not be interrupted even after the "noInputTimeout" period has elapsed, and speech recognition processing will continue.

Furthermore, "noInputTimeout" can be used in conjunction with "recognitionTimeout". In this case, if the beginning of the first utterance is detected within the "noInputTimeout" time, the "recognitionTimeout" count will start from that point, allowing for a two-stage timeout system.
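A two-stage setup like this might be composed as follows. The "s &lt;format&gt; &lt;grammar&gt; key=value ..." command shape, the `16K` format token, and the grammar name are assumptions based on my reading of the manual and should be verified there.

```python
def start_command(audio_format: str, grammar: str,
                  no_input_timeout_ms: int,
                  recognition_timeout_ms: int) -> str:
    """Compose a WebSocket start command combining both timeouts.

    With both set, the session first waits up to noInputTimeout for
    speech to begin; once the first utterance starts, the
    recognitionTimeout clock starts from that point (two-stage timeout).
    The command layout here is an assumption; confirm it in the manual.
    """
    return (f"s {audio_format} {grammar} "
            f"noInputTimeout={no_input_timeout_ms} "
            f"recognitionTimeout={recognition_timeout_ms}")

print(start_command("16K", "-a-general", 5000, 3000))
# s 16K -a-general noInputTimeout=5000 recognitionTimeout=3000
```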

"noInputTimeout" specifies a time in milliseconds. The default value is 0, which disables the utterance start timeout feature.

Also, please note that, similar to "recognitionTimeout," the duration of "noInputTimeout" represents the time taken for the recognition process and is different from the duration of the audio data.

Background: how recognitionTimeout came to be developed

Hopefully, you now have a better understanding of the new features we've introduced.
The "recognitionTimeout" recognition completion timeout function is a bit complicated, but in fact, it was developed based on a real-life problem experienced by one of our users.

In that user's client system, input was given by voice in a conversational format, but there were cases where the system returned an error even though the user had spoken. The system had a mechanism to terminate the session if no speech recognition results were returned from the API within a certain period of time, and investigation showed that this timeout had been triggered in those cases.
When we examined the audio sent to the API at that time, we found that after the utterance used for input, there was also audio of the speaker having an unrelated conversation with other people. In other words, the addition of audio unrelated to the input increased the overall length of the audio data, causing the speech recognition to take longer than the system expected, resulting in a timeout.

So, what measures should be taken in response to such cases?
For example, one could adjust the client system's end-of-talk detection. However, if noise is detected as speech, or if unrelated speech begins immediately, the end-of-talk detection may not work properly. Conversely, there is a higher risk of the system incorrectly determining that the speech has ended even though the input speech is still in progress.
Of course, it is not realistic to ask end users not to say anything unnecessary immediately after giving their spoken input.

Therefore, "recognitionTimeout" was developed so that even if unnecessary audio is sent, the AmiVoice API can discard anything other than what is deemed necessary.
In systems where a machine interacts with a human, it is generally safe to assume that the speaker first utters a response to the system, and that any irrelevant utterances come afterward. The feature therefore treats the first speech segment that yields a recognition result as the utterance the system needs, and ignores subsequent utterances.
If a filler word such as "えーっと" is inserted before the answer is spoken, the entire speech segment that is recognized as a filler will be rejected, and no recognition result will be obtained. The recognition process will then continue to the next speech segment, so only the initial utterance excluding the filler will be obtained as the recognition result.
Furthermore, if an unrelated utterance immediately follows a response utterance, and the two become a single speech segment, causing the recognition process to take longer, the system will use a timeout function to interrupt the process even if recognition is still in progress, and return the intermediate results up to that point. Therefore, the returned recognition result may contain the necessary utterance portion.

If the client system simply sets a timeout, no recognition results will be obtained once the timeout is triggered. However, if the control is handled on the API side, any intermediate results that were processed before the timeout can be returned, which is an advantage.

Bonus: Recommended engines when using the AmiVoice API for voicebots

The two new features we've introduced today are expected to be particularly useful in use cases such as voicebots.
While we're on the topic, let me also mention speech recognition engines that are suitable for voicebots: in short, End to End engines are recommended.
End to End engines tend to be more accurate than Hybrid engines, especially for numbers, letters, and other short utterances. For example, if you want to recognize customer numbers, birth dates, or other questions that can be answered with a single word, an End to End engine is a strong choice.

Recently, the End to End engine has also become capable of using "keyword boosting", a feature similar to "word registration" in Hybrid engines (reference tech blog), so we hope you will give it a try.

About the author

  • Covered in tea

    A person drinking tea while contemplating the potential of using voice for communication even when their hands are full.
