"Speech segment ratio" as seen in operational data
Hello everyone!
I am an engineer responsible for SRE (Site Reliability Engineering) of the AmiVoice API.
One of the key features of the AmiVoice API is its "pay-only-for-speech" model. This means that you only pay for the portion of the audio you send to the API that is identified as a speech segment. While this is good for keeping costs down, it presents a challenge in that the actual cost varies considerably depending on the application and implementation, making cost estimation difficult. Ideally, it would be best to calculate the cost using actual operational data for each application or service, but there are many cases where such data is unavailable.
Therefore, to provide a reference point for preliminary estimates and design, we investigated the distribution of "speech segment ratios" and their typical values using actual operational data from the AmiVoice API.
Please refer to the following article for information on speech segments.
What is speech segment detection?
What is the speech segment ratio?
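In this article, the "speech segment ratio" refers to the proportion of the audio sent to the API that is detected as speech segments, that is, the total duration of the speech segments divided by the duration of the audio sent.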
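Read against the billing model described above, this ratio can be computed per session as in the minimal sketch below. It assumes that speech segments are available as (start, end) pairs in seconds; the function and parameter names are illustrative and are not part of the AmiVoice API.

```python
def speech_segment_ratio(segments, audio_seconds):
    """Return the fraction of the sent audio that was detected as speech.

    segments: iterable of (start, end) times in seconds (assumed non-overlapping).
    audio_seconds: total duration of the audio sent in the session.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    speech_seconds = sum(end - start for start, end in segments)
    return speech_seconds / audio_seconds


# Hypothetical session: 60 s of audio with two speech segments (20 s + 25 s).
ratio = speech_segment_ratio([(2.0, 22.0), (30.0, 55.0)], 60.0)
print(f"{ratio:.0%}")  # 75%
```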
For this analysis, we used data from December 2025, covering the top 200 users of each of the AmiVoice API's two most frequently used engines: the 8k engine and the 16k engine.
About the 8k engine and the 16k engine
Before looking at the analysis results, let's briefly summarize the differences between the 8k engine and the 16k engine.
The AmiVoice API selects an engine based on factors such as the audio sampling rate.
- 8k engine: Primarily used for narrowband audio such as telephone speech. This engine is often used in applications such as call centers.
- 16k engine: The standard engine for wideband audio. It is used in a variety of scenarios other than call centers, such as conference audio, microphone input, face-to-face conversations, and voice input in factories.
Since the two are used in different contexts, differences are expected to appear in the speech segment ratio. Therefore, we will examine the trends for 8k and 16k separately.
Survey 1: Distribution of Speech Segment Ratio
First, to see the overall trend, we computed the average speech segment ratio for each user and examined how those averages are distributed.
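The aggregation described above can be sketched as follows. This is an illustration with made-up records, not the actual analysis code. It assumes each session record carries a user ID, the seconds of audio sent, and the seconds detected as speech, and it takes the per-user average as total speech time over total audio time (a mean of per-session ratios would be another reasonable reading).

```python
from collections import Counter, defaultdict


def per_user_ratio(sessions):
    """Map each user to total speech seconds / total audio seconds."""
    audio = defaultdict(float)
    speech = defaultdict(float)
    for user, audio_sec, speech_sec in sessions:
        audio[user] += audio_sec
        speech[user] += speech_sec
    return {u: speech[u] / audio[u] for u in audio if audio[u] > 0}


def histogram(ratios, bin_width=10):
    """Count users per percentage bin, keyed by the bin's lower edge."""
    counts = Counter()
    for r in ratios:
        pct = r * 100
        # Put a ratio of exactly 100% into the top bin.
        lower = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical (user, audio_seconds, speech_seconds) records.
sessions = [
    ("u1", 600, 310), ("u1", 400, 190),
    ("u2", 120, 110),
    ("u3", 900, 450),
]
ratios = per_user_ratio(sessions)
print(histogram(ratios.values()))  # {50: 2, 90: 1}
```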
8k engine

16k engine

The 8k engine shows a distinctive pattern: the speech segment ratio is concentrated around 50%, whereas for the 16k engine it is spread much more widely.
In call centers, where the 8k engine is used, speech recognition mainly serves to transcribe conversations between operators and customers. To improve accuracy, the AmiVoice API recommends sending the operator and customer audio as separate sessions rather than mixing them into a single session. As a result, while one channel is speaking, the other tends to be silent, which is thought to be why the speech segment ratio tends to be around 50%.
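The arithmetic behind the 50% figure can be illustrated with a small sketch. The call duration and talk times are invented for illustration, and the sketch assumes that each channel is sent as its own session covering the whole call.

```python
def two_channel_ratio(call_seconds, operator_speech, customer_speech):
    """Speech segment ratio over both single-channel sessions of one call.

    Each channel is sent as its own session covering the whole call,
    so the audio sent is twice the call duration.
    """
    audio_sent = 2 * call_seconds
    return (operator_speech + customer_speech) / audio_sent


# Hypothetical 10-minute call: the speakers take turns with no pauses or overlap.
print(two_channel_ratio(600, 330, 270))  # 0.5
# Same call with 60 s in which nobody speaks.
print(two_channel_ratio(600, 300, 240))  # 0.45
```

With strict turn-taking the combined ratio is exactly 50%; pauses lower it and overlapping speech raises it.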
Additionally, there are a few cases where the speech segment ratio exceeds 80% in the 8k engine chart; these may be cases where only speech is being transmitted, such as with voicebots.
The 16k engine chart shows that while some users have a low speech segment ratio of 10-50%, many users fall within the 70-90% range. A low ratio means that much of the transmitted audio contains no speech; this may include cases where users send long recordings without precisely controlling when recording starts and stops.
We will look into the 16k engine in more detail.
Survey 2: Detailed Analysis of the 16k Engine
For the 16k engine, we found that the speech segment ratio is widely distributed. We therefore examined the per-session distribution of the speech segment ratio for each user and found two main patterns.
Pattern A
The most common distribution is as follows. Below is an example from a single user. Unlike the previous section, the vertical axis represents the number of sessions.

A very large number of sessions have a speech segment ratio exceeding 97%. Furthermore, there is a tendency for the number of sessions to increase as the speech segment ratio rises.
This distribution suggests that the recorded audio is not sent as is, but that some degree of segmentation or silence removal is performed on the client side. For example, a smartphone app might record only while the voice button is pressed. At first glance, such an implementation might be expected to yield a speech segment ratio above 95%, but the actual statistics include a certain number of sessions with a low ratio. As a result, when looking at the entire user base, the ratio often falls within the 85-90% range, and it is realistic to use this range as a guideline for estimates.
Pattern B
The other type observed was a distribution without the clear skew seen in Pattern A. The following is an example from one user.

For this user, the number of sessions still tends to increase as the speech segment ratio rises, but sessions above 90% are fewer than in Pattern A. In particular, unlike Pattern A, the sessions are not strongly concentrated around 97%, and there is a fair amount of audio with a low speech segment ratio.
The average speech segment ratio for this type of distribution varied considerably. This pattern is likely closer to cases where recorded audio is sent more or less as is, for example when audio is recorded continuously with a stationary microphone in a conference room and then processed. In such cases, the results vary greatly depending on the content of the conversation, the recording environment, and how the application is used. Based on the per-user distribution of speech segment ratios seen in Survey 1, it seems reasonable to use around 70% as a guideline.
Summary
To answer the question, "How much of the audio is subject to billing for speech recognition?", we investigated the distribution of speech segment ratios using operational data from the AmiVoice API. This ratio varies greatly depending on the type of service. Please refer to the following guidelines based on the results of this investigation.
| Usage pattern | Guideline speech segment ratio | Notes |
|---|---|---|
| When the operator and customer audio of call center conversations are sent as separate sessions | Around 55% | Common use case for the 8k engine |
| When the client performs segmentation and silence removal, and sends speech-centered audio | 85-90% | It is less likely to exceed 95% than you might imagine. |
| Other general applications | Around 70% | There is considerable variation depending on the application and implementation. In particular, if the client sends audio continuously regardless of whether anyone is speaking, the actual cost may differ significantly from the estimate. |
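As a rough illustration of how these guidelines feed into an estimate, the sketch below multiplies the expected audio duration by the guideline ratio. The dictionary keys, the function names, and the unit price parameter are placeholders chosen for this example; substitute the actual rate from your plan.

```python
# Guideline ratios from the table above (midpoint used for the 85-90% range).
GUIDELINE_RATIO = {
    "call_center_separate_channels": 0.55,
    "client_side_segmentation": 0.875,
    "general": 0.70,
}


def estimate_billable_seconds(audio_seconds, usage_pattern):
    """Estimate the billable speech time for a given amount of sent audio."""
    return audio_seconds * GUIDELINE_RATIO[usage_pattern]


def estimate_cost(audio_seconds, usage_pattern, price_per_second):
    """Estimate the cost; price_per_second is a placeholder for your plan's rate."""
    return estimate_billable_seconds(audio_seconds, usage_pattern) * price_per_second


# Hypothetical workload: 1,000 hours of audio sent per month, general application.
audio_seconds = 1000 * 3600
billable_hours = estimate_billable_seconds(audio_seconds, "general") / 3600
print(f"{billable_hours:.0f} billable hours")  # 700 billable hours
```

Because the ratio for general applications varies widely, it is worth repeating the estimate with a lower and a higher ratio to see the range of plausible costs.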
Note that the AmiVoice API automatically detects speech segments on the server side and charges only for those segments. There is therefore no need to implement voice activity detection (VAD) on the client side solely to reduce costs. Client-side VAD is, however, effective for purposes such as improving usability and reducing data traffic, so consider it depending on your purpose.
Going forward, I hope to share the trends and insights we've gained through operating the AmiVoice API in a way that will be useful for decision-making regarding implementation and design.
Person who wrote this article
Yu Yu
I was drawn to the field of voice information processing and joined Advanced Media a few years ago. Currently, I'm in charge of the backend for the AmiVoice API while also promoting the use of AI within the company.
