[We Tested It!] How does speech recognition accuracy change with sampling rate and compression rate?

"Won't the accuracy of speech recognition decrease unless the data is of high quality?"
"I want to reduce the file size, but how much compression is okay?"
Have you ever had any of these questions when using a speech recognition service?
Considering storage capacity and communication bandwidth, we want to keep data size small, but we don't want to sacrifice accuracy... To solve this problem, we actually tested the effect that sampling rate and compression rate have on speech recognition accuracy in the AmiVoice API.
If you want to know the optimal format for audio data, or if you're struggling to balance data size and recognition accuracy, we hope you'll find this article useful.
- Verification flow
- Test result 1: Comparison by sampling rate
- Test result 2: Comparison by compression ratio
- Summary
- Appendix
Verification flow
We verified speech recognition accuracy from the following two perspectives.
- Differences in speech recognition accuracy due to sampling rate
- Differences in speech recognition accuracy due to compression ratio
STEP 1: Prepare validation data
The data used for the verification is as follows:
We created the audio with diversity in mind to avoid a dataset that is biased towards specific speakers or content.
- A total of 10 patterns of speech content, including daily reports, news, meetings, readings, and instructions for voice input
- The speakers were two men and three women, ranging in age from their 20s to their 40s.
- To eliminate the influence of noise, all audio data was recorded in a quiet, noise-free environment.
| No. | Genre | Overview | Character count | Audio length |
|---|---|---|---|---|
| 1 | Daily Report | Lunch menu details | 165 characters | 27 seconds |
| 2 | News | Weather News | 159 characters | 29 seconds |
| 3 | Command | Voice command | 168 characters | 53 seconds |
| 4 | Company Profile | Company profile of Advanced Media, Inc. | 416 characters | 1 minute 11 seconds |
| 5 | Parliament | City Council recordings | 477 characters | 1 minute 19 seconds |
| 6 | Company Profile | Our Advanced Media, Inc. Outlook | 563 characters | 1 minute 46 seconds |
| 7 | Literature | Story reading | 479 characters | 1 minute 47 seconds |
| 8 | News | Weather News | 512 characters | 1 minute 50 seconds |
| 9 | Parliament | Recording data of the Tokyo Metropolitan Assembly | 762 characters | 2 minutes 19 seconds |
| 10 | Literature | Prose reading | 827 characters | 2 minutes 45 seconds |
STEP 2: Speech recognition using AmiVoice API
Using curl, we sent the audio data at each sampling rate and compression rate to our AmiVoice API for speech recognition.
curl https://acp-api.amivoice.com/v1/recognize \
-F d=-a-general \
-F u={APP_KEY} \
-F a=@filename.wav | jq .text >> out.txt
{APP_KEY} is the APPKEY required to use the AmiVoice API. The API returns the speech recognition result as JSON, so we extract the text field with the jq command and append it to the out.txt file.
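For readers who prefer to post-process the response in Python rather than with jq, the extraction step could look roughly like the sketch below. The sample response body is a hypothetical, heavily trimmed stand-in that contains only the text field used in this article; a real response contains more fields.

```python
import json


def extract_text(response_body: str) -> str:
    """Return the top-level "text" field of a JSON response body."""
    return json.loads(response_body)["text"]


# Hypothetical, trimmed response body used only for illustration.
sample = '{"text": "hello world"}'

recognized = extract_text(sample)
print(recognized)
```

One detail worth checking in your own pipeline: `jq .text` prints the value as a JSON string, including the surrounding double quotes, whereas `jq -r .text` prints the raw text. Stray quotes in the result file would be counted as character errors when scoring.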
STEP 3: Calculating the results
We use a Python library called jiwer to measure speech recognition accuracy by comparing the recognition results for each data set against the correct answer data. For more information on how to use "jiwer", please refer to this article.
We measured the average speech recognition accuracy across the 10 speech samples with the following code.
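import jiwer
# Correct answer data and recognition results, each read as a whole file
refs = open("answer.txt", encoding="utf_8").read()
answers = open("outPCM48K.txt", encoding="utf_8").read()
# Character-level comparison; output.cer is the character error rate
output = jiwer.process_characters(refs, answers)
ave_accuracy = (1 - output.cer) * 100
print("-----------------Result-----------------")
print(f"Average speech recognition accuracy: {ave_accuracy}")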
Test result 1: Comparison by sampling rate
The sampling rate is the number of audio samples captured per second and is one measure of sound quality. For example, a sampling rate of 8 kHz is used for telephones and 44.1 kHz for CDs. The higher the sampling rate, the higher the frequencies that can be represented, so a high sampling rate is important for music.
On the other hand, the frequency range of human speech is limited, so general speech recognition applications do not need such a high sampling rate.
The optimal sampling rate for the AmiVoice API is 16kHz. When audio with a sampling rate higher than 16kHz is input, it is internally converted (downsampled) to sound quality equivalent to a sampling rate of 16kHz and then processed for speech recognition. Therefore, there is almost no difference in accuracy when recognizing data with a sampling rate of 48kHz or 16kHz.
Now, let's actually use data with sampling rates of 48 kHz and 16 kHz to verify whether there is any difference in speech recognition accuracy.
Targets for comparison
We compared audio data with sampling rates of 48kHz and 16kHz. The details of the audio data are as follows:
- Format: Uncompressed Wave data
- Bit depth: 16-bit
- Channel: Mono
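The bit rate of uncompressed audio follows directly from these three parameters: sampling rate × bit depth × number of channels. A minimal sketch of that arithmetic (the function name is our own, chosen for illustration):

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


# 16-bit mono audio at the two sampling rates compared in this article
for rate in (48_000, 16_000):
    print(f"{rate} Hz: {pcm_bitrate_kbps(rate, 16, 1):.0f} kbps")
```

This gives 768 kbps for 48 kHz and 256 kbps for 16 kHz, which are the figures used in the comparison.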
Comparison result
The bit rate (kbps) and speech recognition accuracy for each sampling rate are shown in the table below.
| | Sampling rate 48kHz | Sampling rate 16kHz |
|---|---|---|
| Bit rate | 768kbps | 256kbps |
| Average speech recognition accuracy (by micro-averaging) | 98.0% | 98.0% |
*kbps: Kilobits per second; the amount of data used per second of audio
*Micro-average method: A method of calculating the average error rate by adding up the number of errors per character in all data sets and dividing by the total number of characters.
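To make the micro-average concrete, here is a dependency-free sketch that contrasts it with a per-file (macro) average. It uses a plain Levenshtein distance as the error count and toy strings as data; it is an illustration under those assumptions, not jiwer's implementation or the script used for the results above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def micro_average_accuracy(pairs):
    """Sum errors over all files, then divide by the total reference length."""
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return (1 - total_errors / total_chars) * 100


def macro_average_accuracy(pairs):
    """Average the per-file accuracies, so every file weighs the same."""
    per_file = [(1 - edit_distance(ref, hyp) / len(ref)) * 100 for ref, hyp in pairs]
    return sum(per_file) / len(per_file)


# Toy (reference, recognition result) pairs, not real data
pairs = [("abcd", "abcd"), ("abcdefghij", "abcdefghxx")]
print(f"micro: {micro_average_accuracy(pairs):.1f}%")
print(f"macro: {macro_average_accuracy(pairs):.1f}%")
```

With micro-averaging, longer recordings contribute more characters and therefore weigh more heavily in the final figure, which matters here because the samples range from 159 to 827 characters.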
With the AmiVoice API, we have confirmed that increasing the sampling rate above 16kHz does not affect the accuracy of speech recognition.
Please note that the AmiVoice API speech recognition engine is updated daily and the amount of computation is adjusted slightly depending on the load on the speech recognition server, so sending the same voice data does not guarantee the same results. When conducting comparative testing, the accuracy figures may therefore vary slightly between runs.
Test result 2: Comparison by compression ratio
When designing an application, audio compression is extremely important for reducing communication data volume and storage capacity. However, excessive compression can degrade sound quality and affect recognition accuracy. So, how much compression can be applied while maintaining practical accuracy? We used the Opus codec to compare recognition accuracy at different compression rates (bit rates).
Targets for comparison
Using compressed audio data in Opus format, we compared six bit rates ranging from 6 kbps to 256 kbps.
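The article does not state which tool was used to produce the Opus files. One common option is ffmpeg with the libopus encoder, and the sketch below only builds the command lines for the six bit rates; the input file name and the output naming scheme are hypothetical.

```python
from pathlib import Path

# Bit rates compared in this article (kbps)
OPUS_BITRATES_KBPS = [256, 128, 64, 32, 16, 6]


def opus_encode_command(src: str, kbps: int) -> list[str]:
    """Build an ffmpeg command that encodes `src` to mono Opus at `kbps` kbit/s."""
    dst = f"{Path(src).stem}_opus{kbps}k.opus"  # hypothetical naming scheme
    return ["ffmpeg", "-y", "-i", src, "-ac", "1",
            "-c:a", "libopus", "-b:a", f"{kbps}k", dst]


# "sample01_16k.wav" is a placeholder input file name
commands = [opus_encode_command("sample01_16k.wav", k) for k in OPUS_BITRATES_KBPS]
for cmd in commands:
    print(" ".join(cmd))
```

To actually run the conversions, each list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg with libopus support is installed.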
Comparison result
The speech recognition accuracy for each bit rate is shown in the table below. The compression ratio relative to the original uncompressed data is also listed.
| | 256kbps | 128kbps | 64kbps | 32kbps | 16kbps | 6kbps |
|---|---|---|---|---|---|---|
| Compression ratio | Approximately 1/1 | Approximately 1/2 | Approximately 1/4 | Approximately 1/8 | Approximately 1/16 | Approximately 1/43 |
| Average speech recognition accuracy (by micro-averaging) | 98.0% | 98.0% | 98.1% | 98.2% | 97.9% | 95.5% |
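The compression ratios in the table can be reproduced by dividing the bit rate of the original data by each Opus bit rate. The sketch below assumes the baseline is the 16 kHz, 16-bit, mono uncompressed data at 256 kbps, which is consistent with the "approximately 1/1" entry for 256 kbps.

```python
ORIGINAL_KBPS = 256  # assumed baseline: 16 kHz, 16-bit, mono uncompressed audio
OPUS_KBPS = [256, 128, 64, 32, 16, 6]


def compression_ratio(original_kbps: float, compressed_kbps: float) -> float:
    """How many times smaller the compressed stream is than the original."""
    return original_kbps / compressed_kbps


ratios = {k: compression_ratio(ORIGINAL_KBPS, k) for k in OPUS_KBPS}
for kbps, ratio in ratios.items():
    print(f"{kbps:>3} kbps -> about 1/{ratio:.0f}")
```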
Consideration
At 6 kbps, accuracy was lower than at other compression rates, but no significant difference was observed at 16 kbps or above. This shows that speech recognition can be performed with sufficient accuracy even with a certain level of compression.
However, if the compression is so strong that it is difficult for the human ear to hear, it will affect recognition accuracy. We provide the following compression rate guidelines for each compression method.
- Speex: quality 7 or higher
- Opus: Compression ratio of about 1:10
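As a rough worked example, the Opus guideline can be turned into a target bit rate by dividing the uncompressed bit rate by the ratio. The sketch assumes the ratio is measured against 16 kHz, 16-bit, mono uncompressed audio, as in the table above; the function name is our own.

```python
def target_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int,
                        ratio: float) -> float:
    """Compressed bit rate (kbps) for a 1:ratio compression of uncompressed PCM."""
    uncompressed_kbps = sample_rate_hz * bit_depth * channels / 1000
    return uncompressed_kbps / ratio


# 1:10 compression of 16 kHz, 16-bit, mono audio (assumed baseline)
opus_target = target_bitrate_kbps(16_000, 16, 1, ratio=10)
print(f"Target Opus bit rate: about {opus_target:.1f} kbps")
```

Under that assumption the guideline corresponds to roughly 26 kbps, which sits between the 32 kbps and 16 kbps settings tested above.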
Summary
In this article, we used actual sample voice data to verify how speech recognition accuracy changes depending on the "sampling rate" and "compression rate."
With the AmiVoice API, when audio with a sampling rate higher than 16 kHz is input, it is downsampled to 16 kHz internally, so we were able to confirm that there is no difference in speech recognition accuracy even if audio data with an excessively high sampling rate is used.
We also found that moderate compression does not affect speech recognition accuracy. However, strong compression can reduce it, so care must be taken. If you are using compressed audio data and are not getting the recognition accuracy you expect, please check the compression rate of your audio data.
We also recommend that you try it out using your own company's voice data, referring to the verification procedures we have introduced.
AmiVoice API offers a free trial, so please try it out and use it to reduce data size and improve speech recognition accuracy.
Appendix
Below is a list of the file sizes of each audio file used in the verification.
| Sample Audio Data No. | File size (kbytes) 48kHz | File size (kbytes) 16kHz |
|---|---|---|
| 1 | 2,634 | 878 |
| 2 | 2,814 | 938 |
| 3 | 5,098 | 1,699 |
| 4 | 6,855 | 2,285 |
| 5 | 7,630 | 2,543 |
| 6 | 10,232 | 3,411 |
| 7 | 10,324 | 3,441 |
| 8 | 10,609 | 3,536 |
| 9 | 13,399 | 4,466 |
| 10 | 15,914 | 5,305 |

| Sample Audio Data No. | File size (kbytes) 256kbps | File size (kbytes) 128kbps | File size (kbytes) 64kbps | File size (kbytes) 32kbps | File size (kbytes) 16kbps | File size (kbytes) 6kbps |
|---|---|---|---|---|---|---|
| 1 | 885 | 445 | 224 | 114 | 56 | 22 |
| 2 | 946 | 475 | 239 | 124 | 61 | 25 |
| 3 | 1,713 | 860 | 433 | 208 | 102 | 38 |
| 4 | 2,303 | 1,156 | 581 | 299 | 153 | 58 |
| 5 | 2,563 | 1,287 | 647 | 338 | 167 | 68 |
| 6 | 3,436 | 1,725 | 867 | 425 | 208 | 81 |
| 7 | 3,467 | 1,741 | 875 | 430 | 213 | 79 |
| 8 | 3,563 | 1,789 | 899 | 455 | 222 | 87 |
| 9 | 4,500 | 2,259 | 1,136 | 585 | 295 | 113 |
| 10 | 5,344 | 2,683 | 1,348 | 667 | 325 | 119 |
