Tech blog

What does "confidence" mean when it appears in AmiVoice Cloud Platform's recognition results?

Published: 2022.04.11 Last updated: 2025.03.04

Takashi Okura

Hello everyone!

When you perform speech recognition with AmiVoice Cloud Platform (ACP), the recognition results are returned in the JSON format shown below. If you look closely, you'll see that in addition to "written" and "spoken," several other pieces of information are also output.

{
    "results": [
        {
            "tokens": [
                {
                    "written": "Advanced Media",
                    "confidence": 1,
                    "starttime": 570,
                    "endtime": 1578,
                    "spoken": "advanced media"
                },
                {
                    "written": "is",
                    "confidence": 1,
                    "starttime": 1578,
                    "endtime": 1850,
                    "spoken": "is"
                },
                {
                    "written": ",",
                    "confidence": 0.94,
                    "starttime": 1850,
                    "endtime": 2026,
                    "spoken": "_"
                },

(Omitted)

                {
                    "written": "goal",
                    "confidence": 0.94,
                    "starttime": 7722,
                    "endtime": 7962,
                    "spoken": "aim"
                },
                {
                    "written": "to",
                    "confidence": 0.94,
                    "starttime": 7962,
                    "endtime": 8490,
                    "spoken": "to do"
                },
                {
                    "written": ".",
                    "confidence": 0.85,
                    "starttime": 8490,
                    "endtime": 8778,
                    "spoken": "_"
                }
            ],
            "confidence": 0.999,
            "starttime": 250,
            "endtime": 8778,
            "tags": [],
            "rulename": "",
            "text": "Advanced Media aims to realize natural communication between humans and machines and create a prosperous future."
        }
    ],
    "utteranceid": "20211130/17/017d6ffb057f0a301cca94c9_20211130_173421",
    "text": "Advanced Media aims to realize natural communication between humans and machines and create a prosperous future.",
    "code": "",
    "message": ""
}

Among these fields, the question we receive most often from users is "What is confidence?" The ACP website explains it at the link below.

acp.amivoice.com

confidence: Confidence (a value between 0 and 1; 0 = low confidence, 1 = high confidence)

That description is accurate, but in this article I would like to explain confidence in AmiVoice in more detail.
*Please note that the specific method for calculating confidence is confidential.

What does "high/low confidence" mean for speech recognition results?

During speech recognition, the system predicts candidate texts for the input speech and outputs the one it judges most probable as the recognition result. In other words, the speech recognition system internally holds information about candidates other than the one it outputs.

Suppose the recognition result is "weather" (tenki). The "candidates other than the recognition result" are the other words the system considered, such as "turning point," which is also pronounced tenki, and similar-sounding words such as "electricity" (denki) and "fine" (genki).
For example, suppose the sound is closer to genki ("fine"), but the context suggests that "weather" might be right. In that case the speech recognition system may be unsure as well. It outputs "weather" but is not confident about it, so the confidence is lower.
On the other hand, if the sound is clearly tenki and the context also indicates that "weather" is correct, the system can confidently conclude "It's weather!", so the confidence is higher.
However, the speech recognition system only predicts text from speech; it does not know the correct text. Just as people can be overconfident and make mistakes, high confidence does not necessarily mean that the recognition result is correct. Please keep this in mind.
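The actual AmiVoice calculation is confidential, so the sketch below only illustrates the general idea with a common textbook approach: normalize the scores of all candidates (a softmax) and take the top candidate's share as its confidence. The candidates and their log-scores are made-up values, and `candidate_confidences` and `recognize` are hypothetical helpers, not part of ACP.

```python
import math


def candidate_confidences(log_scores: dict[str, float]) -> dict[str, float]:
    """Turn hypothetical log-scores of candidates into normalized probabilities (softmax)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    exp_scores = {word: math.exp(score - top) for word, score in log_scores.items()}
    total = sum(exp_scores.values())
    return {word: value / total for word, value in exp_scores.items()}


def recognize(log_scores: dict[str, float]) -> tuple[str, float]:
    """Return the most probable candidate and its normalized probability as a toy confidence."""
    posteriors = candidate_confidences(log_scores)
    best = max(posteriors, key=posteriors.get)
    return best, posteriors[best]


# Made-up scores: the sound is closer to "genki", but the context favors "weather".
ambiguous = {"weather": -1.0, "fine": -1.2, "electricity": -3.0, "turning point": -2.5}
# Made-up scores: both the sound and the context clearly point to "weather".
clear = {"weather": -0.1, "fine": -4.0, "electricity": -5.0, "turning point": -3.5}

for name, scores in [("ambiguous", ambiguous), ("clear", clear)]:
    word, conf = recognize(scores)
    print(f"{name}: {word} (confidence {conf:.2f})")
```

With these numbers, both cases output "weather," but the ambiguous case gives it a noticeably lower confidence because a close alternative ("fine") takes a large share of the probability.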

If confidence is low across the whole recognition result, the cause may be unclear speech or heavy background noise. As with a low recognition rate, it may help to speak more clearly or to check the recording environment.

There are two types of confidence

AmiVoice outputs confidence at two levels: per word and per utterance.

① Word-level confidence

This is the confidence output for each word in the recognition result. It corresponds to the "confidence" field inside each element of "tokens" in the example above.

② Utterance-level confidence

This is the confidence for the entire utterance. It corresponds to the "confidence" field directly under each element of "results" (0.999 in the example above). It is calculated differently from the word-level confidence, so the two values are independent of each other.
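To make the two levels concrete, here is a minimal Python sketch that reads both from a response in the format shown at the top of this article, using only the standard `json` module. The response string is trimmed to the first few tokens of that example, and `confidence_levels` is a hypothetical helper written for this article, not part of any ACP SDK.

```python
import json

# Trimmed copy of the example response: only the fields needed here.
response_text = """
{
  "results": [
    {
      "tokens": [
        {"written": "Advanced Media", "confidence": 1, "spoken": "advanced media"},
        {"written": "is", "confidence": 1, "spoken": "is"},
        {"written": ",", "confidence": 0.94, "spoken": "_"}
      ],
      "confidence": 0.999
    }
  ]
}
"""


def confidence_levels(response: dict) -> list[tuple[float, list[tuple[str, float]]]]:
    """For each utterance, return (utterance-level confidence, [(word, word-level confidence), ...])."""
    levels = []
    for result in response.get("results", []):
        # Utterance-level confidence: directly under each element of "results".
        utterance_conf = result["confidence"]
        # Word-level confidence: inside each element of "tokens".
        words = [(token["written"], token["confidence"]) for token in result.get("tokens", [])]
        levels.append((utterance_conf, words))
    return levels


response = json.loads(response_text)
for utterance_conf, words in confidence_levels(response):
    print("utterance:", utterance_conf)
    for written, conf in words:
        print("  word:", written, conf)
```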
As explained in the next section, utterance-level confidence is mainly intended for deciding whether to execute a command during command input (recognizing utterances of a few words), and it is not suited to recognizing long sentences. This is because it is calculated over the entire utterance, so silent periods and fillers such as "umm" and "er" are included in the calculation and can affect the value.
When recognizing long sentences, or in other situations where silent periods or fillers are likely, it may be better to take the average of the word-level confidences and use it as the confidence of the whole utterance.
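A minimal sketch of that averaging, assuming the token structure shown in the example response. `average_word_confidence` and its `skip_spoken` option are hypothetical; leaving out tokens whose "spoken" value is "_" (punctuation in the example) is one possible choice, not an ACP recommendation.

```python
from statistics import mean
from typing import Optional


def average_word_confidence(tokens: list[dict], skip_spoken: tuple[str, ...] = ()) -> Optional[float]:
    """Average the word-level confidences of one utterance as a rough utterance-level score.

    skip_spoken: hypothetical option to ignore tokens by their "spoken" value,
    e.g. ("_",) to leave out punctuation as it appears in the example output.
    Returns None when no tokens remain.
    """
    values = [token["confidence"] for token in tokens if token.get("spoken") not in skip_spoken]
    return mean(values) if values else None


# A few tokens taken from the example response above.
tokens = [
    {"written": "Advanced Media", "confidence": 1, "spoken": "advanced media"},
    {"written": "is", "confidence": 1, "spoken": "is"},
    {"written": ",", "confidence": 0.94, "spoken": "_"},
    {"written": ".", "confidence": 0.85, "spoken": "_"},
]

print(average_word_confidence(tokens))                      # all tokens
print(average_word_confidence(tokens, skip_spoken=("_",)))  # punctuation excluded
```

A plain mean is simple, but it can hide a single very unreliable word in a long sentence, so you may also want to look at the minimum word-level confidence.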

How to use confidence

How can the confidence in speech recognition results be put to use? Here are some examples from AmiVoice products.

① Highlight words that are likely to be misrecognized

This is an example of using word-level confidence. By changing the color and size of the text for words with a confidence level below a certain value, you can highlight words that are likely to be misrecognized.

As a concrete example, our product AmiVoice Ex7 has a function that displays words likely to be misrecognized in red. This decision is based on word-level confidence.

AmiVoice Ex7 voice input screen*1

Highlighting words that are likely to be misrecognized can be useful for correcting the recognition results later. It can also be used to prompt the user to repeat the part in question in the application.
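The following is a minimal sketch of this kind of highlighting, not how AmiVoice Ex7 implements it. It wraps each word whose word-level confidence falls below a threshold in `[[...]]`; the threshold of 0.5, the marker, the function name, and the sample tokens are all assumptions for illustration.

```python
def highlight_low_confidence(tokens: list[dict], threshold: float = 0.5, sep: str = " ") -> str:
    """Join the "written" values, wrapping words below `threshold` in [[...]]."""
    parts = []
    for token in tokens:
        word = token["written"]
        # Mark words whose word-level confidence is below the (hypothetical) threshold.
        if token["confidence"] < threshold:
            word = f"[[{word}]]"
        parts.append(word)
    return sep.join(parts)


# Made-up tokens for illustration.
tokens = [
    {"written": "the", "confidence": 0.98},
    {"written": "weather", "confidence": 0.42},
    {"written": "is", "confidence": 0.95},
    {"written": "nice", "confidence": 0.90},
]
print(highlight_low_confidence(tokens))
```

For Japanese output, where words are not separated by spaces, pass `sep=""`. In a GUI you would change the color or size of the text instead of adding brackets.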

② Use for command input and similar tasks

Utterance-level confidence is useful when recognizing only specific utterances, such as commands. It may be easier to understand if you imagine how a smart speaker works.

With command input, it is a problem if ambient noise triggers a command when no one is speaking. Executing a command only when the recognition result's confidence is at or above a certain threshold can prevent such malfunctions.
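Below is a minimal sketch of such a gate. It assumes you receive one element of "results" (with its "text" and utterance-level "confidence" fields) per utterance; the command table, the exact-match lookup on the text, the threshold of 0.8, and the function name are hypothetical choices for illustration.

```python
from typing import Callable, Optional


def maybe_execute(result: dict, commands: dict[str, Callable[[], str]], threshold: float = 0.8) -> Optional[str]:
    """Run the matching command only if the utterance-level confidence reaches `threshold`."""
    if result.get("confidence", 0.0) < threshold:
        return None  # Not confident enough: treat it as noise and do nothing.
    action = commands.get(result.get("text", "").strip())
    return action() if action else None


# Hypothetical command table mapping recognized text to a handler.
commands = {"turn on the light": lambda: "light on"}

print(maybe_execute({"text": "turn on the light", "confidence": 0.95}, commands))
print(maybe_execute({"text": "turn on the light", "confidence": 0.40}, commands))
```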

However, this threshold must be set with care. Raising it reduces the risk of malfunction but makes it more likely that correct utterances are missed and commands are not executed. Lowering it has the opposite effect: fewer correct utterances are missed, but the risk of malfunction increases. The threshold therefore sets the trade-off between the likelihood of missing correct utterances and the risk of malfunction.
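One way to see this trade-off is to measure, on utterances you have labeled yourself, how often genuine commands fall below the threshold (missed commands) and how often non-command sounds reach it (false triggers). The sketch below does that for a few thresholds. The sample confidences are made-up values for illustration, not measurements from ACP, and `tradeoff` is a hypothetical helper.

```python
def tradeoff(samples: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Return (miss rate, false-trigger rate) at `threshold`.

    samples: (utterance-level confidence, True if the utterance really was a command).
    """
    genuine = [conf for conf, is_command in samples if is_command]
    other = [conf for conf, is_command in samples if not is_command]
    # Genuine commands below the threshold are missed.
    miss_rate = sum(conf < threshold for conf in genuine) / len(genuine) if genuine else 0.0
    # Non-commands at or above the threshold would trigger a malfunction.
    false_trigger_rate = sum(conf >= threshold for conf in other) / len(other) if other else 0.0
    return miss_rate, false_trigger_rate


# Made-up labeled samples for illustration only (not measured data).
samples = [(0.97, True), (0.91, True), (0.75, True), (0.55, True),
           (0.82, False), (0.60, False), (0.30, False), (0.10, False)]

for threshold in (0.5, 0.7, 0.9):
    miss, false_trigger = tradeoff(samples, threshold)
    print(f"threshold={threshold:.1f}  missed commands={miss:.0%}  false triggers={false_trigger:.0%}")
```

As the threshold rises, the false-trigger rate falls and the miss rate rises. In practice you would choose the threshold from utterances recorded in your own environment.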

As of April 2022, we are developing a new mode for ACP that will make this kind of command input easy, and we plan to release it. Please look forward to it!

Final thoughts

In this article, we explained the "confidence" displayed in ACP recognition results. Confidence is calculated using the recognition result and candidate information other than the recognition result. We hope you will make use of the confidence value in addition to the recognition result itself.

If this article has sparked your interest in speech recognition technology or AmiVoice Cloud Platform, please feel free to try it at https://acp.amivoice.com/. You can use all engines free of charge for up to 60 minutes of audio per month.

Thank you for reading this far!

About the author

  • Takashi Okura

    He joined Advanced Media as a new graduate.
    He currently works mainly on research and development to improve speech recognition accuracy.
    His hobbies include traveling (mainly by train), reading (mainly novels), and board games.

     