[OBS] Using AmiVoice to add real-time automatic subtitles to your streaming screen [And how to sync the subtitles with audio and video!]

S.R
One use case for speech recognition is real-time automatic subtitles during live streaming. In this article, we have created a sample program for live streamers active on YouTube Live, Twitch, etc. that outputs subtitles during live streaming by linking the speech recognition API (AmiVoice Cloud Platform) with OBS.

- 1. Overview
- 2. Sample program created this time
- 3. I tried it
- 4. Summary
- [Bonus] Real-time automatic subtitles and synchronized audio and subtitles on the stream
- Reference link
1. Overview

Shi-chan
The mascot character for "AmiAgent," a voice dialogue service provided by Advanced Media, Inc.
*This character's online activities are fictional.
I recently started streaming my games on YouTube! It's fun to see so many different people's reactions.

However, with live streaming, people who start watching halfway through can have trouble understanding what's being said...
Ah, but even if you miss part of the broadcast, you can rewind and catch up.*1 Subtitles may also be convenient when you pause a video, since you can grasp the content immediately.*2
Let's find out how to add subtitles to your live streams!
*The system introduced here is a speech recognition subtitling system for ACP (AmiVoice Cloud Platform) users. To use it, you will need to register as a user and obtain an APPKEY.
1.1. About ACP (AmiVoice Cloud Platform)
First, we will explain the AmiVoice Cloud Platform (hereinafter referred to as ACP), the speech recognition technology used for the real-time automatic subtitles introduced in this article.
ACP is a speech recognition API for software developers. Its features include not only a general-purpose speech recognition engine but also domain-specific engines tailored to particular applications, plus a word registration function that lets users customize recognition to their own needs.
It also has a feature to automatically insert punctuation.*3
Therefore, by using ACP's speech recognition, it may be possible to effectively convey the atmosphere of a live broadcast through subtitles.
When talking about games or subculture, there are many different terms for different content, so it's nice to be able to create your own vocabulary.
For more information about ACP and how to use it, please see the following article:
AmiVoice Cloud Platform-Tech Blog
1.2. About OBS (Open Broadcaster Software)
Actual live streaming requires an environment for delivering video and audio. This is where "OBS" comes in, software that also plays an important role in the real-time automatic subtitles described in this article.
OBS is open-source software for video recording and live streaming. It is an abbreviation for "Open Broadcaster Software," and is characterized by its compatibility with various streaming services *4 and the ability to freely customize streaming environments.
Because it is highly customizable, high in quality, and free, it is regarded as a de facto standard in the game streaming and e-sports communities.
Your favorite person and even famous streamers might be using OBS!
For more detailed instructions on how to use it, please see the link below.
This time, we will focus on real-time automatic subtitles using OBS.
The OBS feature that matters most for this article is that you build your own streaming screen by combining "sources", the elements of the screen shown to viewers.
Examples of sources in OBS include:
・Text source: Text data to be displayed on the streaming screen
・Video source: Video data to be displayed on the streaming screen
・Audio source: The audio data you want to stream (microphone, audio playing on your PC, etc.)
In particular, the "text source" is important for this real-time automatic subtitle system.
(It's a bit confusing with the "source code" of a program...)
1.3. About speech recognition and real-time automatic subtitling
First of all, what exactly is "real-time automatic subtitling," the subject of this article?
This primarily refers to a system that uses speech recognition technology, such as the ACP introduced above, to automatically transcribe speech into text and display it as subtitles.
In particular, in personal streaming environments using streaming software such as OBS, the above system is almost always used.
(Very rarely, there are apparently live streams where volunteers manually transcribe and input subtitle text in real time... Amazing...)
In other words, the flow of the process is as follows: "The streamer's voice is recognized in real time, and the text data of the recognition results is displayed on the streaming screen."
1.4. Problems with real-time automatic subtitles on streaming systems such as YouTube Live
Real-time automatic subtitles using speech recognition as mentioned above may be available as a standard feature in streaming services such as YouTube Live *5.
However, there are two major problems with this feature.
1. Unexpected audio is also displayed as subtitles.
One problem with the standard real-time automatic subtitles used by streaming services is that sounds other than the streamer's voice are recognized and the results are displayed as subtitles.
I tried it out right away, and it certainly does seem to recognize the voices in the game and turn them into subtitles.
In many cases, a major cause is that the audio fed to speech recognition includes not only the streamer's own voice (microphone audio) but also audio that should not be recognized, such as background music and PC audio.
2. The subtitle UI is pre-made
Another limitation of the standard real-time automatic subtitles on streaming services is that they must use the platform's built-in subtitle UI.
I'm particular about the live streaming screen, so I'd like to make detailed adjustments to things like fonts and colors for the subtitles and captions too, if possible...
However, the subtitle UI provided by streaming services such as YouTube Live allows viewers to freely switch the display of subtitles, which is an advantage in situations where viewers do not need subtitles.
2. Sample program created this time
In the previous chapter, we explained the speech recognition technology "ACP", the live streaming software "OBS", and real-time automatic subtitles using speech recognition.
This chapter explains the "ACP Live Captioning Sample", the real-time captioning system developed for this article.
ACP Live Captioning Sample is a browser-based real-time automatic subtitling system for OBS using the ACP introduced at the beginning. By setting up the environment described below and simply operating through a web browser, you can relatively easily add subtitles to your OBS streaming screen.
*The actual source code can be downloaded from here.
2.1. System configuration
From here, we will explain the structure of the ACP Live Captioning Sample.
The following is a system configuration diagram.

We will explain the steps to display the speech recognition results in OBS using the system configuration diagram.
❶ First, the "ACP Live Captioning Sample" page running in the web browser captures the streamer's voice in real time from the PC's microphone input.
❷ The captured speech is sent to the speech recognition server by the API used in the HTML file, and the recognition results are returned as text data.
❸ The text data is then passed through OBS-Websocket*6 to OBS's Text (GDI+) source and displayed as subtitles.
I see, so you can display the text data generated by speech recognition as a source in OBS! Let's take a look at how to do it!
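Step ❸ can be illustrated with a single obs-websocket request. The snippet below is only a minimal sketch, not the sample program's actual code (which runs in the browser through obs-websocket-js). It builds the JSON message that, under the obs-websocket 4.x protocol, updates the text of a Text (GDI+) source. The helper name `build_subtitle_request` and the message id are hypothetical, and the source name "Acp" is assumed to match the default used by the sample.

```python
import json


def build_subtitle_request(text: str, source: str = "Acp", message_id: str = "1") -> str:
    """Build an obs-websocket 4.x request that sets the text of a Text (GDI+) source.

    Illustrative helper (hypothetical name). It only serializes the message
    and does not open a WebSocket connection.
    """
    request = {
        "request-type": "SetTextGDIPlusProperties",  # 4.x request for Text (GDI+) sources
        "message-id": message_id,                    # any unique string chosen by the client
        "source": source,                            # name of the text source in OBS
        "text": text,                                # recognition result to display
    }
    # ensure_ascii=False keeps Japanese subtitles readable in the payload
    return json.dumps(request, ensure_ascii=False)


if __name__ == "__main__":
    print(build_subtitle_request("こんにちは、配信を始めます。"))
```

A client would send this string over its WebSocket connection to OBS. With obs-websocket 4.x the default address is usually ws://localhost:4444, but treat that as an assumption and check it against your own "WebSockets Server Settings".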
2.2. How to add real-time automatic subtitles to OBS (Download, Installation, and Settings)
From here, we will explain the steps to add real-time automatic subtitles to OBS using actual screenshots.
*Operation has been confirmed on Windows 10 with the Google Chrome web browser. Operation in other environments has not been confirmed.
❶ Install the following software.
・OBS (ver.25.0.8 or later)
https://obsproject.com/ja/download
・OBS-websocket plugin (ver.4.8 or later)
https://github.com/Palakis/obs-websocket/releases
*Depending on the version of OBS and the websocket plugin, it may not work properly, so please pay attention to the version when installing.
This phenomenon is explained in more detail in the link below.
❷ Start OBS and check the settings for the Websocket plugin.
・In OBS, select "WebSockets Server Settings" from the "ツール" (Tools) menu.

・In "WebSockets Server Settings", configure the settings as shown in the image. For this article, uncheck "認証を有効にする" (Enable authentication).

❸ Add a text source that will display subtitle text on the OBS streaming screen.
・In OBS, click the "+" button under "ソース" (Sources) and add "テキスト(GDI+)" (Text (GDI+)).

・Create a new source and set the name to "Acp". (This is because the default setting for the sample program we prepared this time specifies the name "Acp", so we need to match that.)
If there are no problems, just press OK to add the source for the subtitle text.

❹ Register as a user on ACP and prepare your APPKEY.
This is sample software for ACP users. An APPKEY is required to actually perform speech recognition.
You can find out more about registering for ACP and obtaining an APPKEY in this article.
AmiVoice Cloud Platform-Tech Blog
❺ Open the following link in your web browser. (The latest version of Google Chrome is recommended.)
*We are using obs-websocket-js, which is a JS wrapper for the Websockets API.
You will then see a page like the one below.

❻ As shown in the image, enter the information needed to display real-time automatic subtitles in OBS: the name of the text source for the subtitles ("Acp" in this article) and your ACP APPKEY.

❼ With OBS running, press the "OBSと接続" (Connect to OBS) button.
*If the connection is successful, the button will light up red as shown in the image.

❽ Press the "録音の開始" (Start recording) button to begin speech recognition. Subtitles will then be displayed in OBS.
*If the message "このファイルが次の許可を求めています" (This file is requesting the following permission) appears, allow use of the microphone.
(Adjust the position and size of the subtitle text source in OBS as needed.)

It seems that you can try out the real-time automatic subtitles right away by opening it in your web browser!
For information on setting up OBS-websocket and subtitles, the following links may be helpful.
https://mikune.com/yukarinatteconnector-obs-settings/#toc11
3. I tried it

Wow! What I said is actually showing up on the stream screen! Since it's an OBS stream screen element, it looks like the font and size can be freely customized. There's also an automatic line break feature based on character count, so there's no worry about the subtitle content getting cut off.
Since AmiVoice is used as the speech recognition engine, it seems to reliably handle punctuation marks and registered user-defined words as well.
However, it seems like there is a slight discrepancy between the audio and subtitles.
Do you want to customize it in various ways?
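As one example of such customization, the character-count line break mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions, not the sample program's actual logic: the function name `wrap_by_char_count` and the default of 20 characters per line are hypothetical. Counting characters rather than words suits Japanese text, which has no spaces between words.

```python
def wrap_by_char_count(text: str, max_chars: int = 20) -> str:
    """Insert a line break after every max_chars characters.

    Hypothetical helper for illustration. The limit of 20 characters
    is an assumed value, not one taken from the sample program.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    # Slice the text into consecutive chunks of at most max_chars characters
    lines = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(wrap_by_char_count("今日はゲーム配信に字幕をつける方法を紹介します。", max_chars=10))
```

The wrapped string can be used as the text of the OBS text source as-is, so no line runs past the width chosen for the subtitle area.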
4. Summary
I looked into how to display the results of speech recognition using ACP as real-time automatic subtitles in OBS.
ACP offers several types of speech recognition engines, but for video streaming, I personally recommend the "Conversation_General-purpose" engine! (As of November 2021)
If there is any demand from you all, we may consider developing a more accurate speech recognition engine specifically for video streaming.
Could your voice be used to create a "video streaming specialized engine"?
[Bonus] Real-time automatic subtitles and synchronized audio and subtitles on the stream
This section is aimed at people who have some experience with OBS. It shows how to synchronize the video and audio sources with the subtitles on your stream.
This is a fairly brute-force workaround, so please treat it as a reference only.
Real-time automatic subtitle delay
When you watch movies or TV shows with subtitles, the subtitles are probably displayed at a timing that is just right for easy reading.
However, for automatic subtitles that "use speech recognition to generate subtitles in real-time", such as those used in live broadcasts and web streaming, there is a delay (lag) until the recognition results are displayed.
Therefore, the content of these subtitles will be displayed slightly behind the audio and video.
It's true that in the stream, the subtitle content seems to be out of sync with the audio and video...
This [bonus section] introduces a method for reducing the lag between the subtitles and the audio/video that is characteristic of real-time automatic subtitles.
How video, audio and subtitles are synchronized
This system aims to eliminate delays in the display of subtitles by intentionally delaying (offsetting) the video and audio that are actually being streamed by a few seconds.
Let's set it up specifically on the OBS screen.
Audio source synchronization (sync offset)
In my environment, there was a delay of about 2.5 seconds between when someone spoke and when the subtitles appeared. Below, I will explain how to eliminate this 2.5 second delay.
❶ In the OBS "音声ミキサー" (Audio Mixer), click the gear icon of the audio input capture you want to delay ⇒ select "オーディオの詳細プロパティ" (Advanced Audio Properties).

❷ In "オーディオの詳細プロパティ", set the "同期オフセット" (Sync Offset) value to "2500ms" (2.5 seconds).

This will delay the audio stream by 2.5 seconds.
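For readers who prefer to script this setting, the same offset can in principle be applied through OBS-websocket instead of the dialog. The following is a minimal sketch, assuming the obs-websocket 4.x `SetSyncOffset` request, which takes the offset in nanoseconds. The helper name and the audio source name "Mic/Aux" are hypothetical, so replace the name with the one shown in your own audio mixer.

```python
import json

NS_PER_MS = 1_000_000  # SetSyncOffset expects nanoseconds, the OBS dialog shows milliseconds


def build_sync_offset_request(source: str, offset_ms: int, message_id: str = "sync-1") -> str:
    """Build an obs-websocket 4.x request that sets an audio source's sync offset.

    Illustrative helper (hypothetical name). It only serializes the message.
    """
    request = {
        "request-type": "SetSyncOffset",
        "message-id": message_id,
        "source": source,                  # audio source name as shown in the mixer
        "offset": offset_ms * NS_PER_MS,   # convert milliseconds to nanoseconds
    }
    return json.dumps(request, ensure_ascii=False)


if __name__ == "__main__":
    # "Mic/Aux" is a placeholder source name
    print(build_sync_offset_request("Mic/Aux", 2500))
```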
Video source synchronization (rendering delay)
❶ In OBS's "ソース" (Sources) list, right-click the video source you want to delay (screen capture in this case) ⇒ select "フィルタ" (Filters).

❷ In "フィルタ", use the "+" button to add "レンダリング遅延" (Render Delay).
Set the "遅延時間" (Delay) value of "レンダリング遅延" to "500ms" (the maximum value).
Adding five of these filters gives a total offset of 2500ms (2.5 seconds).
*Please delay by the same number of seconds as the synchronization offset of the audio source.
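The arithmetic behind stacking filters can be sketched as follows. This is only an illustration of the calculation, assuming the 500ms per-filter maximum described above. The function name `split_render_delay` is hypothetical.

```python
MAX_RENDER_DELAY_MS = 500  # per-filter maximum, as described in this article


def split_render_delay(total_ms: int, max_per_filter_ms: int = MAX_RENDER_DELAY_MS) -> list[int]:
    """Return the delay to set on each stacked filter so they sum to total_ms."""
    if total_ms < 0:
        raise ValueError("total_ms must not be negative")
    if max_per_filter_ms < 1:
        raise ValueError("max_per_filter_ms must be at least 1")
    full_filters, remainder = divmod(total_ms, max_per_filter_ms)
    delays = [max_per_filter_ms] * full_filters
    if remainder:
        delays.append(remainder)  # one last filter covers what is left
    return delays


if __name__ == "__main__":
    print(split_render_delay(2500))  # five filters of 500ms each
    print(split_render_delay(1200))  # two filters of 500ms and one of 200ms
```

If your measured lag is not a multiple of 500ms, the last filter simply gets the remainder, so the video delay can still match the audio sync offset exactly.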
Now that the video, audio and subtitles are synchronized, you won't notice any misalignment of the subtitles during streaming!
Points to note regarding subtitle delay countermeasures
There are two points to note about the above subtitle delay countermeasures.
- In actual streaming, there will definitely be a delay from real time due to the synchronization offset.
This method intentionally introduces a delay due to a synchronization offset, so by its nature both the video and audio will be delayed by a few seconds from real time. Therefore, it is not recommended if low latency is important.
Well, a few seconds of delay in streaming is unavoidable to some extent, so it may not be a problem.
- The delay time may need to be changed depending on the distribution environment and the status of the speech recognition server.
The values used in the above settings (synchronization offset, rendering delay) are based on the delay times in the author's environment. The delay times may vary depending on the environment and situation.
To take more accurate measures to prevent subtitle delays, you need to know in advance the time (lag) it takes for the subtitles to appear, and then take the above measures to adjust to that time.
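One simple way to estimate that lag is to note, for a few test utterances, when you started speaking and when the subtitle appeared, and then take a typical value. The snippet below is a minimal sketch of that idea. The function name, the use of the median, and the rounding to 100ms steps are all assumptions made for illustration, and the timestamps in the example are made-up values, not measurements.

```python
from statistics import median


def estimate_offset_ms(spoken_ms: list[int], shown_ms: list[int], step_ms: int = 100) -> int:
    """Estimate the subtitle lag from paired timestamps given in milliseconds.

    spoken_ms[i] is when utterance i was spoken and shown_ms[i] is when its
    subtitle appeared. Returns the median lag rounded to a multiple of step_ms.
    """
    if not spoken_ms or len(spoken_ms) != len(shown_ms):
        raise ValueError("need the same non-zero number of spoken and shown timestamps")
    lags = [shown - spoken for spoken, shown in zip(spoken_ms, shown_ms)]
    # The median is less sensitive than the mean to one unusually slow result
    typical_lag = median(lags)
    return int(round(typical_lag / step_ms)) * step_ms


if __name__ == "__main__":
    # Made-up example timestamps for three test utterances
    print(estimate_offset_ms([0, 10_000, 20_000], [2_400, 12_600, 22_500]))
```

The resulting value can then be used for both the audio sync offset and the total rendering delay.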
Reference link
Controlling OBS via Websockets from node.js | Kirimin-chan Note | note
Person who wrote this article
S.R
A third-year new graduate at Advanced Media, Inc.
I enjoy coming up with interactive content and useful ideas.
The person who illustrated this article
Banned cosmic particles
A person who draws pictures.
*1: Many live streaming systems have a catch-up playback function that allows viewers to rewind or resume viewing during a live stream. Example: "Enable DVR for your live stream – YouTube Help"
*2: With subtitled live streaming, you can expect benefits such as a reduced risk of missing what is said and the ability to understand the content even with the sound muted.
*3: This information is current as of November 2021. Google Speech to Text can also insert punctuation marks, depending on the settings.
*4: The most common one is YouTube, but OBS supports over 50 other streaming services as standard. See "Introducing a detailed guide to using and setting up OBS Studio [Just 4 key points!] | Even beginners can start streaming right away! | Esports PLUS"
*5: Streaming platforms such as YouTube Live and Twitch come equipped with real-time automatic subtitles (known as "automatic transcription" on YouTube) as standard. For more information, see "Automatic captioning for live streams – YouTube Help"
*6: By using an OBS plugin called "OBS-Websocket", you can control OBS from external services or other applications.