I implemented microphone recording in a Windows application. The first step in developing a voice recognition application!

Introduction
I am in charge of ACP-related development at Advanced Media, Inc. This time, I would like to implement microphone recording in a Windows application using C#.
Finished app

Development environment
- Windows 10
- Visual Studio 2019
- WPF application
- .NET 5.0
Implementation
I would like to implement the app using the following steps:
- Registration for AmiVoice Cloud Platform (ACP)
- Project launch
- Adapting the MVVM model
- Acquiring audio data from a connected microphone
- Speech recognition over WebSocket using ACP
- Linking the created program and UI
Step 1: Register for AmiVoice Cloud Platform (ACP)
To use streaming voice recognition, you first need to register with ACP.
For information on how to register, please see the article below.
Step 2: Launching the project
Open Visual Studio 2019, click "Create a new project", search for "WPF", and select "WPF Application".

Enter the project name, click "Next", and select the target framework ".NET 5.0".

Step 3: Adapting the MVVM model
WPF recommends a design pattern called MVVM (Model-View-ViewModel), a software architecture that serves as a guideline for the internal structure of an application.
This time, using this blog as a reference, we will create an application whose internal structure follows the MVVM pattern.
- Delete MainWindow.xaml and MainWindow.xaml.cs and add Views, ViewModels, and Models folders. To add a folder, right-click on the project name and select Add to create a new folder. In the image, the project name is "RecApp" under "Solution."
- Add the MainView window class to the Views folder and the MainViewModel class to the ViewModels folder by right-clicking each folder and adding a new item. For the MainView window class, select Window (WPF); "Window1.xaml" and "Window1.xaml.cs" will be added. These file names are hard to understand, so rename them to "MainView.xaml" and "MainView.xaml.cs". At the same time, change the class name Window1 to MainView both in x:Class="RecApp.Views.Window1" in the Window tag of MainView.xaml and in MainView.xaml.cs.

- Delete the StartupUri property written in App.xaml and override the OnStartup() method in App.xaml.cs with the following code.
App.xaml.cs
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Threading.Tasks;
using System.Windows;
using RecApp.Views;
using RecApp.ViewModels;
namespace RecApp
{
    /// <summary>
    /// Interaction logic for App.xaml
    /// </summary>
    public partial class App : Application
    {
        protected override void OnStartup(StartupEventArgs e)
        {
            base.OnStartup(e);
            // Instantiate the window
            MainView w = new MainView();
            // Instantiate the ViewModel for the window
            MainViewModel vm = new MainViewModel();
            // Register the handler that runs when the window is closing
            w.Closing += vm.Closing;
            // Set the ViewModel as the window's data context
            w.DataContext = vm;
            // Show the window
            w.Show();
        }
    }
}
Step 4: Acquire audio data from a connected microphone
Preparation
To obtain audio data from a connected microphone in C#, use a library called NAudio.
NAudio is a library that makes it easy to handle audio-related processing such as audio input/output, device selection, and format conversion. The NAudio source code is also available on GitHub, so take a look there if you're interested.
First, to use NAudio, download the library from NuGet. Go to Tools > NuGet Package Manager > Manage NuGet Packages for Solution > Browse and search for "NAudio". Change the NAudio version to 1.10.0, check the target project, and install it.
*We are using 1.10.0 instead of the latest version because it has enough functionality and is compatible with SoundTouch, a library that lets you change the playback speed and pitch in real time when playing audio files.
(WPF has a tag called MediaElement that plays video and audio, and you can change the playback speed with this as well, but there may be a period of silence when you change the playback speed. However, when using SoundTouch, there is no silence, so you can change the playback speed without any discomfort.)

Next, to create a class that performs NAudio-related operations, create a new class file in the Models folder (hereafter referred to as the Audio class). All processing using NAudio will be written in the Audio class.
NAudio provides two classes for recording audio: the WaveIn class and the WaveInEvent class. Both classes can handle the following:
- Setting the microphone device to use for recording
- Processing the recorded audio
- Processing at the end of recording
Their internal processing differs, but they can do the same things. So which of the two should you use?
WaveIn: GUI applications
WaveInEvent: Console applications (can also be used in GUI applications)
They are used as shown above. The WaveIn class relies on Windows messages, whereas the WaveInEvent class is designed not to use them. Since we are creating a GUI application this time, we will use the WaveIn class to write the recording process.
*The WaveIn and WaveInEvent classes expose the same methods, so if you want to build this app with the WaveInEvent class, simply replace WaveIn with WaveInEvent and the recording code will work.
Recording Processing
The code for recording is as follows:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using NAudio.Wave;
namespace RecApp.Models
{
    /// <summary>
    /// Enumeration that represents the recording state
    /// </summary>
    public enum RecordingState
    {
        Recording,
        Stop,
        Error
    }
    /// <summary>
    /// Class that operates NAudio
    /// </summary>
    class Audio
    {
        // Class that performs recording
        private WaveIn m_waveIn = null;
        // Class that describes the recording format
        private WaveFormat m_recordinFormat = null;
        // Property that tracks the recording state
        public RecordingState m_recordingState { get; private set; }
        public Audio()
        {
            // Create a 16000 Hz, 1-channel, 16-bit PCM format
            m_recordinFormat = new WaveFormat(16000, 1);
            // Start in the stopped state (the enum's default value would be Recording)
            m_recordingState = RecordingState.Stop;
        }
        /// <summary>
        /// Start recording
        /// </summary>
        public void RecordingStart()
        {
            // Initialize and set the recording format
            m_waveIn = new WaveIn();
            m_waveIn.WaveFormat = m_recordinFormat;
            // Select the recording device
            // Record with the default device
            m_waveIn.DeviceNumber = 0;
            // Event raised while recording
            m_waveIn.DataAvailable += (_, ee) =>
            {
                try
                {
                    // Make sure no error has occurred
                    if (m_recordingState == RecordingState.Recording)
                    {
                        // Process the recorded audio data here
                    }
                }
                catch (Exception)
                {
                    // Check whether an error has already been handled
                    if (m_recordingState == RecordingState.Recording)
                    {
                        // Prevent the stop process from running twice
                        m_recordingState = RecordingState.Error;
                        // Stop recording
                        RecordingStop();
                    }
                }
            };
            // Event raised when recording stops
            m_waveIn.RecordingStopped += (_, __) =>
            {
                // Release the instance if the state is Recording or Error
                if (m_recordingState != RecordingState.Stop)
                {
                    // Update the recording state
                    m_recordingState = RecordingState.Stop;
                    // Release the WaveIn instance
                    m_waveIn.Dispose();
                    m_waveIn = null;
                    // Do any end-of-recording processing here
                }
            };
            // Start recording
            m_waveIn.StartRecording();
            m_recordingState = RecordingState.Recording;
        }
        /// <summary>
        /// Stop recording
        /// </summary>
        public void RecordingStop()
        {
            // Stop recording
            m_waveIn?.StopRecording();
        }
    }
}
Let's look at each step. First, let's look at the constructor.
// Create a 16000 Hz, 1-channel, 16-bit PCM format
m_recordinFormat = new WaveFormat(16000, 1);
This specifies the audio format for recording. In this example, only the sampling frequency and number of channels are specified, in which case the bit depth is 16 bits, but you can also set the bit depth explicitly.
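To get a feel for what this format means in terms of data volume, here is a small sketch that works out the byte rate of uncompressed PCM. The PcmMath class and its method names are made up for illustration and are not part of NAudio; the 100 ms buffer length is the default DataAvailable interval mentioned in this article.

```csharp
public static class PcmMath
{
    // Bytes produced per second of uncompressed PCM audio.
    public static int BytesPerSecond(int sampleRate, int channels, int bitsPerSample)
    {
        return sampleRate * channels * (bitsPerSample / 8);
    }

    // Bytes contained in a buffer that covers the given number of milliseconds.
    public static int BytesPerBuffer(int sampleRate, int channels, int bitsPerSample, int milliseconds)
    {
        return BytesPerSecond(sampleRate, channels, bitsPerSample) * milliseconds / 1000;
    }
}
```

For 16000 Hz, 1 channel, and 16 bits, BytesPerSecond gives 32000 and BytesPerBuffer with 100 ms gives 3200, so each recorded chunk should be around 3200 bytes under these assumptions.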
The process to be performed when recording starts is written in the RecordingStart method.
// Initialize and set the recording format
m_waveIn = new WaveIn();
m_waveIn.WaveFormat = m_recordinFormat;
// Select the recording device
// Record with the default device
m_waveIn.DeviceNumber = 0;
This specifies the recording device and format required for recording. A new WaveIn instance is also created each time recording starts. When the same WaveIn instance is reused, "waveInUnprepareHeader", one of the Win32 APIs used internally, reports the "WAVERR_STILLPLAYING" error, and the recording stop process runs as soon as recording starts (the stop process runs whenever any error occurs). Creating a new instance each time avoids this.
You can write the processing to be performed on the recorded audio data in DataAvailable.
*DataAvailable is an event that fires each time a buffer inside WaveIn is filled with recorded data. By default, this event fires every 100 ms.
// Event raised while recording
m_waveIn.DataAvailable += (_, ee) =>
{
    try
    {
        // Make sure no error has occurred
        if (m_recordingState == RecordingState.Recording)
        {
            // Process the recorded audio data here
        }
    }
    catch (Exception)
    {
        // Check whether an error has already been handled
        if (m_recordingState == RecordingState.Recording)
        {
            // Prevent the stop process from running twice
            m_recordingState = RecordingState.Error;
            // Stop recording
            RecordingStop();
        }
    }
};
A try-catch statement is used to stop recording if an error occurs while processing the recorded data.
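As one example of "processing the recorded audio data", the sketch below computes the peak level of a chunk of 16-bit little-endian PCM, which could drive a simple level meter. The PcmLevel class is hypothetical; its two parameters are meant to mirror the Buffer and BytesRecorded values that the DataAvailable event arguments carry.

```csharp
using System;

public static class PcmLevel
{
    // Returns the peak amplitude (0.0 to 1.0) of 16-bit little-endian PCM samples.
    public static double Peak(byte[] buffer, int bytesRecorded)
    {
        int peak = 0;
        // Step two bytes at a time; a trailing odd byte is ignored.
        for (int i = 0; i + 1 < bytesRecorded; i += 2)
        {
            short sample = (short)(buffer[i] | (buffer[i + 1] << 8));
            int magnitude = Math.Abs((int)sample);
            if (magnitude > peak) peak = magnitude;
        }
        return peak / 32768.0;
    }
}
```

Inside the handler this would be called as PcmLevel.Peak(ee.Buffer, ee.BytesRecorded). Only the first bytesRecorded bytes are read, because the buffer may be larger than the data actually recorded.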
You can write the processing to be performed when recording stops in RecordingStopped.
// Event raised when recording stops
m_waveIn.RecordingStopped += (_, __) =>
{
    // Release the instance if the state is Recording or Error
    if (m_recordingState != RecordingState.Stop)
    {
        // Update the recording state
        m_recordingState = RecordingState.Stop;
        // Release the WaveIn instance
        m_waveIn.Dispose();
        m_waveIn = null;
        // Do any end-of-recording processing here
    }
};
RecordingStopped is the last process to be performed, so the instance is released here.
*After StopRecording is called, the WaveIn class processes the following in order: flag change (recording stopped) → DataAvailable (for data not yet processed) → RecordingStopped.
Recording is started with the StartRecording method of the "WaveIn" class and stopped with its StopRecording method.
// Start recording
m_waveIn.StartRecording();
m_recordingState = RecordingState.Recording;
/// <summary>
/// Stop recording
/// </summary>
public void RecordingStop()
{
    // Stop recording
    m_waveIn?.StopRecording();
}
Step 5: Speech recognition over WebSocket using ACP
Preparation
The WebSocket part of speech recognition over WebSocket using ACP is explained in the article below, so this time I would like to incorporate the WebSocket code from the sample program on the ACP homepage into my app as is.
Real-time speech recognition on Apple Watch using WebSocket + AVAudioEngine
AmiVoice Cloud Platform-Tech Blog
First, download the sample program.
Copy the "com" folder in the sample program's "sample_1.1.8/Wrp/cs/src" into the Models folder.
*The "Wrp" file in the "com" folder describes the processing related to WebSocket. Therefore, please refer to it if you are writing your own processing related to WebSocket.
Next, add a new class file to the Models folder that handles the speech recognition results received via WebSocket (hereafter referred to as the WrpSimple class). All processing of the speech recognition results returned via WebSocket will be written in this WrpSimple class.
*In this application, we only parse the returned speech recognition results and display them in the UI, but there are various other events (for example, an event that fires when a speech segment is detected). Use each event according to the application you want to create.
Recognition result processing
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace RecApp.Models
{
    /// <summary>
    /// Class that handles recognition results
    /// </summary>
    class WrpSimple : com.amivoice.wrp.WrpListener
    {
        public void utteranceStarted(int startTime) { }
        public void utteranceEnded(int endTime) { }
        public void resultCreated() { }
        public void resultUpdated(string result) { }
        public void resultFinalized(string result)
        {
            string text = TextParse(result);
        }
        public void eventNotified(int eventId, string eventMessage) { }
        public void TRACE(string message) { }
        /// <summary>
        /// Converts the recognition result
        /// </summary>
        /// <param name="result">Recognition result string in JSON format</param>
        /// <returns>Recognition result text, or null if parsing fails</returns>
        private string TextParse(string result)
        {
            int index = result.LastIndexOf(",\"text\":\"");
            if (index == -1)
            {
                return null;
            }
            index += 9;
            int resultLength = result.Length;
            StringBuilder buffer = new StringBuilder();
            int c = (index >= resultLength) ? 0 : result[index++];
            while (c != 0)
            {
                if (c == '"')
                {
                    break;
                }
                if (c == '\\')
                {
                    c = (index >= resultLength) ? 0 : result[index++];
                    if (c == 0)
                    {
                        return null;
                    }
                    if (c == '"' || c == '\\' || c == '/')
                    {
                        buffer.Append((char)c);
                    }
                    else if (c == 'b' || c == 'f' || c == 'n' || c == 'r' || c == 't')
                    {
                        // Control-character escapes are skipped
                    }
                    else if (c == 'u')
                    {
                        int c0 = (index >= resultLength) ? 0 : result[index++];
                        int c1 = (index >= resultLength) ? 0 : result[index++];
                        int c2 = (index >= resultLength) ? 0 : result[index++];
                        int c3 = (index >= resultLength) ? 0 : result[index++];
                        if (c0 >= '0' && c0 <= '9') { c0 -= '0'; } else if (c0 >= 'A' && c0 <= 'F') { c0 -= 'A' - 10; } else if (c0 >= 'a' && c0 <= 'f') { c0 -= 'a' - 10; } else { c0 = -1; }
                        if (c1 >= '0' && c1 <= '9') { c1 -= '0'; } else if (c1 >= 'A' && c1 <= 'F') { c1 -= 'A' - 10; } else if (c1 >= 'a' && c1 <= 'f') { c1 -= 'a' - 10; } else { c1 = -1; }
                        if (c2 >= '0' && c2 <= '9') { c2 -= '0'; } else if (c2 >= 'A' && c2 <= 'F') { c2 -= 'A' - 10; } else if (c2 >= 'a' && c2 <= 'f') { c2 -= 'a' - 10; } else { c2 = -1; }
                        if (c3 >= '0' && c3 <= '9') { c3 -= '0'; } else if (c3 >= 'A' && c3 <= 'F') { c3 -= 'A' - 10; } else if (c3 >= 'a' && c3 <= 'f') { c3 -= 'a' - 10; } else { c3 = -1; }
                        if (c0 == -1 || c1 == -1 || c2 == -1 || c3 == -1)
                        {
                            return null;
                        }
                        buffer.Append((char)((c0 << 12) | (c1 << 8) | (c2 << 4) | c3));
                    }
                    else
                    {
                        return null;
                    }
                }
                else
                {
                    buffer.Append((char)c);
                }
                c = (index >= resultLength) ? 0 : result[index++];
            }
            return buffer.ToString();
        }
    }
}
Make the WrpSimple class implement the WrpListener interface that exists in "com.amivoice.wrp".
/// <summary>
/// Class that handles recognition results
/// </summary>
class WrpSimple : com.amivoice.wrp.WrpListener
Because the class implements the interface, you must implement the following seven members:
public void utteranceStarted(int startTime) { }
public void utteranceEnded(int endTime) { }
public void resultCreated() { }
public void resultUpdated(string result) { }
public void resultFinalized(string result)
{
    string text = TextParse(result);
}
public void eventNotified(int eventId, string eventMessage) { }
public void TRACE(string message) { }
In this case, we only use resultFinalized because we want to display the result in the UI when speech recognition is complete. resultFinalized is an event that fires when the speech recognition result for a specific speech segment is confirmed. The speech recognition result is a string in JSON format, so we need to parse it and extract only the necessary parts.
The "TextParse" method extracts only the recognition results from the JSON-formatted string.
/// <summary>
/// Converts the recognition result
/// </summary>
/// <param name="result">Recognition result string in JSON format</param>
/// <returns>Recognition result text, or null if parsing fails</returns>
private string TextParse(string result)
{
    int index = result.LastIndexOf(",\"text\":\"");
    if (index == -1)
    {
        return null;
    }
    index += 9;
    int resultLength = result.Length;
    StringBuilder buffer = new StringBuilder();
    int c = (index >= resultLength) ? 0 : result[index++];
    while (c != 0)
    {
        if (c == '"')
        {
            break;
        }
        if (c == '\\')
        {
            c = (index >= resultLength) ? 0 : result[index++];
            if (c == 0)
            {
                return null;
            }
            if (c == '"' || c == '\\' || c == '/')
            {
                buffer.Append((char)c);
            }
            else if (c == 'b' || c == 'f' || c == 'n' || c == 'r' || c == 't')
            {
                // Control-character escapes are skipped
            }
            else if (c == 'u')
            {
                int c0 = (index >= resultLength) ? 0 : result[index++];
                int c1 = (index >= resultLength) ? 0 : result[index++];
                int c2 = (index >= resultLength) ? 0 : result[index++];
                int c3 = (index >= resultLength) ? 0 : result[index++];
                if (c0 >= '0' && c0 <= '9') { c0 -= '0'; } else if (c0 >= 'A' && c0 <= 'F') { c0 -= 'A' - 10; } else if (c0 >= 'a' && c0 <= 'f') { c0 -= 'a' - 10; } else { c0 = -1; }
                if (c1 >= '0' && c1 <= '9') { c1 -= '0'; } else if (c1 >= 'A' && c1 <= 'F') { c1 -= 'A' - 10; } else if (c1 >= 'a' && c1 <= 'f') { c1 -= 'a' - 10; } else { c1 = -1; }
                if (c2 >= '0' && c2 <= '9') { c2 -= '0'; } else if (c2 >= 'A' && c2 <= 'F') { c2 -= 'A' - 10; } else if (c2 >= 'a' && c2 <= 'f') { c2 -= 'a' - 10; } else { c2 = -1; }
                if (c3 >= '0' && c3 <= '9') { c3 -= '0'; } else if (c3 >= 'A' && c3 <= 'F') { c3 -= 'A' - 10; } else if (c3 >= 'a' && c3 <= 'f') { c3 -= 'a' - 10; } else { c3 = -1; }
                if (c0 == -1 || c1 == -1 || c2 == -1 || c3 == -1)
                {
                    return null;
                }
                buffer.Append((char)((c0 << 12) | (c1 << 8) | (c2 << 4) | c3));
            }
            else
            {
                return null;
            }
        }
        else
        {
            buffer.Append((char)c);
        }
        c = (index >= resultLength) ? 0 : result[index++];
    }
    return buffer.ToString();
}
This process is a direct copy and paste of the text_ method in "WrpTester.cs" in the sample program "sample_1.1.8/Wrp/cs".
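If you would rather not maintain a hand-written parser, the same value can probably be read with System.Text.Json, which is included in .NET 5. The sketch below is an alternative, not the sample program's method. It assumes the result is a JSON object with a top-level "text" string property, which is what the search for ,"text":" in TextParse suggests. The ResultJson class name is made up for illustration.

```csharp
using System.Text.Json;

public static class ResultJson
{
    // Returns the top-level "text" value, or null when it is missing,
    // is not a string, or the input is not valid JSON.
    public static string ExtractText(string json)
    {
        try
        {
            using (JsonDocument doc = JsonDocument.Parse(json))
            {
                JsonElement root = doc.RootElement;
                if (root.ValueKind == JsonValueKind.Object &&
                    root.TryGetProperty("text", out JsonElement text) &&
                    text.ValueKind == JsonValueKind.String)
                {
                    return text.GetString();
                }
                return null;
            }
        }
        catch (JsonException)
        {
            return null;
        }
    }
}
```

One behavioral difference to be aware of: TextParse drops escapes such as \n and \t, whereas GetString decodes them into the actual control characters.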
Step 6: Linking the created program and UI
Preparation
We will now link the voice recording and voice recognition result processing created in steps 4 and 5, the class ported from the sample program, and the UI.
First, we will create the UI. We will not discuss UI design this time. Therefore, the app will work with just a WPF TextBox and Button. If you do not want to worry about the UI, copy and paste the following XAML code into "MainView.xaml".
*You will need to change all of the RecApp parts in the code to the name of the project you created.
<Window x:Class="RecApp.Views.MainView"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:RecApp.Views.Behavior"
        mc:Ignorable="d"
        Title="Recording App" Height="500" Width="300">
    <Window.Resources>
        <!-- Converter that lets a bool value control Visibility -->
        <BooleanToVisibilityConverter x:Key="BoolVisibilityConverter"/>
        <!-- Blinking animation -->
        <Storyboard x:Key="BlinkStory">
            <DoubleAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Opacity)" RepeatBehavior="Forever" AutoReverse="True">
                <LinearDoubleKeyFrame KeyTime="0" Value="1"/>
                <LinearDoubleKeyFrame KeyTime="0:0:1" Value="0"/>
            </DoubleAnimationUsingKeyFrames>
        </Storyboard>
        <!-- Style that starts the blinking animation -->
        <Style x:Key="BlinkingStyle" TargetType="TextBlock">
            <Style.Triggers>
                <Trigger Property="IsVisible" Value="True">
                    <Trigger.EnterActions>
                        <BeginStoryboard x:Name="BlinkingStoryboard1" Storyboard="{StaticResource BlinkStory}"/>
                    </Trigger.EnterActions>
                    <Trigger.ExitActions>
                        <StopStoryboard BeginStoryboardName="BlinkingStoryboard1"/>
                    </Trigger.ExitActions>
                </Trigger>
            </Style.Triggers>
        </Style>
    </Window.Resources>
    <Grid>
        <Grid.RowDefinitions>
            <RowDefinition Height="*"/>
            <RowDefinition Height="40"/>
        </Grid.RowDefinitions>
        <Grid Grid.Row="0" Background="LightGray" Panel.ZIndex="2" Opacity="0.5"
              Visibility="{Binding IsBlinkVisibility, Converter={StaticResource BoolVisibilityConverter}}">
            <TextBlock Foreground="Black" FontSize="60" TextAlignment="Center" VerticalAlignment="Center"
                       Visibility="{Binding IsBlinkVisibility, Converter={StaticResource BoolVisibilityConverter}}"
                       Style="{StaticResource BlinkingStyle}">
                Recording
            </TextBlock>
        </Grid>
        <TextBox Grid.Row="0" Text="{Binding RecognitionResultText}" Panel.ZIndex="1"
                 VerticalScrollBarVisibility="Visible" TextWrapping="Wrap"
                 local:ScrollToEndBehavior.AutoScrollToEnd="True"/>
        <Button Grid.Row="1" Content="{Binding ButtonContent}"
                Command="{Binding RecordingCommand}"/>
    </Grid>
</Window>
Here, I will briefly explain the following four points:
- Binding
- ScrollToEndBehavior.AutoScrollToEnd
- Storyboard
- Converter
・Binding
"Binding ***" binds to the data (property) of ***. By doing this, when you change the value of an element on the ViewModel side, it will be automatically reflected in the UI.[1]
・ScrollToEndBehavior.AutoScrollToEnd
When creating an app that follows MVVM, code-behind (writing code in MainView.xaml.cs) is avoided, so it is difficult to write processing that runs when the state of the View changes. The Behavior class is an alternative.[2] In the app we are creating this time, recognition results are appended to the end of the text. To always show the latest recognition result, the TextBox must scroll to the end every time a result is displayed, so we will use the Behavior class to do this. The implementation of the Behavior class will be described later.
・Storyboard
Storyboard allows you to create animations in WPF. Here, we are creating a blinking animation to prevent the text box from being operated while recording and to let you know that recording is in progress. The content of the animation is determined in the area enclosed by the Storyboard tag. In this case, we are changing the Opacity property to make it blink. The Style tag written below the Storyboard tag sets the trigger that will fire the created animation. In this case, we are making it animate when the element using this Style is visible on the screen. This article provides a detailed explanation of Storyboard. Please refer to it if you want to know more about Storyboard.
・Converter
A converter is literally something that converts. As described in this article, the IsChecked property (a Boolean), which indicates whether a checkbox is checked, can be represented as a string by using a Converter. In this app, we want the Visibility property to be "Visible" during recording and "Hidden" or "Collapsed" at other times. The Visibility property is an enumeration rather than a Boolean, so a Boolean value cannot directly control whether the element is displayed on the screen. We therefore use a Converter so that true becomes "Visible" and false becomes "Collapsed".
*The BooleanToVisibilityConverter class is implemented as standard. Therefore, by using this, you can implement the processing you want without having to create a new Converter class.
About ViewModel
The ViewModel class is responsible for connecting data-bound data and Model with the View. To data-bind to View elements, "INotifyPropertyChanged" and "ICommand" are used to notify the View that a value has changed. However, using these as is can be quite tedious. Therefore, to make it easier to use, install "Prism.Wpf" from NuGet. Prism is an MVVM framework. Using "Prism" makes it possible to write data-binding-related processes using "INotifyPropertyChanged" and "ICommand" in a short and simple manner.
The entire code for "MainViewModel.cs" is as follows:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.ComponentModel;
using com.amivoice.wrp;
using RecApp.Models;
using Prism.Commands;
using Prism.Mvvm;
namespace RecApp.ViewModels
{
    // Class that carries the recording state (used to change the button text)
    public class RecordingStateNotifyEventArgs
    {
        public RecordingState state { get; set; }
    }
    // Class that carries the text to display in the text box
    public class ResultNotifyEventArgs
    {
        public string result { get; set; }
    }
    /// <summary>
    /// Represents the data context for the MainView window.
    /// </summary>
    internal class MainViewModel : BindableBase
    {
        /// <summary>
        /// Handler for when the app is closed; disconnects the WebSocket
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        internal void Closing(object sender, CancelEventArgs e)
        {
            // Disconnect if the app is closed while the WebSocket is still connected
            wrp?.disconnect();
        }
        /// <summary>
        /// Content shown in the text box
        /// </summary>
        private string _text;
        public string RecognitionResultText
        {
            get { return this._text; }
            set
            {
                SetProperty(ref _text, value);
            }
        }
        /// <summary>
        /// Text shown on the record button
        /// </summary>
        private string _buttonContent = "Start recording";
        public string ButtonContent
        {
            get { return this._buttonContent; }
            set
            {
                SetProperty(ref _buttonContent, value);
            }
        }
        /// <summary>
        /// Controls the animation shown while recording
        /// </summary>
        private bool _isBlinkVisibility = false;
        public bool IsBlinkVisibility
        {
            get { return this._isBlinkVisibility; }
            set
            {
                SetProperty(ref _isBlinkVisibility, value);
            }
        }
        /// <summary>
        /// Gets the recording command.
        /// </summary>
        public DelegateCommand RecordingCommand { get; }
        // WebSocket-related field
        Wrp wrp;
        private Audio m_audio = null;
        public MainViewModel()
        {
            // Set the click handler
            RecordingCommand = new DelegateCommand(ButtonClick);
            // Create the event listener for the WebSocket speech recognition server
            WrpSimple listener = new WrpSimple();
            // Initialize the WebSocket speech recognition client
            wrp = Wrp.construct();
            wrp.setListener(listener);
            // Server URL to connect to
            wrp.setServerURL("wss://acp-api.amivoice.com/v1/");
            // Audio format
            wrp.setCodec("LSB16K");
            // Dictionary to use
            wrp.setGrammarFileNames("-a-general");
            // AppKey
            wrp.setAuthorization("Your AppKey");
            // Event for writing recognition results to the text box
            listener.ResultNotifyHandler += SetTextNotifyHandler;
            // Instantiate the class that handles NAudio
            m_audio = new Audio(wrp);
            // Event for changing the button text when the recording state changes
            // Register the function that receives the notification
            m_audio.RecordingStateChanged += RecordingStateChanged;
            // Event for writing the WebSocket connection state to the text box
            m_audio.ResultNotifyHandler += SetTextNotifyHandler;
        }
        private void RecordingStateChanged(object sender, RecordingStateNotifyEventArgs e)
        {
            switch (e.state)
            {
                case RecordingState.Recording:
                    {
                        ButtonContent = "Stop recording";
                        IsBlinkVisibility = true;
                        break;
                    }
                case RecordingState.Stop:
                case RecordingState.Error:
                    {
                        ButtonContent = "Start recording";
                        IsBlinkVisibility = false;
                        break;
                    }
            }
        }
        private void SetTextNotifyHandler(object sender, ResultNotifyEventArgs args)
        {
            RecognitionResultText += args.result + "\r\n";
        }
        /// <summary>
        /// Action performed when the button is clicked
        /// </summary>
        private void ButtonClick()
        {
            if (m_audio.m_recordingState != RecordingState.Stop)
            {
                // Stop recording
                m_audio.RecordingStop();
            }
            else
            {
                RecognitionResultText = "";
                // Start recording
                m_audio.RecordingStart();
            }
        }
    }
}
Let's go through the code from top to bottom.
To display the data generated by the Model in the View, the data must be passed to the ViewModel. Define event argument classes for the events that carry data from the Model to the ViewModel.
// Class that carries the recording state (used to change the button text)
public class RecordingStateNotifyEventArgs
{
    public RecordingState state { get; set; }
}
// Class that carries the text to display in the text box
public class ResultNotifyEventArgs
{
    public string result { get; set; }
}
This time, we will only be exchanging the recording status and strings, so we create classes that hold only the recording status and a string.
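To illustrate how a Model can hand such data to a ViewModel, here is a minimal sketch of the standard C# event pattern. TextNotifyEventArgs, TextSource, and TextNotified are hypothetical names chosen for this sketch, and the event declarations the article's own classes use may differ.

```csharp
using System;

// Hypothetical stand-in for an event argument class that carries a string.
public class TextNotifyEventArgs
{
    public string Result { get; set; }
}

// Hypothetical Model that reports text to whoever has subscribed.
public class TextSource
{
    public event EventHandler<TextNotifyEventArgs> TextNotified;

    public void Publish(string text)
    {
        // ?. avoids a NullReferenceException when nobody has subscribed.
        TextNotified?.Invoke(this, new TextNotifyEventArgs { Result = text });
    }
}
```

A subscriber registers a handler with +=, for example source.TextNotified += (sender, args) => shown += args.Result; and then receives every string passed to Publish. The Model never needs a reference to the ViewModel, which keeps the dependency pointing in one direction.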
/// <summary>
/// Represents the data context for the MainView window.
/// </summary>
internal class MainViewModel : BindableBase
"BindableBase" is a helper class for implementing "INotifyPropertyChanged". By inheriting this, you can easily notify the View side of value changes.
/// <summary>
/// Handler for when the app is closed; disconnects the WebSocket
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
internal void Closing(object sender, CancelEventArgs e)
{
    // Disconnect if the app is closed while the WebSocket is still connected
    wrp?.disconnect();
}
This disconnects the WebSocket if the app is shut down while still connected. The handler is added to MainView with "w.Closing += vm.Closing;" in "App.xaml.cs".
*Without this handler, closing the app while recording would leave the WebSocket connection active.
Next, the data bindings for the Button and TextBox and the trigger that starts the animation are written as follows.
/// <summary>
/// Content shown in the text box
/// </summary>
private string _text;
public string RecognitionResultText
{
    get { return this._text; }
    set
    {
        SetProperty(ref _text, value);
    }
}
/// <summary>
/// Text shown on the record button
/// </summary>
private string _buttonContent = "Start recording";
public string ButtonContent
{
    get { return this._buttonContent; }
    set
    {
        SetProperty(ref _buttonContent, value);
    }
}
/// <summary>
/// Controls the animation shown while recording
/// </summary>
private bool _isBlinkVisibility = false;
public bool IsBlinkVisibility
{
    get { return this._isBlinkVisibility; }
    set
    {
        SetProperty(ref _isBlinkVisibility, value);
    }
}
When a value is changed through SetProperty, a notification that the value has changed is sent to the View. The code for SetProperty is published on GitHub, so please refer to it for details.
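To show the idea behind a helper like SetProperty, here is a minimal sketch built only on INotifyPropertyChanged from the base class library. It is not Prism's actual source; ObservableObject and SampleViewModel are hypothetical names used for illustration.

```csharp
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.CompilerServices;

// Hypothetical base class illustrating what a SetProperty helper does.
public abstract class ObservableObject : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    protected bool SetProperty<T>(ref T field, T value, [CallerMemberName] string propertyName = null)
    {
        // Skip the notification when the value did not change.
        if (EqualityComparer<T>.Default.Equals(field, value)) return false;
        field = value;
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        return true;
    }
}

public class SampleViewModel : ObservableObject
{
    private string _text;
    public string Text
    {
        get { return _text; }
        set { SetProperty(ref _text, value); }
    }
}
```

The CallerMemberName attribute fills in the property name ("Text" here) at compile time, so each setter stays a one-liner and the name cannot drift out of sync with the property.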
The processing that starts and stops recording when the record button is pressed is exposed to the View through a DelegateCommand.
/// <summary>
/// Gets the recording command.
/// </summary>
public DelegateCommand RecordingCommand { get; }
// Set the click handler
RecordingCommand = new DelegateCommand(ButtonClick);
As shown above, pass the processing you want to perform when the button is pressed to DelegateCommand as an argument.
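For reference, the sketch below shows roughly what a parameterless command has to provide in order to be bound to a Button's Command property. It uses only the ICommand interface from the base class library. SimpleCommand is a hypothetical class written for this illustration and is not how Prism implements DelegateCommand.

```csharp
using System;
using System.Windows.Input;

// Hypothetical minimal command that wraps an Action.
public class SimpleCommand : ICommand
{
    private readonly Action _execute;

    public SimpleCommand(Action execute)
    {
        _execute = execute ?? throw new ArgumentNullException(nameof(execute));
    }

    // Required by ICommand; never raised here because CanExecute is always true.
    public event EventHandler CanExecuteChanged { add { } remove { } }

    public bool CanExecute(object parameter) => true;

    public void Execute(object parameter) => _execute();
}
```

When the button is clicked, WPF calls CanExecute and then Execute, which in turn runs the method passed to the constructor.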
The WebSocket-related part is as follows.
// WebSocket-related field
Wrp wrp;
// Create the event listener for the WebSocket speech recognition server
WrpSimple listener = new WrpSimple();
// Initialize the WebSocket speech recognition client
wrp = Wrp.construct();
wrp.setListener(listener);
// Server URL to connect to
wrp.setServerURL("wss://acp-api.amivoice.com/v1/");
// Audio format
wrp.setCodec("LSB16K");
// Dictionary to use
wrp.setGrammarFileNames("-a-general");
// AppKey
wrp.setAuthorization("Your AppKey");
// Event for writing recognition results to the text box
listener.ResultNotifyHandler += SetTextNotifyHandler;
This time we will not go into details about settings such as the server name to connect to, which are described in the other article mentioned above. Set the WrpSimple class created in step 5 as the listener for the Wrp class. This enables the speech recognition result events we implemented to fire. The WrpSimple class must also be modified to pass data to the ViewModel, so a new event is defined for that purpose. The contents of this definition will be described later.
The Audio class part is as follows.
private Audio m_audio = null;
// Instantiate the class that handles NAudio
m_audio = new Audio(wrp);
// Event for changing the button text when the recording state changes
// Register the function that receives the notification
m_audio.RecordingStateChanged += RecordingStateChanged;
// Event for writing the WebSocket connection state to the text box
m_audio.ResultNotifyHandler += SetTextNotifyHandler;
The Wrp class is required to send audio to the ACP server. Therefore, the Audio class also needs to be modified to connect with the ACP server and the ViewModel. The changes will be described later. For now, it is sufficient to understand that "events are defined and processing is added to connect with the ACP server and ViewModel."
About the Behavior class
First, create a file to write the Behavior class. Create a new Behavior folder in the Views folder. Create a class file in it and name it "ScrollToEndBehavior.cs" (hereafter referred to as the ScrollToEndBehavior class). Write a process in the ScrollToEndBehavior class that will automatically move the scroll bar to the bottom when new recognition results are displayed.
The ScrollToEndBehavior class, created as an attached behavior with reference to this article, is as follows.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows;
using System.Windows.Controls;
namespace RecApp.Views.Behavior
{
    class ScrollToEndBehavior
    {
        /// <summary>
        /// For multi-line text:
        /// keeps the last line visible when text is added
        /// </summary>
        public static readonly DependencyProperty AutoScrollToEndProperty =
            DependencyProperty.RegisterAttached(
                "AutoScrollToEnd", // Property name
                typeof(bool), // Property type
                typeof(ScrollToEndBehavior), // Type that owns the property
                new FrameworkPropertyMetadata(false, IsTextChanged) // Metadata
            );
        // Get and Set accessors are required for an attached property
        [AttachedPropertyBrowsableForType(typeof(TextBox))] // Type for which the property is shown on the XAML side
        public static bool GetAutoScrollToEnd(DependencyObject obj)
        {
            return (bool)obj.GetValue(AutoScrollToEndProperty);
        }
        public static void SetAutoScrollToEnd(DependencyObject obj, bool value)
        {
            obj.SetValue(AutoScrollToEndProperty, value);
        }
        // Called when the attached property value changes
        private static void IsTextChanged(DependencyObject sender, DependencyPropertyChangedEventArgs e)
        {
            TextBox textBox = sender as TextBox;
            if (textBox == null) return;
            // Register or unregister the event handler
            textBox.TextChanged -= OnTextChanged;
            bool newValue = (bool)e.NewValue;
            if (newValue == true)
            {
                textBox.TextChanged += OnTextChanged;
            }
        }
        private static void OnTextChanged(object sender, TextChangedEventArgs e)
        {
            TextBox textBox = sender as TextBox;
            if (textBox == null) return;
            if (string.IsNullOrEmpty(textBox.Text)) return;
            if (textBox.IsKeyboardFocused == false) textBox.ScrollToEnd();
        }
    }
}
Attached behaviors are explained in detail in this article, so I will only explain the OnTextChanged handler that I added.
OnTextChanged is set to scroll to the end if you change the text in the TextBox. However, it does not scroll if the TextBox has keyboard focus. This is because when editing the recognition results, scrolling to the end every time you change a single character makes editing difficult. For this reason, it is set to scroll only while recording.
WPF has two concepts of focus:
- Keyboard focus: the element that is currently receiving keyboard input. Only one element on the entire desktop can have it.
- Logical focus: the focused element within a focus scope. Each focus scope has at most one, but an application can contain multiple focus scopes, so multiple elements can have logical focus at the same time.
For more details, please refer to the official documentation.
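The difference between the two kinds of focus can be checked in code. The following is a minimal sketch, assuming a WPF project; the class name FocusSketch and its two helper methods are hypothetical names introduced only for illustration, while Keyboard.FocusedElement, FocusManager.GetFocusScope, and FocusManager.GetFocusedElement are the standard WPF members.

```csharp
using System.Windows;
using System.Windows.Input;

// Hypothetical helper class, not part of the article's project.
public static class FocusSketch
{
    // True when the element is the single element on the desktop
    // that is currently receiving keyboard input.
    public static bool HasKeyboardFocus(IInputElement element)
    {
        return ReferenceEquals(Keyboard.FocusedElement, element);
    }

    // True when the element is the focused element of the focus scope it belongs to.
    // Several elements in one application can satisfy this at once,
    // as long as each one lives in a different focus scope.
    public static bool HasLogicalFocus(DependencyObject element)
    {
        DependencyObject scope = FocusManager.GetFocusScope(element);
        return ReferenceEquals(FocusManager.GetFocusedElement(scope), element);
    }
}
```

ScrollToEndBehavior only needs the keyboard focus check, which is what TextBox.IsKeyboardFocused provides.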
Interaction between the recording class and WebSocket processing
To perform speech recognition, the recorded audio data must be passed to the ACP server. To do this, we will modify the Audio class. Since we will modify the Audio class again in conjunction with the UI in the next section, here we will explain the methods used to connect to the ACP server and perform speech recognition.
using com.amivoice.wrp;
// WebSocket-related class
private Wrp m_wrp = null;
public Audio(Wrp wrp)
{
// Create a 16000 Hz, 1-channel, 16-bit PCM format
m_recordinFormat = new WaveFormat(16000, 1);
m_wrp = wrp;
}
First, to use the Wrp class, add "using com.amivoice.wrp;" at the top of the file and declare a field in the Audio class to hold a Wrp object. The object is passed in through the constructor when the Audio class is instantiated.
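// Connect to the speech recognition server
m_wrp.connect();
// Start sending audio data to the speech recognition server
m_wrp.feedDataResume();
// Send audio data to the speech recognition server
m_wrp.feedData(ee.Buffer, 0, ee.BytesRecorded);
// Finish sending audio data to the speech recognition server
m_wrp.feedDataPause();
// Disconnect from the speech recognition server
m_wrp.disconnect();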
The procedure for sending audio data to the ACP server is as follows:
- Connect to the server
- Send a message announcing the start of audio data transmission
- Send the audio data to the server
- Send a transmission completion message
- Disconnect from the server
Following these steps sends the audio data to the server and performs speech recognition. The corresponding Wrp methods are:
- connect
- feedDataResume
- feedData
- feedDataPause
- disconnect
Calling these in order performs speech recognition. Because feedData is the method that sends the audio data to the ACP server, it must be called inside the DataAvailable event handler of the WaveIn class.
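To make the call order concrete, here is a minimal sketch that does not talk to ACP at all. IRecognizerSession, FakeSession, and SendSketch are hypothetical stand-ins written for this example; they only mirror the five method names above and are not the real com.amivoice.wrp.Wrp API. The fake records each call so the order can be inspected.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the five Wrp methods used in this article.
// It is NOT the real Wrp class; only the call order matters here.
public interface IRecognizerSession
{
    bool connect();
    bool feedDataResume();
    bool feedData(byte[] data, int offset, int length);
    bool feedDataPause();
    void disconnect();
}

// Fake implementation that only records which methods were called.
public class FakeSession : IRecognizerSession
{
    public List<string> Calls { get; } = new List<string>();
    public bool connect() { Calls.Add("connect"); return true; }
    public bool feedDataResume() { Calls.Add("feedDataResume"); return true; }
    public bool feedData(byte[] data, int offset, int length) { Calls.Add("feedData"); return true; }
    public bool feedDataPause() { Calls.Add("feedDataPause"); return true; }
    public void disconnect() { Calls.Add("disconnect"); }
}

public static class SendSketch
{
    // Sends every chunk in order and returns false as soon as any step fails.
    public static bool SendAll(IRecognizerSession session, IEnumerable<byte[]> chunks)
    {
        if (!session.connect()) return false;
        try
        {
            if (!session.feedDataResume()) return false;
            foreach (byte[] chunk in chunks)
            {
                if (!session.feedData(chunk, 0, chunk.Length)) return false;
            }
            return session.feedDataPause();
        }
        finally
        {
            // Once connected, always disconnect, whether or not sending succeeded.
            session.disconnect();
        }
    }
}
```

In the actual application the chunks are not available up front: each one arrives in a DataAvailable callback, which is why the feedData call sits inside that handler rather than in a loop like this one.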
Interaction between recording class, recognition result processing class and UI
Here's what our final Audio class looks like:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using NAudio.Wave;
using com.amivoice.wrp;
using RecApp.ViewModels;
namespace RecApp.Models
{
/// <summary>
/// Enumeration that represents the recording state
/// </summary>
public enum RecordingState
{
Recording,
Stop,
Error
}
/// <summary>
/// Class that operates NAudio
/// </summary>
class Audio
{
// Class that performs the recording
private WaveIn m_waveIn = null;
// Class that holds the recording format
private WaveFormat m_recordinFormat = null;
// WebSocket-related class
private Wrp m_wrp = null;
// Event for writing connection information to the text box
public event EventHandler<ResultNotifyEventArgs> ResultNotifyHandler;
// Backing field that holds the recording state
private RecordingState _recordingState = RecordingState.Stop;
// The event is raised whenever this value changes
public RecordingState m_recordingState
{
get
{
return _recordingState;
}
private set
{
// Has the state actually changed?
if (this._recordingState == value) return;
this._recordingState = value;
// Change notification is handled centrally here
OnRecordingStateChanged(value);
}
}
// Event that notifies of recording state changes
public event EventHandler<RecordingStateNotifyEventArgs> RecordingStateChanged;
public Audio(Wrp wrp)
{
// Create a 16000 Hz, 1-channel, 16-bit PCM format
m_recordinFormat = new WaveFormat(16000, 1);
m_wrp = wrp;
}
/// <summary>
/// Start recording
/// </summary>
public void RecordingStart()
{
// Connect to the speech recognition server
if (m_wrp.connect() == false)
{
SetText(m_wrp.getLastMessage());
SetText("WebSocket: failed to connect to the speech recognition server.");
return;
}
SetText("WebSocket: connected to the speech recognition server.");
// Initialize WaveIn and set the recording format
m_waveIn = new WaveIn();
m_waveIn.WaveFormat = m_recordinFormat;
// Recording device setting:
// record from the default device
m_waveIn.DeviceNumber = 0;
// Event raised while recording
m_waveIn.DataAvailable += (_, ee) =>
{
try
{
// Send only while recording (no error has occurred)
if (m_recordingState == RecordingState.Recording)
{
// Send audio data to the speech recognition server
if (m_wrp.feedData(ee.Buffer, 0, ee.BytesRecorded) == false)
{
m_recordingState = RecordingState.Error;
SetText(m_wrp.getLastMessage());
SetText("WebSocket: failed to send audio data to the speech recognition server.");
// Stop recording
RecordingStop();
}
}
}
catch (Exception e)
{
SetText("An error occurred");
SetText(e.Message);
// Check whether an error has already been handled
if (m_recordingState == RecordingState.Recording)
{
// Prevent the stop processing from running twice
m_recordingState = RecordingState.Error;
// Stop recording
RecordingStop();
}
}
};
// Event raised when recording stops
m_waveIn.RecordingStopped += (_,__) =>
{
SetText("Recording stopped");
// If the state is Recording or Error, release the instance
if (m_recordingState != RecordingState.Stop)
{
// Update the recording state
m_recordingState = RecordingState.Stop;
// Release the WaveIn instance
m_waveIn.Dispose();
m_waveIn = null;
// Finish sending audio data to the speech recognition server
if (m_wrp.feedDataPause() == false)
{
SetText(m_wrp.getLastMessage());
SetText("WebSocket: failed to finish sending audio data to the speech recognition server.");
}
// Disconnect from the speech recognition server
m_wrp.disconnect();
SetText("WebSocket: disconnected from the speech recognition server.");
}
};
// Start sending audio data to the speech recognition server
if (m_wrp.feedDataResume() == false)
{
m_wrp.disconnect();
SetText(m_wrp.getLastMessage());
SetText("WebSocket: failed to start sending audio data to the speech recognition server.");
return;
}
// Start recording
m_waveIn.StartRecording();
m_recordingState = RecordingState.Recording;
}
/// <summary>
/// Stop recording
/// </summary>
public void RecordingStop()
{
// Stop recording
m_waveIn?.StopRecording();
}
/// <summary>
/// Writes the connection status to the text box
/// </summary>
/// <param name="result">Text to write</param>
private void SetText(string result)
{
if (ResultNotifyHandler != null)
{
var args = new ResultNotifyEventArgs() { result = result };
ResultNotifyHandler(this, args);
}
}
/// <summary>
/// Raises the recording state changed event
/// </summary>
/// <param name="state">New recording state</param>
private void OnRecordingStateChanged(RecordingState state)
{
if (RecordingStateChanged != null)
{
var args = new RecordingStateNotifyEventArgs() { state = state };
RecordingStateChanged(this, args);
}
}
}
}
Define an event so that the Model can report whether the WebSocket connection succeeded and have that information displayed in the View.
// Event for writing connection information to the text box
public event EventHandler<ResultNotifyEventArgs> ResultNotifyHandler;
/// <summary>
/// Writes the connection status to the text box
/// </summary>
/// <param name="result">Text to write</param>
private void SetText(string result)
{
if (ResultNotifyHandler != null)
{
var args = new ResultNotifyEventArgs() { result = result };
ResultNotifyHandler(this, args);
}
}
Create an event so that the button text and the blinking display can change according to the recording state.
// Backing field that holds the recording state
private RecordingState _recordingState = RecordingState.Stop;
// The event is raised whenever this value changes
public RecordingState m_recordingState
{
get
{
return _recordingState;
}
private set
{
// Has the state actually changed?
if (this._recordingState == value) return;
this._recordingState = value;
// Change notification is handled centrally here
OnRecordingStateChanged(value);
}
}
// Event that notifies of recording state changes
public event EventHandler<RecordingStateNotifyEventArgs> RecordingStateChanged;
/// <summary>
/// Raises the recording state changed event
/// </summary>
/// <param name="state">New recording state</param>
private void OnRecordingStateChanged(RecordingState state)
{
if (RecordingStateChanged != null)
{
var args = new RecordingStateNotifyEventArgs() { state = state };
RecordingStateChanged(this, args);
}
}
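The receiving side of this event might look like the following. This is a minimal, self-contained sketch: DemoState, DemoStateEventArgs, DemoRecorder, and DemoButtonViewModel are hypothetical stand-ins for the article's RecordingState, RecordingStateNotifyEventArgs, Audio, and ViewModel, renamed so the example compiles on its own, and the button labels are made up for illustration. The setter is public here only to keep the demo short.

```csharp
using System;

// Stand-in for the article's RecordingState enum.
public enum DemoState { Recording, Stop, Error }

// Stand-in for the article's RecordingStateNotifyEventArgs.
public class DemoStateEventArgs : EventArgs
{
    public DemoState State { get; set; }
}

// Publisher: raises the event only when the state actually changes.
public class DemoRecorder
{
    private DemoState _state = DemoState.Stop;
    public event EventHandler<DemoStateEventArgs> StateChanged;

    public DemoState State
    {
        get { return _state; }
        set
        {
            if (_state == value) return; // no change, no notification
            _state = value;
            StateChanged?.Invoke(this, new DemoStateEventArgs { State = value });
        }
    }
}

// Subscriber: a hypothetical ViewModel that derives the button label from the state.
public class DemoButtonViewModel
{
    public string ButtonText { get; private set; } = "Start recording";
    public int Notifications { get; private set; }

    public DemoButtonViewModel(DemoRecorder recorder)
    {
        recorder.StateChanged += (sender, e) =>
        {
            Notifications++;
            ButtonText = e.State == DemoState.Recording ? "Stop recording" : "Start recording";
        };
    }
}
```

Because the setter returns early when the value is unchanged, assigning the same state twice notifies the subscriber only once, which keeps the UI from being refreshed needlessly.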
Then your final WrpSimple.cs should look like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using RecApp.ViewModels;
namespace RecApp.Models
{
/// <summary>
/// Class that handles recognition results
/// </summary>
class WrpSimple : com.amivoice.wrp.WrpListener
{
// Event for writing text (here, the recognition results) to the text box
public event EventHandler<ResultNotifyEventArgs> ResultNotifyHandler;
public void utteranceStarted(int startTime) { }
public void utteranceEnded(int endTime) { }
public void resultCreated() { }
public void resultUpdated(string result) { }
public void resultFinalized(string result)
{
string text = TextParse(result);
SetText(text);
}
public void eventNotified(int eventId, string eventMessage) { }
public void TRACE(string message) { }
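/// <summary>
/// Converts the recognition result
/// </summary>
/// <param name="result">Recognition result string in JSON format</param>
/// <returns>Recognized text, or null if it cannot be extracted</returns>
private string TextParse(string result)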
{
int index = result.LastIndexOf(",\"text\":\"");
if (index == -1)
{
return null;
}
index += 9;
int resultLength = result.Length;
StringBuilder buffer = new StringBuilder();
int c = (index >= resultLength) ? 0 : result[index++];
while (c != 0)
{
if (c == '"')
{
break;
}
if (c == '\\')
{
c = (index >= resultLength) ? 0 : result[index++];
if (c == 0)
{
return null;
}
if (c == '"' || c == '\\' || c == '/')
{
buffer.Append((char)c);
}
else
if (c == 'b' || c == 'f' || c == 'n' || c == 'r' || c == 't')
{
}
else
if (c == 'u')
{
int c0 = (index >= resultLength) ? 0 : result[index++];
int c1 = (index >= resultLength) ? 0 : result[index++];
int c2 = (index >= resultLength) ? 0 : result[index++];
int c3 = (index >= resultLength) ? 0 : result[index++];
if (c0 >= '0' && c0 <= '9') { c0 -= '0'; } else if (c0 >= 'A' && c0 <= 'F') { c0 -= 'A' - 10; } else if (c0 >= 'a' && c0 <= 'f') { c0 -= 'a' - 10; } else { c0 = -1; }
if (c1 >= '0' && c1 <= '9') { c1 -= '0'; } else if (c1 >= 'A' && c1 <= 'F') { c1 -= 'A' - 10; } else if (c1 >= 'a' && c1 <= 'f') { c1 -= 'a' - 10; } else { c1 = -1; }
if (c2 >= '0' && c2 <= '9') { c2 -= '0'; } else if (c2 >= 'A' && c2 <= 'F') { c2 -= 'A' - 10; } else if (c2 >= 'a' && c2 <= 'f') { c2 -= 'a' - 10; } else { c2 = -1; }
if (c3 >= '0' && c3 <= '9') { c3 -= '0'; } else if (c3 >= 'A' && c3 <= 'F') { c3 -= 'A' - 10; } else if (c3 >= 'a' && c3 <= 'f') { c3 -= 'a' - 10; } else { c3 = -1; }
if (c0 == -1 || c1 == -1 || c2 == -1 || c3 == -1)
{
return null;
}
buffer.Append((char)((c0 << 12) | (c1 << 8) | (c2 << 4) | c3));
}
else
{
return null;
}
}
else
{
buffer.Append((char)c);
}
c = (index >= resultLength) ? 0 : result[index++];
}
return buffer.ToString();
}
/// <summary>
/// Writes the recognition result to the text box
/// </summary>
/// <param name="result">Text to write</param>
private void SetText(string result)
{
if (ResultNotifyHandler != null)
{
var args = new ResultNotifyEventArgs() { result = result };
ResultNotifyHandler(this, args);
}
}
}
}
WrpSimple.cs defines the same kind of event as Audio.cs for passing information to the View. Here, the event is used to display the recognition results.
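TextParse scans the JSON string by hand. As an alternative, the same value could probably be read with System.Text.Json, which is included in .NET 5. The following is only a sketch under one assumption: that the text TextParse extracts is a string property named "text" at the top level of the result JSON. ResultTextSketch and ExtractText are hypothetical names, not part of the article's code or of the ACP library.

```csharp
using System;
using System.Text.Json;

// Hypothetical alternative to TextParse, not the article's implementation.
public static class ResultTextSketch
{
    // Returns the top-level "text" value, or null if the input is not valid JSON
    // or has no string property named "text".
    // ASSUMPTION: the recognized text is stored in a top-level "text" property.
    public static string ExtractText(string resultJson)
    {
        if (string.IsNullOrEmpty(resultJson)) return null;
        try
        {
            using (JsonDocument doc = JsonDocument.Parse(resultJson))
            {
                JsonElement root = doc.RootElement;
                if (root.ValueKind == JsonValueKind.Object &&
                    root.TryGetProperty("text", out JsonElement text) &&
                    text.ValueKind == JsonValueKind.String)
                {
                    // GetString() decodes escapes such as \" and \uXXXX
                    return text.GetString();
                }
                return null;
            }
        }
        catch (JsonException)
        {
            // Malformed JSON is treated the same way TextParse treats a parse failure
            return null;
        }
    }
}
```

One behavioral difference to keep in mind: TextParse drops the escapes \b, \f, \n, \r, and \t, whereas GetString() decodes them into the corresponding control characters.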
Summary
This time, I tried to implement microphone recording in a Windows application. I hope you will also try developing a voice recognition application using ACP.
Person who wrote this article
- Koseki Yuuta
I am developing server side using Python.
