To perform speech-to-text conversion on stereo audio input data, each channel can be processed separately. For example:
[SpeechToText]
0 = l,r ← audio(STEREO, input)
1 = f1 ← frontend1(_, a:l)
2 = nf1 ← normalizer1(_, f1)
3 = w1 ← stt1(_, nf1)
4 = output ← wout1(_, w1)
5 = f2 ← frontend2(_, a:r)
6 = nf2 ← normalizer2(_, f2)
7 = w2 ← stt2(_, nf2)
8 = output ← wout2(_, w2)
0 | The audio module reads the stereo input audio file and outputs the left and right channels as separate audio data (l and r).
1 | The frontend1 module converts the left audio channel (l) into speech front-end frame data. In this step, the form a:l renames the left-channel audio data (type l) to audio data (type a).
2 | The normalizer1 module normalizes the frame data from step 1 (f1).
3 | The stt1 module converts the normalized frame data from step 2 (nf1) into text.
4 | The wout1 module writes the recognized words from step 3 (w1) to the output file.
5 | The frontend2 module converts the right audio channel (r) into speech front-end frame data. In this step, the form a:r renames the right-channel audio data (type r) to audio data (type a).
6 | The normalizer2 module normalizes the frame data from step 5 (f2).
7 | The stt2 module converts the normalized frame data from step 6 (nf2) into text.
8 | The wout2 module writes the recognized words from step 7 (w2) to the output file.
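
The per-channel data flow described above can be mimicked in ordinary code. The following Python sketch is only an illustration of the structure, not the product's implementation: the function names (audio_stereo, frontend, normalizer, stt, speech_to_text) are hypothetical, the stereo input is assumed to be interleaved samples, peak scaling stands in for whatever normalization the real module applies, and the recognizer is a placeholder that emits one dummy token per frame.

```python
from typing import Dict, List, Sequence, Tuple

Frame = List[float]


def audio_stereo(interleaved: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Step 0: split interleaved samples (L0, R0, L1, R1, ...) into l and r."""
    return list(interleaved[0::2]), list(interleaved[1::2])


def frontend(samples: Sequence[float], frame_size: int = 4) -> List[Frame]:
    """Steps 1 and 5: cut one channel into fixed-size frames (placeholder front end)."""
    return [list(samples[i:i + frame_size]) for i in range(0, len(samples), frame_size)]


def normalizer(frames: Sequence[Frame]) -> List[Frame]:
    """Steps 2 and 6: scale each frame so its peak absolute value is 1 (assumed scheme)."""
    normalized = []
    for frame in frames:
        peak = max((abs(x) for x in frame), default=0.0)
        # Leave all-zero frames unchanged to avoid dividing by zero.
        normalized.append([x / peak for x in frame] if peak else list(frame))
    return normalized


def stt(frames: Sequence[Frame]) -> List[str]:
    """Steps 3 and 7: placeholder recognizer that emits one dummy token per frame."""
    return [f"word{i}" for i in range(len(frames))]


def speech_to_text(interleaved: Sequence[float]) -> Dict[str, List[str]]:
    """Run the same chain on each channel and collect both results in one output."""
    left, right = audio_stereo(interleaved)
    output: Dict[str, List[str]] = {}
    for name, channel in (("l", left), ("r", right)):
        # Steps 4 and 8: both channels write into the same output.
        output[name] = stt(normalizer(frontend(channel)))
    return output


if __name__ == "__main__":
    samples = [0.5, -1.0, 1.0, -2.0, 0.25, -4.0, 0.5, -2.0]
    print(speech_to_text(samples))
```

The two chains never share intermediate data, which mirrors the listing: f1, nf1, and w1 belong only to the left channel, f2, nf2, and w2 only to the right, and the two paths meet only at the shared output.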