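The following schema shows how to use the mixer module to combine the word
output of multiple modules into a single timeline. In this example, the left and right channels of a stereo file are recognized separately and the two word streams are then merged.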
[StereoSpeechToText2]
0 = l,r,ts ← audio(STEREO, Input)
1 = f1 ← frontend1(_, a:l)
2 = nf1 ← normalizer1(_, f1)
3 = w1 ← stt1(_, nf1)
4 = f2 ← frontend2(_, a:r)
5 = nf2 ← normalizer2(_, f2)
6 = w2 ← stt2(_, nf2)
7 = w3 ← mixer(_, wa:w1, wb:w2)
8 = output ← wout(_, w3, ts)
| Step | Description |
|---|---|
| 0 | The audio module splits the input stereo audio file into left (l) and right (r) audio data. The schema also binds a third output, ts, which is passed to wout in step 8. |
| 1 | The frontend1 module converts the left audio channel (l) into speech front-end frame data (f1). In this step, the form a:l renames the left-channel audio data (type l) to audio data (type a). |
| 2 | The normalizer1 module normalizes the frame data from step 1 (f1). |
| 3 | The stt1 module converts the normalized frame data from step 2 (nf1) into text (w1). |
| 4 | The frontend2 module converts the right audio channel (r) into speech front-end frame data (f2). In this step, the form a:r renames the right-channel audio data (type r) to audio data (type a). |
| 5 | The normalizer2 module normalizes the frame data from step 4 (f2). |
| 6 | The stt2 module converts the normalized frame data from step 5 (nf2) into text (w2). |
| 7 | The mixer module combines the recognized words from step 3 (w1) and step 6 (w2) into a single word output timeline (w3). |
| 8 | The wout module writes the recognized words from step 7 (w3) to the output file. |
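
The schema does not say how the mixer orders words from its two inputs. As a rough illustration only, the Python sketch below assumes that each recognized word carries a start time and that mixing means interleaving the two streams by start time. The `Word` type, its fields, and the `mix` function are hypothetical names for this example, not part of the product.

```python
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True)
class Word:
    """Hypothetical recognized word; field names are assumptions."""
    text: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    channel: str   # "l" or "r", kept so the source channel is not lost


def mix(wa, wb):
    """Merge two word lists, each already sorted by start time,
    into a single timeline ordered by start time.

    This is an assumed behavior, not the documented mixer algorithm.
    On equal start times, words from `wa` come first.
    """
    return list(merge(wa, wb, key=lambda w: w.start))


# w1 and w2 stand in for the outputs of stt1 (step 3) and stt2 (step 6).
w1 = [Word("hello", 0.0, 0.4, "l"), Word("there", 1.2, 1.6, "l")]
w2 = [Word("hi", 0.5, 0.7, "r"), Word("yes", 1.7, 1.9, "r")]

w3 = mix(w1, w2)  # corresponds to step 7
print([(w.channel, w.text) for w in w3])
```

Keeping the channel on each word is a design choice for this sketch: once the two streams share one timeline, it is the only way to tell which speaker or channel a word came from.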