IDOL Speech Server can perform real-time speech-to-text conversion on a file or audio stream. The following sample configuration is for such a task.
    [SpeechToText]
    0 = a,ts <- audio(MONO, input)
    1 = f <- frontend(_, a)
    2 = nf <- normalizer(_, f)
    3 = w1 <- stt(_, nf)
    4 = w2 <- postproc(_, w1)
    5 = output <- wout(_, w2, ts)
    DefaultResults=Out
| Step | Description |
|---|---|
| 0 | The audio module processes the streamed audio data. |
| 1 | The frontend module converts audio data into speech front-end frame data. |
| 2 | The normalizer module normalizes the frame data from step 1 (f). |
| 3 | The stt module converts the normalized frame data from step 2 (nf) into text. |
| 4 | The postproc module runs any post-processing tasks on the results from step 3 (w1). |
| 5 | The wout module writes the recognized words from step 4 (w2) to the output file. |
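
To make the data flow between the steps explicit, the following Python sketch parses the schema lines from the sample configuration and reports which earlier steps each step depends on. It is only an illustration of how the named streams (a, ts, f, nf, w1, w2) link the modules together, not part of IDOL Speech Server. The helper names parse_schema and dependencies are hypothetical, and the sketch assumes that the first argument of each module call (MONO or _) is a mode setting rather than an input stream.

```python
import re

# Schema lines copied from the [SpeechToText] sample configuration.
SCHEMA = """\
0 = a,ts <- audio(MONO, input)
1 = f <- frontend(_, a)
2 = nf <- normalizer(_, f)
3 = w1 <- stt(_, nf)
4 = w2 <- postproc(_, w1)
5 = output <- wout(_, w2, ts)
"""

# index = outputs <- module(arguments)
STEP_RE = re.compile(r"^(\d+)\s*=\s*(.+?)\s*<-\s*(\w+)\((.*)\)$")


def parse_schema(text):
    """Parse schema lines into a list of step dictionaries (hypothetical helper)."""
    steps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = STEP_RE.match(line)
        if match is None:
            raise ValueError(f"Unrecognized schema line: {line!r}")
        index, outputs, module, args = match.groups()
        arg_list = [arg.strip() for arg in args.split(",")]
        steps.append({
            "index": int(index),
            "module": module,
            "outputs": [name.strip() for name in outputs.split(",")],
            # Assumption: the first argument is a mode, the rest are input streams.
            "mode": arg_list[0],
            "inputs": arg_list[1:],
        })
    return steps


def dependencies(steps):
    """Map each step index to the sorted indexes of the steps that feed it."""
    produced_by = {}
    deps = {}
    for step in steps:
        deps[step["index"]] = sorted(
            {produced_by[name] for name in step["inputs"] if name in produced_by}
        )
        for name in step["outputs"]:
            produced_by[name] = step["index"]
    return deps


if __name__ == "__main__":
    for index, feeds in dependencies(parse_schema(SCHEMA)).items():
        print(index, "depends on", feeds)
```

In this reading, step 5 depends on both step 4 (the words in w2) and step 0 (the ts stream), while every other step depends only on the step immediately before it.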
You can also use the [SpeechToTextFilter] schema to combine speech-to-text conversion of an audio stream with speech classification in a single step. You can then remove sections classified as music or noise from the resulting output file.