IDOL Speech Server contains a separate module for each processing step. You can implement speech-to-text by chaining together the processing modules available in IDOL Speech Server. The process flow for speech-to-text, from the audio data point of view, is:

1. The audio module reads the audio file and prepares windowed data. a is the audio window series.
2. The frontend module converts the windowed audio into acoustic feature vectors. f is the feature vector series.
3. The normalizer module normalizes the feature vectors. nf is the normalized feature vector series.
4. The stt module performs speech-to-text on the normalized feature vectors. w is the output time-marked word series.
5. The wout module writes the time-marked words to the output.
You must create this processing sequence in the IDOL Speech Server tasks configuration file to represent a single action. In this case, the sequence results in the following configuration section.
[MySpeechToText]
0 = a, ts ← audio (MONO, input)
1 = f ← frontend (_, a)
2 = nf ← normalizer (_, f)
3 = w ← stt(_, nf)
4 = output ← wout(_, w, ts)
The notation comes from the functional programming style. The line numbers (0 =, 1 =, and so on) are redundant and indicate line numbering only. audio, frontend, normalizer, stt, and wout are the names of the module configuration sections. The arguments to each module take the form (MODE, INPUT(s)). So (_, w) specifies the default mode, "_", and the input word stream, w. a, f, nf, and w all refer to instances of different types of data series. So line 0 specifies that the audio module operates in MONO mode, receiving an input file and producing a as the output data stream (a represents mono audio data; see Name a Data Stream Instance for data stream types). Line 1 specifies that the frontend module operates in default mode, receiving a as the input data stream and producing f as the output data stream.
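To make the dataflow concrete, the following Python sketch (purely illustrative, and not part of IDOL Speech Server) parses task lines written in this notation and checks that every input stream is either the reserved input keyword or was produced by an earlier line.

import re

# The task lines from the [MySpeechToText] section above.
TASK_LINES = [
    "0 = a, ts ← audio (MONO, input)",
    "1 = f ← frontend (_, a)",
    "2 = nf ← normalizer (_, f)",
    "3 = w ← stt(_, nf)",
    "4 = output ← wout(_, w, ts)",
]

# Matches "n = outputs ← module (mode, inputs)".
LINE_RE = re.compile(
    r"^\s*\d+\s*=\s*(?P<outputs>[^←]+?)\s*←\s*(?P<module>\w+)\s*"
    r"\(\s*(?P<mode>[^,]+?)\s*,\s*(?P<inputs>[^)]+?)\s*\)\s*$"
)

def check_dataflow(lines):
    """Check that each input stream is defined before the line that uses it."""
    produced = {"input"}  # the input file is supplied by the server, not by a module
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            raise ValueError(f"unparseable task line: {line!r}")
        for stream in (s.strip() for s in match["inputs"].split(",")):
            if stream not in produced:
                raise ValueError(f"{match['module']}: stream {stream!r} is not defined yet")
        produced.update(s.strip() for s in match["outputs"].split(","))
    return produced

print(sorted(check_dataflow(TASK_LINES)))  # a, f, input, nf, output, ts, w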
You must also configure each of the processing modules in the sequence. The following example configures the audio module.
[audio]
SampleFrequency = 16000
File = C:\Audio\Speech.wav
This example configures the audio module to operate at a sample frequency of 16 kHz and to read the audio from the Speech.wav file.
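Because this example expects 16 kHz mono audio, you might want to confirm the properties of the WAV file before you run the task. The following Python sketch is a simple illustration that uses only the standard wave module; the path is the one assumed in the example configuration above.

import wave

# Path used in the example [audio] configuration above.
AUDIO_FILE = r"C:\Audio\Speech.wav"

with wave.open(AUDIO_FILE, "rb") as wav:
    channels = wav.getnchannels()      # 1 = mono, 2 = stereo
    sample_rate = wav.getframerate()   # samples per second
    seconds = wav.getnframes() / float(sample_rate)

print(f"{AUDIO_FILE}: {channels} channel(s), {sample_rate} Hz, {seconds:.1f} s")

# The example audio module is configured for 16 kHz mono input.
if channels != 1 or sample_rate != 16000:
    print("Warning: convert this file, or adjust the audio module settings.")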
The following settings configure the frontend module to operate at 16 kHz.
[frontend]
SampleFrequency = 16000
The following settings configure the normalizer module to use a parameter file that is usually available with the language pack. For more information about the IanFile parameter, see the IDOL Speech Server Reference.
[normalizer]
IanFile = $stt.lang.NormFile
The following settings configure the stt module to use the UK English language pack, turn on run-time diagnostics, and set the running mode to fixed, with a mode value of 4. You must also ensure the language pack section for ENUK is configured (see Language Configuration).
[stt]
Lang = ENUK
Diag = True
DiagFile = diag.log
Mode = fixed
ModeValue = 4
Finally, the wout module is configured to write the results in CTM format to the output.ctm file.
[wout]
Format = ctm
Output = output.ctm
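The CTM output is a plain-text file with one recognized word per line. As a rough illustration only, the following Python sketch reads output.ctm assuming the common CTM column layout (source, channel, start time, duration, word, and an optional confidence score); check the IDOL Speech Server Reference for the exact columns that your version writes.

# Minimal CTM reader; assumes whitespace-separated columns of the form:
#   <source> <channel> <start-time> <duration> <word> [<confidence>]
def read_ctm(path):
    words = []
    with open(path, encoding="utf-8") as ctm:
        for line in ctm:
            fields = line.split()
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            start, duration, word = float(fields[2]), float(fields[3]), fields[4]
            confidence = float(fields[5]) if len(fields) > 5 else None
            words.append((start, duration, word, confidence))
    return words

for start, duration, word, confidence in read_ctm("output.ctm"):
    print(f"{start:8.2f}s  {word}")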
The language pack section needs to be set up only once, or it might already have been set up by the installer. The entries in this section change only when a new language pack is installed.
The default configuration is:
[ENUK]
PackDir = U:\\lang\\ENUK
Pack = ENUK-5.0
SampleFrequency = 16000
DNNFile = $params.DNNFile
By default, IDOL Speech Server picks up the value of the DNNFile parameter from Pack and PackDir in the same way as other parameters. Alternatively, you can specify another DNNFile to use at the command line or in the task configuration file. For example, in fixed mode, you might want to use the *-fast.DNN file included in each language pack. This faster version is generally necessary for live or relative mode, where processing speed is critical; in that case, it is used automatically and does not need to be explicitly selected.
For information on how to configure the language pack section, see Configure Language Packs.
HPE recommends (and for 7.0+ versions of language packs, it is compulsory) that you include the following lines in the configuration file for the [frontend] and [normalizer] modules, so that IDOL Speech Server can access the header to determine the quantity and nature of the extracted acoustic feature vectors:
DNNFile = $stt.lang.DNNFile
DNNFileStd = $stt.lang.DNNFileStd
For more information, see the IDOL Speech Server Reference.
The complete configuration file section for the speech-to-text function is shown below. You must declare all schemas and language packs above this section in the tasks configuration file.
[TaskTypes]
0 = MySpeechToText

[Resources]
0 = ENUK

[MySpeechToText]
0 = a, ts ← audio (MONO, input)
1 = f ← frontend (_, a)
2 = nf ← normalizer (_, f)
3 = w ← stt(_, nf)
4 = output ← wout(_, w, ts)

[audio]
SampleFrequency = 16000
File = Speech.wav

[frontend]
SampleFrequency = 16000
DNNFile = $stt.lang.DNNFile
DNNFileStd = $stt.lang.DNNFileStd

[normalizer]
IanFile = $stt.lang.NormFile
DNNFile = $stt.lang.DNNFile
DNNFileStd = $stt.lang.DNNFileStd

[stt]
Lang = ENUK
Diag = True
DiagFile = diag.log
Mode = fixed
ModeValue = 4

[wout]
Format = ctm
Output = output.ctm

[ENUK]
PackDir = U:\\lang\\ENUK
Pack = ENUK-5.0
SampleFrequency = 16000
The action command that runs this speech-to-text task is:

http://SpeechServerhost:ACIport/action=AddTask&Type=MySpeechToText
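You can also send this action programmatically. The following Python sketch uses only the standard library; replace the placeholder host and port with your own Speech Server details. It simply prints the raw XML response from the AddTask action, which typically contains a token that you can use to monitor the task.

from urllib.request import urlopen

# Replace the placeholder host and port with your own Speech Server details.
SPEECH_SERVER = "http://SpeechServerhost:ACIport"

# The same action command as shown above, sent over HTTP.
url = SPEECH_SERVER + "/action=AddTask&Type=MySpeechToText"

with urlopen(url) as response:
    print(response.read().decode("utf-8"))  # raw ACI XML response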