IDOL Speech Server provides four preconfigured speech-to-text tasks:
SpeechToText, which performs speech-to-text on an audio file or stream.
SpeechToTextFilter, which performs speech-to-text on an audio stream and categorizes the audio so that you can remove any sections categorized as music or noise from the resulting .CTM file.
SpeechToTextTelephony, which performs speech-to-text on audio files of telephone conversations. The task also detects and reports dial tones and DTMF dial tones (see DTMF Identification).
ClusterSpeechToTextTel, which clusters the two speakers in a phone call and uses the resulting speaker clusters to improve speech-to-text performance slightly through speaker-sided acoustic normalization. Any telephony artifacts such as dial tones or DTMF tones are included, interspersed with the recognized words.
You can set Punctuation to True in any of these tasks to perform speech-to-text that includes simple sentence-forming punctuation (for example, full stops and initial capital letters) in the .CTM file. The speech-to-text task estimates where each sentence starts and ends; this is a best guess only and is not 100% accurate.
Use the Punctuation parameter only for languages that use the Latin alphabet.
You can use the SpeedBiasLevel parameter in any speech-to-text task to quickly set the balance between speed and accuracy in the decoder. By default, SpeedBiasLevel is set to 0, which leaves the underlying parameter settings untouched (that is, quick configuration of relevant parameters is disabled). To enable the speed configuration, set SpeedBiasLevel to a value between 1 (slowest) and 6 (fastest). The default speech-to-text parameters are equivalent to a speed bias of 2.
You can use the SpeedBiasLevel functionality only when the speech-to-text mode is fixed (see Control Speech-to-Text Process Speed), and only with a DNN-based language resource.
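The allowed SpeedBiasLevel values can be checked on the client side before a task is submitted. The following is a minimal sketch only: the helper name `speed_bias_param` is hypothetical and not part of IDOL Speech Server, and it validates only the documented range (0 to 6). It cannot check that the speech-to-text mode is fixed or that the language resource is DNN-based.

```python
def speed_bias_param(level: int) -> dict:
    """Return the extra task parameter for a given speed bias level.

    Hypothetical client-side helper. 0 is the documented default and
    leaves the decoder settings untouched, so nothing is added for it.
    1 is the slowest setting and 6 is the fastest.
    """
    if isinstance(level, bool) or not isinstance(level, int):
        raise TypeError("SpeedBiasLevel must be an integer")
    if not 0 <= level <= 6:
        raise ValueError("SpeedBiasLevel must be between 0 and 6")
    if level == 0:
        # Default value: omit the parameter entirely.
        return {}
    return {"SpeedBiasLevel": str(level)}


if __name__ == "__main__":
    print(speed_bias_param(4))
```

The returned dictionary is intended to be merged with the other parameters of the action that starts the task.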
You can also use the PunctuateCtm task to add punctuation to any .CTM file. For more information, see the IDOL Speech Server Reference.
To run speech-to-text on an audio file
Send an AddTask action to IDOL Speech Server, and set the following parameters:
Type | The task name. Specify SpeechToText.
File | The audio file to process. To restrict processing to a section of the audio file, set the StartTime and EndTime parameters. For more information, see the IDOL Speech Server Reference.
Out | The file to write the transcription to.
Lang | The language pack to use.
For example:
http://localhost:15000/action=AddTask&Type=SpeechToText&File=C:/myData/Speech.wav&Out=SpeechTranscript.ctm&Lang=ENUS
This action performs the SpeechToText task on the Speech.wav file and writes the results to the SpeechTranscript.ctm file. The Speech.wav file contains U.S. English dialect speech.
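The example action can also be assembled programmatically. The following Python sketch rebuilds the same URL from its parameters. The function name `build_add_task_url` and its defaults are illustrative assumptions, not part of IDOL Speech Server. The host, port, and parameter names are taken from the example above. The sketch only builds the URL; sending it and reading the response are left out.

```python
from urllib.parse import quote, urlencode


def build_add_task_url(audio_file, out_file, lang,
                       host="localhost", port=15000, **extra):
    """Build an AddTask URL for the SpeechToText task.

    Hypothetical helper. Additional parameters (for example
    Punctuation="True") can be passed as keyword arguments.
    """
    params = {
        "Type": "SpeechToText",
        "File": audio_file,
        "Out": out_file,
        "Lang": lang,
    }
    params.update(extra)
    # Keep ':' and '/' unescaped so that file paths stay readable,
    # matching the form of the documented example.
    query = urlencode(params, safe=":/", quote_via=quote)
    # The action is written directly after the slash, as in the example.
    return f"http://{host}:{port}/action=AddTask&{query}"


url = build_add_task_url("C:/myData/Speech.wav", "SpeechTranscript.ctm", "ENUS")
print(url)
```

Optional parameters described in this section, such as Punctuation or SpeedBiasLevel, are appended after the four required ones in the order in which they are passed.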
If you are using a lattice file and want to reduce the lattice output size by including only one sample of each word in a specific window size, you can also set the LatWinSize parameter. See Use a Lattice File and the IDOL Speech Server Reference for more information.
This action returns a token. You can use the token to:
When you use IDOL Speech Server to process multiple data streams or files at the same time, the server might not have enough CPU or memory to process all of them at once, because speech-to-text is very CPU-intensive. To check whether a server has sufficient resources to run a SpeechToText task, send a CheckResources action. See Check Available Resources.