The ClusterSpeechToTextTel task clusters the two speakers in a phone call and uses the resulting speaker clusters to improve speech-to-text performance slightly through speaker-sided acoustic normalization. As before, any telephony artifacts, such as dial tones or DTMF tones, are included in the output, interspersed with the recognized words.
Parameter | Description | Required |
---|---|---|
Type | The task name. Set to ClusterSpeechToTextTel. | Yes |
AppDnnBase | The location of the appResources directory, which contains the DNN and .ian files to use. | |
AppFrameDupl | The balance between performance and speed for audio preprocessing DNN classification. | |
AudioUpsampling | Whether to allow audio upsampling if the input audio has a sample rate too low for the task. | |
ClassWordFile | The path to a list of new words and weightings to add to the language model at load time. | |
Conf | Whether to generate word confidence scores. | |
CustomLm | The custom language model to use. | |
Diag | Whether to generate diagnostic information. | |
DiagFile | The file to write the diagnostic information to. | |
DnnFile | The DNN file to use. | |
DnnScale | The DNN output acoustic score scaling factor. | |
EndTime | The end of an audio section to process. | |
File | The input audio file. | Yes, if InputType is File. |
FixTime | A fixed size for speaker clusters. | |
ForceRecompoundOff | Whether to prevent recompounding. | |
ForceRecompoundOn | Whether to force recompounding. | |
InputType | The type of audio to process (file, binary data, or stream). | |
Lang | The name of a language pack. | Yes |
LatFile | The name of the lattice file that contains word hypotheses. | |
LatScale | The depth of the lattice. | |
LatWinSize | The size (in seconds) of the lattice output window. | |
LatWordFile | A list of words to find. | |
MaxNumSpeakers | The final maximum number of speakers to produce. | |
MergeThresh | The threshold below which to merge clusters. | |
MinNumSpeakers | The final minimum number of speakers to produce. | |
Mode | The algorithm mode for the speech-to-text process. | |
ModeValue | The value of the parameter associated with the speech-to-text algorithm mode. | |
Out | The file to write the results to. | |
PronFile | A file to use to either replace or add alternative pronunciations of words at language load time. | |
Punctuation | Whether to add punctuation to the word data. | |
SilThresh | The threshold between what the task identifies as silence and non-silence. | |
SpeedBiasLevel | The balance between speed and accuracy in the decoder. | |
SpeechThresh | The threshold between speech and non-speech (music or noise). | |
StartTime | The beginning of an audio section to process. | |
SugdInputChannels | The channel layout of the input media file. This parameter does not apply when InputType is Stream. | |
SugdInputFrequency | The sampling rate of the input media file. This parameter does not apply when InputType is Stream. | |
WordBar | Whether to enable word barring. | |
WordBarList | The location of a list of words to be barred. | |

http://localhost:15000/action=AddTask&Type=ClusterSpeechToTextTel&File=C:/myData/Speech.wav&Out=SpeechTranscript.ctm&Lang=ENUS
This action performs the ClusterSpeechToTextTel task on the Speech.wav file and writes the results to the SpeechTranscript.ctm file. The Speech.wav file contains U.S. English dialect speech.
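The example action is a URL whose task parameters are name=value pairs joined with `&`. The following minimal Python sketch assembles the same URL from a dictionary with the standard library's `urllib.parse.urlencode`. It is an illustration only, not part of the product: the function name `build_add_task_url` is hypothetical, and the default host and port are simply copied from the example, so the address of your own server may differ. Passing `quote` as `quote_via` encodes spaces as `%20` rather than `+`, and `safe=":/"` leaves the colon and slashes in the file path unescaped so that the output matches the example exactly.

```python
from urllib.parse import quote, urlencode


def build_add_task_url(params, host="localhost", port=15000):
    """Build an AddTask action URL in the form used by the example above.

    Hypothetical helper: the host and port defaults are copied from the
    example and may not match your server.
    """
    # quote (instead of urlencode's default quote_plus) encodes spaces as %20,
    # and safe=":/" keeps paths such as C:/myData/Speech.wav unescaped.
    query = urlencode(params, safe=":/", quote_via=quote)
    return f"http://{host}:{port}/action=AddTask&{query}"


if __name__ == "__main__":
    url = build_add_task_url({
        "Type": "ClusterSpeechToTextTel",
        "File": "C:/myData/Speech.wav",
        "Out": "SpeechTranscript.ctm",
        "Lang": "ENUS",
    })
    # prints http://localhost:15000/action=AddTask&Type=ClusterSpeechToTextTel&File=C:/myData/Speech.wav&Out=SpeechTranscript.ctm&Lang=ENUS
    print(url)
```

To actually submit the action, the resulting URL could be passed to `urllib.request.urlopen`, assuming a server is listening at that address.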
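The Required column of the parameter table gives three rules: Type and Lang are always required, and File is required when InputType is File. The following minimal client-side sketch checks these rules before an action is sent. It is not part of the product, and the function name `check_task_params` is hypothetical. Parameter names are matched with the exact capitalization shown in the table, and a missing InputType is treated as File. That is an assumption based on the example action, which supplies File without setting InputType. The sketch also flags SugdInputChannels and SugdInputFrequency for stream input, because the table states that they do not apply there. The server itself may simply ignore them.

```python
from typing import Dict, List


def check_task_params(params: Dict[str, str]) -> List[str]:
    """Return a list of problems found in ClusterSpeechToTextTel parameters.

    Hypothetical client-side helper: names are matched with the exact
    capitalization used in the parameter table.
    """
    problems: List[str] = []
    if params.get("Type") != "ClusterSpeechToTextTel":
        problems.append("Type must be ClusterSpeechToTextTel")
    if not params.get("Lang"):
        problems.append("Lang is required")
    # Assumption: when InputType is omitted (as in the example action),
    # the input is treated as a file.
    input_type = params.get("InputType", "File").lower()
    if input_type == "file" and not params.get("File"):
        problems.append("File is required when InputType is File")
    if input_type == "stream":
        # The table says these two parameters do not apply to stream input.
        for name in ("SugdInputChannels", "SugdInputFrequency"):
            if name in params:
                problems.append(f"{name} does not apply when InputType is Stream")
    return problems


if __name__ == "__main__":
    # prints ['Lang is required']
    print(check_task_params({"Type": "ClusterSpeechToTextTel", "File": "C:/myData/Speech.wav"}))
```

An empty list means the parameters pass these particular checks. It does not guarantee that the server will accept the task.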