Control Speech-to-Text Speed

Speech-to-text processing using the IDOL Speech Server stt module can be a very resource-intensive process. In some cases, this is not an issue (for example, if you are batch-processing a large library of audio files, and optimizing recognition accuracy is your only goal). However, in other situations (for example, the processing of speech spoken directly into a microphone), you need a greater degree of control to get the best results.

For this reason, IDOL Speech Server provides three modes that you can choose from to ensure the best possible performance in an acceptable time frame:

- Fixed mode, in which the engine aims for uniform transcription quality across all data, regardless of processing time.
- Relative mode, in which the engine processes data at a uniform rate that you specify.
- Live mode, in which the engine matches its recognition speed to the rate at which audio is streamed to it.

You can choose the mode that you want to use by setting the Mode parameter of the stt module to fixed, relative, or live.

You can also set the ModeValue parameter to control the speed of the speech-to-text process in fixed and relative mode. The default value and the range of accepted values depend on the mode that you select.
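As an illustration of how these action parameters fit together, the following Python sketch builds an AddTask action URL. The helper function is our own illustration; only the action name and the Mode and ModeValue parameters come from the documentation:

```python
from urllib.parse import urlencode

def build_addtask_url(host, port, task_type, lang, out, mode, mode_value=None):
    """Build an IDOL Speech Server AddTask action URL (illustrative helper).

    ModeValue is optional because its default depends on the selected mode.
    """
    params = {"Type": task_type, "Lang": lang, "Out": out, "Mode": mode}
    if mode_value is not None:
        params["ModeValue"] = mode_value
    return f"http://{host}:{port}/action=AddTask&{urlencode(params)}"

# For example, a fixed-mode request:
url = build_addtask_url("localhost", 13000, "WavToText", "ENUK",
                        "Transcript1.ctm", "fixed", 4)
```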

NOTE:

Neither fixed mode nor relative mode should be deemed intrinsically 'faster' or 'slower' than the other; it is perfectly possible for processing to be faster in fixed mode and slower in relative mode.

The following sections describe these processing modes in detail, and explain how you can perform speech recognition quickly on a large audio file by segmenting it and performing speech-to-text in parallel using multiple task managers.

DNN acoustic modeling was added to IDOL Speech Server in the 10.8 release. Although DNN acoustic modeling is in general faster than GMM acoustic modeling, its fastest possible speed is slower. The final section describes the implications of this for relative mode.

Fixed Mode

In fixed mode, the speech-to-text engine attempts to produce a transcription of uniform quality across all your data, regardless of how long it takes.

NOTE:

Noisy or poorly discernible data requires more CPU time, and therefore takes longer to process, than 'clean' data.

In general, HPE recommends that you use fixed mode unless there is a specific reason not to.

The default value for the ModeValue parameter in fixed mode is 4. The minimum (fastest) value is 1, whereas the maximum (slowest) value is 8.

NOTE:

If you reduce the value of the ModeValue parameter to below the default value of 4, processing speed increases, but there is a small loss of accuracy. Similarly, if you increase the ModeValue beyond 4, recognition is significantly slower, and there might or might not be a discernible improvement in recognition accuracy.

Example configuration and action

[stt]
Mode=$params.Mode
ModeValue=$params.ModeValue

http://localhost:13000/action=AddTask&Type=WavToText&Lang=ENUK&Out=Transcript1.ctm&Mode=Fixed&ModeValue=4.0

Relative Mode

In relative mode, IDOL Speech Server aims to process the audio data at a uniform pace, dictated by a target rate that you specify.

The mode value represents the data processing rate relative to real time. If you set ModeValue to 1.0, Speech Server aims to process your speech data in real time; if you set ModeValue to 0.5, Speech Server processes the data twice as fast as real time.

You can specify a mode value between 0.2 and 2.0 times real time. When you use DNN acoustic modeling, DNN propagation is too slow to sustain the fastest rate of 0.2 times real time, so the minimum mode value is higher. You can specify the mode value when you run the action. For example:

http://localhost:13000/action=AddTask&Type=WavToText&Lang=ENUK&Out=Transcript1.ctm&Mode=relative&ModeValue=1.0
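The arithmetic behind the mode value is straightforward: the expected processing time is the audio duration multiplied by the mode value. A small sketch (the function is illustrative, not part of the product):

```python
def expected_processing_seconds(audio_seconds, mode_value):
    """In relative mode, ModeValue is the target processing rate relative
    to real time: 1.0 = real time, 0.5 = twice as fast as real time."""
    if not 0.2 <= mode_value <= 2.0:
        raise ValueError("ModeValue must be between 0.2 and 2.0")
    return audio_seconds * mode_value

# A one-hour file at ModeValue=0.5 targets about 30 minutes of processing.
print(expected_processing_seconds(3600, 0.5))  # 1800.0
```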

NOTE:

Relative mode was designed to be used with relatively simple configuration schemas such as WavToText or StreamToText (for more information, see the IDOL Speech Server Administration Guide and the IDOL Speech Server Reference). In particular, it works best if a single STT module takes nearly 100% of the processing time for the schema; modules such as wav, frontend, and normalizer are largely negligible in this respect.

You can still use relative mode with more complex schemas that include operations such as speaker identification or a second speech-to-text task. However, in these cases there is no longer a direct association between the mode value and the target speed, so you might have to reduce the mode value to achieve your desired speed. For example, if you are performing speaker identification and you therefore expect 90% of the processing time to be used by speech-to-text, set ModeValue to 0.9 to achieve real-time processing.

For more complex schemas, HPE recommends a process of trial and error.
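As a starting point for such trial and error, the adjustment described above can be expressed as a one-line calculation (the function name and range check are our own):

```python
def adjusted_mode_value(target_rate, stt_fraction):
    """For complex schemas, scale the target rate by the fraction of
    processing time expected to be spent in the stt module.

    For example, if speech-to-text accounts for 90% of the work and you
    want real-time processing overall, request ModeValue = 1.0 * 0.9 = 0.9.
    """
    if not 0.0 < stt_fraction <= 1.0:
        raise ValueError("stt_fraction must be in (0, 1]")
    return target_rate * stt_fraction

print(adjusted_mode_value(1.0, 0.9))  # 0.9
```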

Relative Mode Capping

Although the target rate in relative mode is achieved in almost all circumstances, no timing guarantees can be offered, particularly in the case of network or hardware issues.

To maintain quality, there is a minimum amount of recognition effort that IDOL Speech Server always undertakes. This can mean that processing takes longer than your target time. There is also a cap to the maximum effort spent on recognition, which can lead to the recognizer finishing earlier than your target time, or using the saved time for other tasks later on.
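Conceptually, this capping behavior clamps the requested rate between a minimum and a maximum recognition effort. A sketch, with illustrative bounds (the actual limits are internal to Speech Server):

```python
def effective_rate(requested_rate, min_rate, max_rate):
    """Clamp the requested processing rate (a multiple of real time,
    where lower is faster) to the effort limits the recognizer honors.
    The bounds here are illustrative placeholders, not documented values."""
    return max(min_rate, min(requested_rate, max_rate))

# An over-aggressive request is capped to the fastest honored rate,
# so processing takes longer than the requested target time.
print(effective_rate(0.05, 0.2, 2.0))  # 0.2
```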

Use Live Mode for Streaming

Live mode is designed not for file-based speech-to-text, but for data produced and streamed in real time (such as from a microphone).

Live mode is similar to relative mode in that recognition speed (and therefore quality) can vary based on certain criteria. However, whereas in relative mode you specify a target rate, live mode controls the recognition speed based on the speed at which samples are submitted to the speech-to-text process.

If you send data to the server in real time, recognition is performed in real time. If you send data twice as fast as real time, recognition is performed twice as fast as real time, and so on. If the rate changes during processing, the recognition speed changes to match.

NOTE:

If the data streams at a rate that is too fast for the computational resources of your system, recognition accuracy might be impaired. Speech Server tries to perform recognition at a speed that is fast enough to ensure that there is always some audio data waiting to be processed (but not so much that the processing falls behind). For this reason, a common error is to run live mode on a file that is already on disk: because all of the audio is available immediately, this effectively requests instantaneous recognition, with very poor results.
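For a client that wants to feed live mode from a file-like source without triggering this problem, the key is to pace submissions to the audio clock. A minimal sketch, assuming 16 kHz 16-bit mono PCM and a caller-supplied send function (both are assumptions; the actual streaming interface is described in the IDOL Speech Server documentation):

```python
import time

SAMPLE_RATE = 16000   # samples per second (assumed)
BYTES_PER_SAMPLE = 2  # 16-bit mono (assumed)

def chunk_duration_seconds(chunk_bytes):
    """Duration of audio represented by a chunk of raw PCM bytes."""
    return chunk_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)

def stream_in_real_time(chunks, send_chunk, sleep=time.sleep):
    """Send each chunk, then wait for the audio time it represents,
    so the server receives data at (approximately) real time."""
    for chunk in chunks:
        send_chunk(chunk)
        sleep(chunk_duration_seconds(len(chunk)))
```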

If you want to use DNN recognition, the same hardware requirements apply for live mode as for relative mode. Live mode switches between small, medium, and large networks in the same way as relative mode, to maintain the best possible performance in the time available. For more information, see Implications of DNN Acoustic Modeling, below.

To use live mode in live streaming speech-to-text tasks, you must add the Mode configuration parameter to the configuration sections for the stt and stream modules, if it is not already present.

Example configuration

[stream]
Mode=$params.Mode

[stt]
Mode=$params.Mode

This configuration creates a Mode action parameter. To use live mode, set the Mode action parameter to live in a task action that uses the stt and stream modules, such as StreamToText.

Example action

http://localhost:13000/action=AddTask&Type=StreamToText&Lang=ENUK&Out=Transcript1.ctm&Mode=Live

Run a Single Speech-to-Text Task Across Multiple Cores

To reduce the time that it takes to process a single long audio file, you can share the processing across multiple cores in parallel, rather than processing sequentially on a single core. For example, a one-hour audio file that would typically take one hour to transcribe can be split into four tasks and processed in approximately fifteen minutes.

The audio file that you want to process is split into (slightly overlapping) chunks, which are shared out between the IDOL Speech Server task managers (HPE recommends that you configure one task manager for each core). Each running task manager counts as a single instance of IDOL Speech Server for licensing purposes. The number of task managers that you request must not exceed the total number of task managers available (as specified in the configuration file).

Each task manager processes its allocated chunks, and the results from all task managers are combined at the end.
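The splitting step can be pictured as follows. The overlap length and the function itself are illustrative; the server performs this internally:

```python
def split_with_overlap(total_seconds, num_chunks, overlap_seconds=5.0):
    """Split a duration into num_chunks (start, end) spans, each extended
    by a small overlap so no speech is lost at the chunk boundaries.
    The overlap length is an illustrative placeholder."""
    total_seconds = float(total_seconds)
    chunk = total_seconds / num_chunks
    spans = []
    for i in range(num_chunks):
        start = max(0.0, i * chunk - overlap_seconds)
        end = min(total_seconds, (i + 1) * chunk + overlap_seconds)
        spans.append((start, end))
    return spans

# A one-hour file shared across three task managers:
print(split_with_overlap(3600, 3))
# [(0.0, 1205.0), (1195.0, 2405.0), (2395.0, 3600.0)]
```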

NOTE:

Multiple-core processing is supported only in fixed mode; relative and live modes are not supported. In addition, multiple-core processing is not permitted when server queuing is enabled.

The following example action splits the Speech.wav file across three task managers, runs the WavToText task across all three, and merges the results at the end:

http://localhost:13000/action=AddTask&Type=WavToText&File=Speech.wav&Out=Text.ctm&TaskManagers=3

Implications of DNN Acoustic Modeling

The 10.8 version of IDOL Speech Server (with version 6.0 and later language packs) introduced DNN acoustic models to improve speech-to-text accuracy.

Each language pack contains up to three DNN acoustic models of different sizes. All of the models share the same input and output layers, so they are functionally interchangeable.

DNN acoustic modeling in fixed mode

In fixed mode, the default option is to use the largest, most accurate DNN file. Alternatively, you can specify a smaller DNN file to get faster performance, either by using the DnnFile action parameter or by editing the configuration file.

DNN acoustic modeling in relative mode

Because relative mode involves targeting a specific speed, several limitations are in place. To run speech-to-text with DNN acoustic models in relative mode, your hardware must meet certain criteria. In particular, you must use an Intel (or compatible) processor that supports the SSSE3 and SSE4.1 SIMD extensions. If your processor does not support these extensions, real-time operation is still possible without DNN acoustic modeling: set DnnFile to none in the action, or in the configuration file.

Generally speaking, the recognition speed with or without DNN acoustic modeling is similar. However, DNN recognition involves a certain fixed amount of work regardless of the desired speed, so it cannot achieve the same best-case speed. Because of this, there is a higher minimum target speed for relative mode. This minimum target speed depends on several factors, such as your hardware and the size of the available DNN networks. If you target a speed that your hardware cannot deliver, Speech Server returns an error message.

The three DNN files are used interchangeably in relative mode, in a process like the use of gears in a car. When the speech-to-text process starts, Speech Server uses the medium-sized DNN file (where available). If progress is relatively fast compared to the target speed, Speech Server switches to the larger DNN file to improve performance. Alternatively, if progress is slow, Speech Server uses the smaller, faster DNN file instead. Speech Server always uses the most appropriate DNN file to get the best possible recognition performance in the time available.
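The gear analogy can be sketched as a simple selection policy: measure the actual processing rate against the target, then step up or down a network size. The thresholds and names here are illustrative, not Speech Server internals:

```python
DNN_SIZES = ["small", "medium", "large"]  # illustrative names

def next_dnn(current, actual_rate, target_rate, margin=0.1):
    """Pick the next DNN size like shifting gears. Rates are multiples
    of real time, so lower means faster: running comfortably faster than
    the target steps up to a larger (more accurate) network; falling
    behind steps down to a smaller, faster one."""
    i = DNN_SIZES.index(current)
    if actual_rate < target_rate * (1 - margin) and i < len(DNN_SIZES) - 1:
        return DNN_SIZES[i + 1]   # ahead of target: use a larger network
    if actual_rate > target_rate * (1 + margin) and i > 0:
        return DNN_SIZES[i - 1]   # behind target: use a smaller network
    return current
```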

For more information on how to configure the three different DNN acoustic models, refer to the IDOL Speech Server Reference and the IDOL Speech Server Administration Guide.
