Control Speech-to-Text Process Speed

Speech-to-text can run in one of two primary modes: fixed and relative.

In fixed mode, the speech-to-text engine attempts to produce a ‘uniform quality’ transcription, where the transcription is performed to roughly the same quality across all of the speech data, regardless of how long it takes. However, processing speed can fluctuate depending on factors related to the data. For example, noisy or poorly discernible data requires more CPU time to process than ‘clean’ data.

In relative mode, IDOL Speech Server aims to crunch through the audio data at a uniform pace dictated by a specified target rate.

Micro Focus recommends that you use fixed mode for most deployments unless there is a good reason not to. Although relative mode offers timing guarantees, recognition quality can suffer if the CPU cannot keep up with the target rate. However, if you are processing live data, Micro Focus recommends that you use relative mode. For information about how to specify the mode, see Use Live Mode for Streaming.

The default mode is fixed, with a default mode value of 4. You can specify a mode value between 1 and 4, trading speed against accuracy. Specifying a value of 1 results in fast processing, with potentially lower accuracy. Specifying a value of 4 results in the most accurate analysis, but processing can take longer.
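For example, a speech-to-text task request might pass the mode and mode value as action parameters. The following is a minimal sketch only: the action, task type, and parameter names (`AddTask`, `WavToText`, `Mode`, `ModeValue`) and the host and port are assumptions to be checked against your Speech Server reference.

```python
from urllib.parse import urlencode

def build_stt_request(host, port, audio_file, mode="fixed", mode_value=4):
    """Build a hypothetical AddTask action URL for a speech-to-text task.

    mode: "fixed" (integer mode values 1-4) or "relative"
    (mode values 0.5-2.0, a multiple of real time).
    """
    if mode == "fixed" and not 1 <= mode_value <= 4:
        raise ValueError("fixed mode accepts mode values from 1 to 4")
    if mode == "relative" and not 0.5 <= mode_value <= 2.0:
        raise ValueError("relative mode accepts mode values from 0.5 to 2.0")
    params = {
        "Type": "WavToText",   # assumed task type
        "File": audio_file,
        "Mode": mode,
        "ModeValue": mode_value,
    }
    return f"http://{host}:{port}/action=AddTask&{urlencode(params)}"

# The most accurate fixed-mode processing (the default):
url = build_stt_request("localhost", 13000, "meeting.wav", "fixed", 4)
```

The helper simply validates the documented value ranges before building the query string; the real server performs its own validation.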

In relative mode, the mode value can range from 0.5 to 2.0. These values represent the data processing rate compared to real time. A mode value of 1.0 processes the speech data at approximately real time. A mode value of 0.5 processes the data twice as fast as real time. If your target value is relatively high, the process might finish earlier than requested, if Speech Server determines that spending more time on recognition would not improve the results.
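Because the relative mode value is the ratio of processing time to audio duration, you can estimate wall-clock time directly. The following sketch (the helper name is illustrative, not part of the product) shows the arithmetic:

```python
def estimated_processing_minutes(audio_minutes: float, mode_value: float) -> float:
    """Estimate wall-clock processing time in relative mode.

    The mode value is the target ratio of processing time to audio
    duration: 1.0 is approximately real time, 0.5 is twice as fast
    as real time.
    """
    if not 0.5 <= mode_value <= 2.0:
        raise ValueError("relative mode values range from 0.5 to 2.0")
    return audio_minutes * mode_value

# A 60-minute recording at a mode value of 0.5 targets about 30 minutes;
# at 1.0 it targets about 60 minutes.
half_speed = estimated_processing_minutes(60, 0.5)
real_time = estimated_processing_minutes(60, 1.0)
```

Note that this is only the target: as described above, the process might finish earlier if extra recognition time would not improve the results.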

NOTE:

If you want to use the SpeedBiasLevel functionality, you must run speech-to-text in fixed mode. See Run the Task for more information.

IDOL Speech Server versions 10.8 and later, and the 6.0+ versions of the language packs, use DNN acoustic models to improve speech-to-text accuracy. Each language pack contains at least two (typically three) DNN acoustic models of different sizes. In fixed mode, the default option is to use the largest, most accurate DNN file. In relative mode, the default option selects a smaller, faster DNN file.

IDOL Speech Server versions 11.5 and later, and the 9.0+ versions of the language packs, use a new neural network technology to improve recognition accuracy. These language packs contain only one DNN file, which the server uses in both fixed and relative modes. This option is typically faster than the older DNN technology. The new DNN files also reduce the benefit of using small DNN files to improve speech-to-text processing speed, because other options are now available (such as FrameDupl and SpeedBiasLevel).

To change the default settings, use the DNNFile command line parameter, or edit the configuration file.
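For example, you might set DNNFile in the task's configuration section. This is a sketch only: the section name and file name below are illustrative placeholders, and only the DNNFile parameter itself comes from this documentation; check your Speech Server configuration reference for the actual section layout.

```ini
; Illustrative task section name and DNN file name
[MySpeechToTextTask]
DNNFile = mylanguagepack.dnn

; Alternatively, set DNNFile to none to run non-DNN
; speech-to-text (see the caution below):
; DNNFile = none
```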

CAUTION:

You can use DNN acoustic modeling in relative mode only if your DNN files are smaller than a certain size. In addition, you must be using Intel (or compatible) processors that support the SIMD extensions SSSE3 and SSE4.1. If this is not possible, you can set the DNNFile parameter to none to run non-DNN speech-to-text without these hardware limitations.

TIP:

In relative mode, IDOL Speech Server must ensure that it can meet a specified operational speed. It determines a minimum permitted target speed based on the size of the smallest DNN in the language pack, the frame duplication approximation value, and the speed/SIMD capability of the processor.

If you attempt to use a target speed below the allowed minimum, IDOL Speech Server returns an error message, which states the minimum value.
