Speech-to-text can run in one of two primary modes: fixed and relative.
In fixed mode, the speech-to-text engine attempts to produce a ‘uniform quality’ transcription: the entire speech data is transcribed to roughly the same quality, regardless of how long it takes. However, the rate of progress can fluctuate depending on the data. For example, noisy or poorly discernible audio requires more CPU time to process than ‘clean’ audio.
In relative mode, HPE IDOL Speech Server aims to process the audio data at a uniform pace dictated by a specified target rate.
HPE recommends that you use fixed mode for most deployments, unless you have a specific reason not to. Although relative mode offers time guarantees, recognition quality can suffer if the CPU cannot keep up with the target rate. However, if you are processing live data, HPE recommends that you use relative mode. For information about how to specify the mode, see Use Live Mode for Streaming.
The default mode is fixed, with a default mode value of 4. You can specify a mode value between 1 and 4, trading speed against accuracy. A value of 1 results in fast processing, with potentially lower accuracy. A value of 4 results in the most accurate analysis but can take longer to process.
In relative mode, the mode value can range from 0.5 to 2.0. These mode values express the processing time relative to real time. A mode value of 1.0 means that the speech data is processed at approximately real time. A mode value of 0.5 processes the data twice as fast as real time.
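The ranges and the real-time relationship described above can be sketched as a small check. This is only an illustration: the function and constant names are hypothetical and are not part of HPE IDOL Speech Server, and the time estimate assumes that the server keeps up with the target rate.

```python
# Documented mode value ranges; the names here are illustrative, not a server API.
MODE_RANGES = {
    "fixed": (1, 4),         # 1 = fastest, 4 = most accurate (default)
    "relative": (0.5, 2.0),  # processing time as a multiple of real time
}


def validate_mode(mode: str, value: float) -> None:
    """Raise ValueError if the mode/value pair is outside the documented range."""
    if mode not in MODE_RANGES:
        raise ValueError(f"unknown mode: {mode!r}")
    low, high = MODE_RANGES[mode]
    if not low <= value <= high:
        raise ValueError(f"{mode} mode value must be between {low} and {high}")


def target_processing_seconds(audio_seconds: float, mode_value: float) -> float:
    """Relative mode only: target processing time = audio duration x mode value."""
    validate_mode("relative", mode_value)
    return audio_seconds * mode_value
```

For example, under these assumptions a one-hour recording with a relative mode value of 0.5 has a target processing time of about 30 minutes, and the same recording with a mode value of 1.0 takes about an hour.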
Versions of HPE IDOL Speech Server from 10.8 upwards and the 6.0+ versions of the language packs use DNN acoustic models to improve speech-to-text accuracy. Each language pack contains at least two DNN acoustic models of different sizes. In fixed mode, the default option is to use the largest, most accurate DNN file. If you are using relative mode, the default option selects a smaller, faster DNN acoustic file.
To change the default settings, use the DnnFile command line parameter, or edit the configuration file.
Caution: You can use DNN acoustic modeling in relative mode only if your DNN files are smaller than a certain size. In addition, you must be using Intel (or compatible) processors that support the SIMD extensions SSSE3 and SSE4.1. If this is not possible, you can set the DnnFile parameter to none to allow non-DNN speech-to-text without hardware limitations.
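The processor requirement in the caution above can be checked before you choose a DnnFile setting. The following is a minimal sketch that assumes a Linux host, where /proc/cpuinfo lists SSSE3 and SSE4.1 as the flags ssse3 and sse4_1. The function names are hypothetical, and the sketch does not check the DNN file size limit, which is not specified here.

```python
from typing import Optional, Set

# Linux /proc/cpuinfo spellings of the SSSE3 and SSE4.1 extensions.
REQUIRED_FLAGS = {"ssse3", "sse4_1"}


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the CPU feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def dnn_file_override(cpu_flags: Set[str]) -> Optional[str]:
    """Suggest a DnnFile value for the given CPU flags.

    Returns "none" (disable DNN) when a required extension is missing,
    or None to mean "keep the default DnnFile setting".
    """
    if REQUIRED_FLAGS <= cpu_flags:
        return None
    return "none"
```

On a Linux host you might pass the contents of /proc/cpuinfo to parse_cpu_flags and, if dnn_file_override returns "none", set the DnnFile parameter to none as described above.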
Tip: DNN recognition is in general similar in speed to GMM recognition, but has a lower maximum speed, because DNN propagation requires all nodes to be visited. Because of this, the minimum mode value with DNN recognition is likely to be higher than is possible with GMM recognition. This minimum value depends on the size of the smallest DNN in the language pack, but is typically between 0.5 and 0.9.