Phonetic Phrase Search

Phonetic phrase search is a fast, approximate method for finding sections of audio data that sound similar to the phrase being searched for. It typically runs many times faster than performing speech-to-text and then searching the transcript.

HPE IDOL Speech Server first processes audio data to extract phonemes, and produces a time track charting the position of phonemes in the audio file. The time tracks for different phonemes overlap; as a result, each time position can appear in more than one phoneme time track.
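
To illustrate what overlapping time tracks mean, here is a minimal sketch in Python. The data layout (a dictionary that maps each phoneme label to a list of start and end times in seconds), the phoneme labels, the timings, and the function name phonemes_at are all assumptions made for illustration; they do not describe the internal format that the server uses.

```python
# Hypothetical layout: one time track per phoneme, each a list of
# (start_seconds, end_seconds) intervals where that phoneme may occur.
time_tracks = {
    "k":  [(0.00, 0.12), (1.40, 1.52)],
    "g":  [(0.02, 0.13)],               # overlaps the first "k" interval
    "ae": [(0.10, 0.25)],
    "t":  [(0.24, 0.33), (1.50, 1.60)],
}


def phonemes_at(tracks, t):
    """Return the sorted phoneme labels whose tracks cover time t."""
    return sorted(
        label
        for label, intervals in tracks.items()
        if any(start <= t <= end for start, end in intervals)
    )


print(phonemes_at(time_tracks, 0.05))  # ['g', 'k']
print(phonemes_at(time_tracks, 0.11))  # ['ae', 'g', 'k']
```

In this toy data, the position at 0.05 seconds belongs to two tracks and the position at 0.11 seconds belongs to three, which is the kind of overlap described above.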

HPE IDOL Speech Server can then search through the time tracks to match selected words or phrases. It identifies possible time positions for the item being searched, together with a hit score.
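
The following Python sketch shows one simple way that a search over time tracks could produce time positions and hit scores. It is not the algorithm that the server uses. The segment format, the max_gap tolerance, the function name search_phrase, and the scoring rule (the mean of per-segment confidences) are all assumptions, and the sketch matches phoneme labels exactly rather than approximately.

```python
from typing import Dict, List, Tuple

# Hypothetical segment: (start_seconds, end_seconds, confidence in [0, 1]).
Segment = Tuple[float, float, float]


def search_phrase(
    tracks: Dict[str, List[Segment]],
    phonemes: List[str],
    max_gap: float = 0.05,
) -> List[Tuple[float, float, float]]:
    """Return (start, end, hit_score) for each chain of segments that
    follows the phoneme sequence with gaps of at most max_gap seconds."""
    if not phonemes:
        return []
    hits = []
    for first in tracks.get(phonemes[0], []):
        chain = [first]
        for label in phonemes[1:]:
            prev_end = chain[-1][1]
            # Candidate segments that begin close to where the previous one ended.
            candidates = [
                seg for seg in tracks.get(label, [])
                if abs(seg[0] - prev_end) <= max_gap
            ]
            if not candidates:
                break  # the chain cannot be completed from this start
            # Keep the most confident continuation.
            chain.append(max(candidates, key=lambda seg: seg[2]))
        else:
            # Assumed scoring rule: mean confidence over the chain.
            score = sum(seg[2] for seg in chain) / len(chain)
            hits.append((chain[0][0], chain[-1][1], round(score, 3)))
    return hits


# Made-up tracks for the phoneme sequence k-ae-t.
tracks = {
    "k":  [(0.00, 0.12, 0.9), (1.40, 1.52, 0.6)],
    "ae": [(0.10, 0.25, 0.8), (1.50, 1.66, 0.3)],
    "t":  [(0.24, 0.33, 0.7), (1.65, 1.75, 0.6)],
}
hits = search_phrase(tracks, ["k", "ae", "t"])
print(hits)  # [(0.0, 0.33, 0.8), (1.4, 1.75, 0.5)]
```

Each result pairs a candidate time span with a score, so a later step can keep or discard hits by comparing the score with a threshold.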

Lowering the hit score threshold increases the number of identification hits but also increases the number of false positive results, leading to an increased recall rate but decreased precision. Raising the hit score threshold has the opposite effect. You can use the hit scores to trade off false positives against false negatives.
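
The trade-off can be made concrete with a small Python sketch. The hit scores and the true/false labels below are invented for illustration, and the function name precision_recall is hypothetical; in practice the labels would come from checking each hit against the audio.

```python
# Hypothetical search hits: (hit_score, is_true_match).
hits = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]
# Assumes every true occurrence in the audio received some hit score.
total_true = sum(1 for _, ok in hits if ok)


def precision_recall(hits, threshold, total_true):
    """Precision and recall for the hits scoring at or above threshold."""
    kept = [ok for score, ok in hits if score >= threshold]
    true_pos = sum(kept)
    # With no hits kept, precision is undefined; report 1.0 by convention.
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_true if total_true else 0.0
    return precision, recall


for threshold in (0.9, 0.7, 0.5, 0.3):
    p, r = precision_recall(hits, threshold, total_true)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
# threshold=0.9 precision=1.00 recall=0.50
# threshold=0.7 precision=0.75 recall=0.75
# threshold=0.5 precision=0.67 recall=1.00
# threshold=0.3 precision=0.50 recall=1.00
```

As the threshold drops from 0.9 to 0.3 in this toy data, recall rises from 0.50 to 1.00 while precision falls from 1.00 to 0.50.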

You can also perform the search in two stages, where an initial action processes the audio file to create the time tracks and a second, separate action searches these time tracks. For details of both actions, see Phonetic Phrase Search.

This table compares phonetic phrase search and speech-to-text based search.

Phonetic search | Speech-to-text and search
Search usually occurs in two stages: the first stage creates the phoneme time track, the second stage searches the time track for phoneme sequences that resemble the specified phrase or phrases. | Search usually occurs in two stages: the first stage is speech-to-text, the second stage searches the transcript.
The first stage runs many times faster than speech-to-text. The second stage runs about 200-600 times faster than real time. | Speech-to-text runs at real time, or a few times faster than real time. The second stage is instantaneous.
The first stage is regarded as the ‘ingestion’ stage, and occurs only once for each set of audio data. It is independent of the search stage. | The first stage is regarded as the ‘ingestion’ stage, and occurs only once for each set of audio data. It is independent of the search stage.
Can generate many false positive results, meaning that the search returns phoneme sequences that do not match the searched-for phrase. However, you can alter the hit score threshold to alter the rate of false positives and false negatives, to an extent. | Can generate many false negative results, meaning that the search can miss some instances of the searched-for phrase that are present in the data. There is little scope for altering this.
The non-specificity of phonemes contributes to the false positives obtained. The search does not use linguistic information. | The language model used in speech-to-text means that linguistic information is used in the search process.
Generally chosen when there is a large quantity of data to search, or when audio quality is too poor to obtain good speech-to-text search results. | Generally preferred when audio quality is reasonable and there is hardware available to process the data.
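
To put the quoted speeds in perspective, the following Python sketch works through the arithmetic for an archive of 100 hours of audio. The archive size and the function name processing_minutes are assumptions; the 200 to 600 range and the real-time figure come from the comparison above. The time taken by the first stage of phonetic search is not included, because the comparison does not quantify it.

```python
def processing_minutes(audio_hours, speedup):
    """Minutes needed to process audio_hours of audio at speedup x real time."""
    return audio_hours * 60 / speedup


audio_hours = 100  # assumed archive size

# Second stage of phonetic search: about 200 to 600 times faster than real time.
slowest = processing_minutes(audio_hours, 200)
fastest = processing_minutes(audio_hours, 600)
print(f"Phonetic search stage: {fastest:.0f} to {slowest:.0f} minutes")
# Phonetic search stage: 10 to 30 minutes

# Speech-to-text at exactly real time (a speedup of 1) for the same archive.
stt_hours = processing_minutes(audio_hours, 1) / 60
print(f"Speech-to-text at real time: {stt_hours:.0f} hours")
# Speech-to-text at real time: 100 hours
```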

Phonetic phrase search is language dependent and requires an appropriate language pack.

Adapting the acoustic model (see Acoustic Models) can improve phonetic phrase search performance.

