Speaker Identification

HPE IDOL Speech Server can perform speaker identification, and also provides some processing toward speaker diarization. It does not perform speaker verification.

The speaker identification system also includes a rejection methodology for open-set identification, in the form of score thresholds for each speaker. A hit is considered genuine only when the template scores above this threshold for a section. Training these thresholds uses speech data from the speaker (positive examples) as well as data from a range of 'other' speakers (negative examples). HPE recommends that the amount of 'other' speaker training data totals at least 60 minutes.
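The open-set decision described above can be sketched as follows. This is an illustrative example only: the speaker names, scores, and threshold values are made up, and the function is not part of the IDOL Speech Server API.

```python
# Hypothetical sketch of open-set speaker identification with
# per-speaker rejection thresholds. All names and values are
# illustrative, not IDOL Speech Server output.

def identify_speaker(scores, thresholds):
    """Return the best-scoring speaker whose score exceeds that
    speaker's own threshold, or None (rejection) if no template
    passes -- i.e. the section is treated as an unknown speaker."""
    best = None
    for speaker, score in scores.items():
        if score > thresholds[speaker]:
            if best is None or score > scores[best]:
                best = speaker
    return best

# Per-speaker thresholds, trained from positive and negative examples.
thresholds = {"alice": 1.8, "bob": 1.6}

print(identify_speaker({"alice": 2.1, "bob": 1.4}, thresholds))  # alice
print(identify_speaker({"alice": 1.0, "bob": 1.2}, thresholds))  # None
```

Because each template has its own threshold, a section in which no template scores highly is rejected rather than forced onto the closest speaker.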

Speaker identification involves the components and considerations described below.

The base speaker ID pack that is included in the HPE IDOL Speech Server installation contains a Universal Background Model (UBM). HPE IDOL Speech Server uses the UBM file as the base to train speaker templates. Because the template models are trained from an established base model, the template is reliable even when training data is sparse. The UBM is also used in calculating the speaker scores, leading to more consistent score ranges.
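The robustness of UBM-based training with sparse data can be illustrated with a simplified mean-adaptation sketch, in the style of MAP adaptation for Gaussian mixture models. This is a conceptual example, not the IDOL Speech Server implementation; the relevance factor `r` is an assumed illustrative value.

```python
# Minimal sketch of why training a template from a UBM is reliable
# even with sparse data: the adapted mean interpolates between the
# UBM mean and the sample mean, weighted by how many frames of
# training data were observed. The relevance factor r is illustrative.

def map_adapt_mean(ubm_mean, sample_sum, n_frames, r=16.0):
    """MAP-adapt a single Gaussian mean. With little data the result
    stays near the UBM mean; with lots of data it approaches the
    observed sample mean."""
    if n_frames <= 0:
        return ubm_mean
    alpha = n_frames / (n_frames + r)          # data-dependent weight
    sample_mean = sample_sum / n_frames
    return alpha * sample_mean + (1 - alpha) * ubm_mean

# Sparse data (4 frames): the adapted mean stays close to the UBM mean 0.0.
print(map_adapt_mean(0.0, 4.0, 4))       # 0.2
# Plenty of data (1600 frames): the mean approaches the sample mean 1.0.
print(map_adapt_mean(0.0, 1600.0, 1600))
```

Because every template is anchored to the same base model, scores computed against different templates also fall in more consistent ranges, which is what makes the per-speaker thresholds usable.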

Although you can train a speaker template from relatively little data, HPE recommends that you train with at least five minutes of sample audio for each speaker, and more if available. Use good quality audio samples that contain only the speaker’s voice and no significant background noise. The spoken content can contain any vocabulary.

HPE recommends that you create no more than 100 speaker templates. Although there is no fixed limit to the number of speakers, larger speaker sets are likely to lead to an increase in classification errors.

Speaker identification is text independent, meaning that it can identify speakers speaking any content.
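One common way text-independent scoring works, sketched below under the assumption of a log-likelihood-ratio scheme against the UBM (the source does not specify IDOL's exact scoring formula): the score for a section depends on per-frame acoustic likelihoods, not on the words spoken.

```python
# Illustrative sketch of text-independent scoring: the score for a
# speaker template is the average per-frame log-likelihood ratio of
# the template model against the UBM, so it reflects voice
# characteristics rather than vocabulary. Frame values are made up.

def segment_score(frame_ll_template, frame_ll_ubm):
    """Average log-likelihood ratio (template minus UBM) over all
    frames of a section."""
    ratios = [t - u for t, u in zip(frame_ll_template, frame_ll_ubm)]
    return sum(ratios) / len(ratios)

# Positive score: the template explains the frames better than the UBM.
print(segment_score([-1.0, -1.2, -0.9], [-1.5, -1.4, -1.3]))
```

Normalizing against the UBM in this way is also why score ranges stay comparable across speakers, as noted above.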

In some respects, speaker identification is opposite in nature to speaker-independent speech-to-text. In the former, HPE IDOL Speech Server tries to identify the differences between speakers' voices. In the latter, HPE IDOL Speech Server attempts to remove individual speaker differences so that the speech-to-text is robust across multiple speakers.

Several factors affect speaker identification performance.

