Use Speaker Clustering

You can use speaker clustering to segment a speech waveform and separate it into individual speakers. HPE IDOL Speech Server produces a timed sequence of labels that correspond to speaker assignments.

HPE IDOL Speech Server provides several speaker clustering tasks.

For more information on the speaker clustering tasks, see the HPE IDOL Speech Server Reference.

The SplitSpeech Process

The key process in clustering speech is SplitSpeech. This process takes an initial set of speech segments as its input.

The SplitSpeech process uses an agglomerative algorithm that repeatedly finds the best two segment clusters to merge, and merges them. This step is repeated until all potential merges fail a Bayesian Information Criterion (BIC) threshold check. The result is a smaller set of acoustically homogeneous speaker clusters.
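
The following Python sketch illustrates the general shape of such an agglomerative, BIC-based merging loop. It is not HPE IDOL Speech Server code: the feature representation, the full-covariance Gaussian model, and the function names (bic_merge_score, split_speech) are illustrative assumptions.

```python
import numpy as np

def bic_merge_score(x, y, penalty_weight=1.0):
    """Return the negated delta-BIC for merging two clusters of acoustic
    feature vectors. Positive scores mean a single merged Gaussian models
    the data at least as well as two separate ones (after the complexity
    penalty), so the merge is acceptable."""
    n_x, n_y = len(x), len(y)
    z = np.vstack([x, y])
    n, d = z.shape

    def logdet(data):
        # bias=True keeps single-frame clusters well-defined; the small
        # ridge keeps the covariance matrix invertible.
        cov = np.cov(data, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Log-likelihood gain from modelling the two clusters separately...
    gain = 0.5 * (n * logdet(z) - n_x * logdet(x) - n_y * logdet(y))
    # ...against a BIC penalty for the extra model parameters.
    n_params = d + d * (d + 1) / 2.0
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return penalty - gain

def split_speech(segments, merge_thresh=0.0, min_speakers=1, max_speakers=None):
    """Greedily merge the best-scoring pair of segment clusters until
    every candidate merge fails the BIC threshold check."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > min_speakers:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs,
                   key=lambda p: bic_merge_score(clusters[p[0]], clusters[p[1]]))
        over_max = max_speakers is not None and len(clusters) > max_speakers
        if bic_merge_score(clusters[i], clusters[j]) <= merge_thresh and not over_max:
            break  # all remaining merges fail the threshold check
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

Each input segment is assumed to be a (frames x dimensions) array of acoustic feature vectors; raising merge_thresh in this sketch makes merging harder and so tends to leave more clusters.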

Control the Number of Speakers

You can specify the minimum and maximum number of speakers to produce. For example, if you know that a telephone call consists of two speakers, you can set both the MinNumSpeakers and MaxNumSpeakers parameters to 2 to guarantee that the process produces exactly two speakers.

Alternatively, you can let the algorithm determine the number of speakers based on a sensitivity threshold. In this case, change the MergeThresh parameter from its default value of 0.0 to bias the process toward producing more or fewer speakers.
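
As an illustration, the following Python snippet shows how you might pass these parameters when submitting a clustering task over the server's HTTP (ACI) interface. The host, port, task type name (ClusterSpeech), and input file are assumptions for this sketch; check the HPE IDOL Speech Server Reference for the exact task name and parameter syntax.

```python
import requests

# Assumed host/port; verify against your server configuration.
SPEECH_SERVER = "http://localhost:13000"

params = {
    "Action": "AddTask",
    "Type": "ClusterSpeech",       # assumed name of the clustering task
    "File": "call_recording.wav",  # hypothetical input file
    # A telephone call with exactly two participants:
    "MinNumSpeakers": 2,
    "MaxNumSpeakers": 2,
    # Alternatively, leave the speaker range open and adjust the
    # sensitivity threshold instead (default 0.0):
    # "MergeThresh": 0.2,
}

response = requests.get(SPEECH_SERVER, params=params)
print(response.text)  # the server's response identifies the submitted task
```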

For more information on these parameters, see the HPE IDOL Speech Server Reference.

Processing Time

The recursive process of splitting speech can be very resource-intensive. If you are processing audio files longer than 10 minutes that have a consistent set of speakers, HPE recommends that you crystallize the speaker information after approximately 10 minutes. For example, if your audio file is 30 minutes long and you crystallize the speaker information after 5 minutes, the process clusters the speakers in the first 5 minutes of the file. Any subsequent speech segments are then clustered into one of the speakers identified in the first 5 minutes, rather than being assigned to new speakers. Crystallization therefore makes the classification of later speech segments faster.

You can also configure processing depending on whether accuracy or speed is more important. Full covariance matrices are more accurate but slower; if processing speed matters more than accuracy, you can use diagonal covariance matrices instead.
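
For example, a task request might combine the two options as follows. This is a sketch only: the task type name, the FixTime unit (assumed here to be seconds), and the boolean format for DiagCov are assumptions to verify against the Reference.

```python
import requests

params = {
    "Action": "AddTask",
    "Type": "ClusterSpeech",  # assumed name of the clustering task
    "File": "meeting.wav",    # hypothetical 30-minute recording
    "DiagCov": "True",        # diagonal covariances: faster, less accurate
    "FixTime": 600,           # crystallize speakers after ~10 minutes (assumed seconds)
}

print(requests.get("http://localhost:13000", params=params).text)
```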

For more information on using the DiagCov and FixTime parameters, see the HPE IDOL Speech Server Reference.
