After you have trained a set of speaker templates, you can analyze audio to identify any sections where the trained speakers are present. To process an audio file, use the SpkIdEvalWav task. To process an audio stream, use the SpkIdEvalStream task.
You can specify the speaker templates to use either as a list, in the TemplateList parameter, or as a template set, in the TemplateSet parameter. If you do not set any templates, IDOL Speech Server performs speaker segmentation and gender identification, but does not apply speaker labels.
To identify speakers in an audio file
Send an AddTask action to IDOL Speech Server, and set the following parameters:
| Parameter | Description |
| --- | --- |
| Type | The task name. Set to SpkIdEvalWav. |
| File | The audio file to process. |
| TemplateSet | The speaker template set file to use. |
| TemplateList | A list file that specifies a set of templates to use (if a template set file is not specified in the TemplateSet parameter). |
| ClosedSet | Whether this is a closed-set test. |
| Out | The file to write the speaker identification results to. |
You can set additional parameters. For details of the optional parameters, see the IDOL Speech Server Reference.
For example:
http://localhost:15000/action=AddTask&Type=SpkIdEvalWav&File=C:\Data\Speech.wav&TemplateSet=speakers.ats&ClosedSet=False&Out=results.ctm
This action uses port 15000 to instruct IDOL Speech Server, which is located on the local machine, to search the Speech.wav file for speakers based on the template set file speakers.ats, and to write the identification results to the results.ctm file. Because the test is set to be open-set, IDOL Speech Server marks sections where no speaker scores above their respective thresholds as Unknown_.
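The action URL above can also be assembled programmatically before it is sent. The following sketch uses only the Python standard library to build the same AddTask request string; the helper name build_addtask_url is ours (not part of the product), and the host and port are taken from the example, so adjust them for a real deployment:

```python
from urllib.parse import urlencode

def build_addtask_url(host, port, **params):
    """Build an IDOL Speech Server AddTask action URL from keyword parameters.

    IDOL actions place the action name directly in the path, with the
    remaining parameters appended as &-separated pairs.
    """
    return f"http://{host}:{port}/action=AddTask&{urlencode(params)}"

# Reproduce the SpkIdEvalWav example from this section.
url = build_addtask_url(
    "localhost", 15000,
    Type="SpkIdEvalWav",
    File=r"C:\Data\Speech.wav",   # backslashes and colon are percent-encoded
    TemplateSet="speakers.ats",
    ClosedSet="False",
    Out="results.ctm",
)
print(url)
```

Note that urlencode percent-encodes the Windows path, which a browser would otherwise do implicitly when you paste the raw URL.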
This action returns a token. You can use the token to check the status of the task and to retrieve its results.
IDOL Speech Server supports two speaker identification output formats: CTM and XML.
The following example shows CTM output produced by the SpkIdEvalWav task.
1  A  0.000   0.520   Unknown_  NonSpeech_  0.000
1  A  0.520   10.030  Brown     MALE        3.540
1  A  10.550  0.080   Unknown_  NonSpeech_  0.000
1  A  10.630  9.460   Unknown_  FEMALE      0.000
1  A  20.090  6.150   Smith     MALE        6.983
From left to right, the columns in the CTM file contain:
- the channel ID (1)
- a label for the audio channel (A)
- the start time of the segment, in seconds
- the duration of the segment, in seconds
- the name of the identified speaker (Unknown_ if no trained speaker was identified)
- the gender of the speaker (NonSpeech_, if a non-speech segment)
- the identification score (0.000 if non-speech or an unknown speaker)

Note: The score for an identified speaker represents how well the processed speech matches the template. Scores can be negative or positive depending on the type of score normalization used, but in all cases a higher value represents a score that is closer to the model.
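Because each CTM row is whitespace-delimited, the results file is straightforward to post-process. The following Python sketch splits one row into typed fields; the parse_ctm_line helper and the SidRecord field names are ours, chosen to match the column descriptions above, not identifiers defined by the product:

```python
from typing import NamedTuple

class SidRecord(NamedTuple):
    channel: str    # channel ID (first CTM column)
    track: str      # audio channel label (second CTM column)
    start: float    # segment start time, seconds
    duration: float # segment duration, seconds
    speaker: str    # identified speaker, or Unknown_
    gender: str     # MALE, FEMALE, or NonSpeech_
    score: float    # identification score (0.000 for non-speech/unknown)

def parse_ctm_line(line: str) -> SidRecord:
    """Split one whitespace-delimited CTM result row into its seven columns."""
    ch, tr, start, dur, speaker, gender, score = line.split()
    return SidRecord(ch, tr, float(start), float(dur),
                     speaker, gender, float(score))

rec = parse_ctm_line("1 A 0.520 10.030 Brown MALE 3.540")
# The segment end time is start + duration (10.550 s for this row).
end = rec.start + rec.duration
```

This also makes it easy to, for example, keep only segments attributed to a named speaker by filtering on rec.speaker != "Unknown_".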
The following example shows XML output with the mode set to default:
<sid_transcript>
  <sid_record>
    <start>0.000</start>
    <end>0.520</end>
    <label>Unknown_</label>
    <gender>NonSpeech_</gender>
    <score>0.000</score>
  </sid_record>
  <sid_record>
    <start>0.520</start>
    <end>10.550</end>
    <label>Brown</label>
    <gender>MALE</gender>
    <score>3.540</score>
  </sid_record>
</sid_transcript>
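Because the XML output is well-formed, it can be consumed with any standard XML library. Here is a minimal sketch using Python's xml.etree; the known_speakers helper is ours, and the sample document is abbreviated from the output above:

```python
import xml.etree.ElementTree as ET

# Abbreviated sid_transcript output, as produced by a speaker
# identification task in XML mode.
SID_XML = """<sid_transcript>
  <sid_record>
    <start>0.520</start>
    <end>10.550</end>
    <label>Brown</label>
    <gender>MALE</gender>
    <score>3.540</score>
  </sid_record>
</sid_transcript>"""

def known_speakers(xml_text):
    """Return (label, start, end) for every record with an identified speaker."""
    root = ET.fromstring(xml_text)
    results = []
    for rec in root.iter("sid_record"):
        label = rec.findtext("label")
        if label != "Unknown_":  # skip unidentified and non-speech segments
            results.append((label,
                            float(rec.findtext("start")),
                            float(rec.findtext("end"))))
    return results

print(known_speakers(SID_XML))
```

Unlike the CTM rows, the XML records carry an explicit end time rather than a duration, so no arithmetic is needed to recover segment boundaries.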