The process for running language identification tasks is very similar in every mode. This section therefore focuses on segmented identification, and describes significant differences for other modes where appropriate.
To identify the languages in streamed audio, use the LangId task, with LIDMode set to the appropriate language identification mode. For more information about this standard task, see the IDOL Speech Server Reference.
Use the following procedure to identify the languages in an audio file.
To identify languages in an audio file
Create a list that contains the file names (including file extensions) of the classifiers to use.
For more information about IDOL Speech Server's list manager, see Create and Manage Lists.
Send an AddTask action to IDOL Speech Server, and set the following parameters:
Type | The task name. Set to LangId.
LidMode | The mode to use. Set to Segmented for segmented mode (this is the default), Boundary for boundary mode, or Cumulative for cumulative mode.
File | The audio file to process. To restrict processing to a section of the audio file, set the
Out | The file to write the language identification results to.
If you want to change the audio sample rate, or if you want to use your own custom classifiers, you must also set the ClassList parameter. You might also need to specify the ClassPath parameter, depending on the location of the classifier files. See the IDOL Speech Server Reference for more information.
LangList configuration parameter in the langid module.
If you want to use open set language identification, you must also set the ClosedSet parameter to False. For more information about open set language identification, see Open Set Language Identification and the IDOL Speech Server Reference.
For example:
http://localhost:15000/action=AddTask&Type=LangId&LIDMode=Segmented&File=C:\Data\Speech.wav&ClassList=ListManager\OptClassSet&ClassPath=C:\LangID\&Out=SpeechLang1.ctm
This action identifies languages in the Speech.wav file using the language classifiers specified in the OptClassSet list, and writes the identification results to the SpeechLang1.ctm file.
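The example request can also be assembled programmatically. The following is a minimal Python sketch, assuming only the parameters shown above; the helper name build_langid_url is hypothetical and is not part of IDOL Speech Server. It percent-encodes the parameter values (for example, the backslashes in Windows paths), which the plain example URL does not show, and it only builds the URL without sending it.

```python
from urllib.parse import urlencode


def build_langid_url(host, port, audio_file, out_file,
                     lid_mode="Segmented", class_list=None, class_path=None):
    """Build an AddTask URL for the LangId task (hypothetical helper)."""
    params = {
        "action": "AddTask",
        "Type": "LangId",
        "LIDMode": lid_mode,   # Segmented, Boundary, or Cumulative
        "File": audio_file,
        "Out": out_file,
    }
    # ClassList and ClassPath are only needed in the cases described above.
    if class_list:
        params["ClassList"] = class_list
    if class_path:
        params["ClassPath"] = class_path
    # urlencode percent-encodes characters such as ':' and '\' in the values.
    return f"http://{host}:{port}/" + urlencode(params)


url = build_langid_url(
    "localhost", 15000,
    audio_file=r"C:\Data\Speech.wav",
    out_file="SpeechLang1.ctm",
    class_list=r"ListManager\OptClassSet",
    class_path="C:\\LangID\\",
)
print(url)
```

Keeping the optional parameters out of the request unless they are supplied mirrors the text above, where ClassList and ClassPath are required only for custom classifiers or a changed sample rate.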
This action returns a token. You can use the token to:
IDOL Speech Server displays the results in XML format in your web browser. You can also open the .ctm file from the configured IDOL Speech Server temporary directory (or another location if you specified a path in the Out parameter).
The following is an example of the .ctm output produced by the LangId task in Segmented mode.
1 | L1 | 0.00 | 30.58 | English | 1.000 | 1.252
1 | L2 | 0.00 | 30.58 | German | 0.686 | 1.252
1 | L3 | 0.00 | 30.58 | French | 0.550 | 1.252
1 | L1 | 30.58 | 28.30 | German | 1.000 | 1.306
1 | L2 | 30.58 | 28.30 | English | 0.562 | 1.306
1 | L3 | 30.58 | 28.30 | Italian | 0.517 | 1.306
1 | L1 | 58.88 | 31.12 | English | 1.000 | 1.295
1 | L2 | 58.88 | 31.12 | French | 0.680 | 1.295
1 | L3 | 58.88 | 31.12 | German | 0.511 | 1.295
From left to right, the columns in the .ctm file contain:
A column whose value is 1 in this example
The rank of the result (L1 is the top result, L2 the next best, and so on)
The start time of the segment, in seconds
The duration of the segment, in seconds
The identified language
The score (0.0 to 1.0 where applicable; otherwise a log score is reported)
The confidence (1.0 and above; the higher the score, the more confident the system is that L1 is the correct answer)
The example shows a 90-second file being recognized in segments, each approximately 30 seconds in duration. For the first segment, English is identified as the most likely language (L1), followed by German (L2) and French (L3). For the next segment, German has the highest score. For the final segment, English has the highest score again.
In Segmented mode, it is common to see different results for each segment, because the language might change throughout the file. Cumulative mode assesses the most dominant language across the whole file, so you would not expect to see these changes.
The following example shows some of the same information displayed in XML format.
<lid_transcript>
  <lid_record>
    <start>0.000</start>
    <end>30.580</end>
    <label>English</label>
    <score>1.000</score>
    <confidence>1.252</confidence>
    <rank>1</rank>
  </lid_record>
  <lid_record>
    <start>0.000</start>
    <end>30.580</end>
    <label>German</label>
    <score>0.686</score>
    <confidence>1.252</confidence>
    <rank>2</rank>
  </lid_record>
</lid_transcript>
This output format is common to the Segmented and Cumulative modes. The output format for Boundary mode is similar, but the time points occur whenever a language change is detected, instead of after a fixed time period.
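The XML results can be read with any XML parser. The following Python sketch uses the standard library to extract the records from the example above. It assumes that the lid_record elements appear somewhere under the response root, as in the example; the function name read_lid_records is hypothetical, and a real response might wrap the transcript in additional elements that are not shown here.

```python
import xml.etree.ElementTree as ET

# The XML example from this section.
SAMPLE_XML = """
<lid_transcript>
  <lid_record>
    <start>0.000</start><end>30.580</end><label>English</label>
    <score>1.000</score><confidence>1.252</confidence><rank>1</rank>
  </lid_record>
  <lid_record>
    <start>0.000</start><end>30.580</end><label>German</label>
    <score>0.686</score><confidence>1.252</confidence><rank>2</rank>
  </lid_record>
</lid_transcript>
"""


def read_lid_records(xml_text):
    """Return one dictionary per lid_record element."""
    root = ET.fromstring(xml_text.strip())
    records = []
    # iter() finds lid_record elements at any depth below the root.
    for rec in root.iter("lid_record"):
        records.append({
            "start": float(rec.findtext("start")),
            "end": float(rec.findtext("end")),
            "label": rec.findtext("label"),
            "score": float(rec.findtext("score")),
            "confidence": float(rec.findtext("confidence")),
            "rank": int(rec.findtext("rank")),
        })
    return records


records = read_lid_records(SAMPLE_XML)
best = min(records, key=lambda r: r["rank"])  # rank 1 is the top result
print(best["label"], best["score"])
```

Because Boundary mode uses a similar format with different time points, the same parsing approach should carry over, although that has not been checked against Boundary output here.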