The process of running language identification tasks is very similar regardless of which mode it is run in. As such, the bulk of this section focuses on segmented identification, with significant differences for other modes described where appropriate.
To identify the languages in streamed audio, use one of the LangIdBndStream, LangIdCumStream, or LangIdSegStream tasks, depending on the mode that you want to use. If you already have language identification feature (.lif) files, use one of the LangIdBndLif, LangIdCumLif, or LangIdSegLif tasks. For details about these standard tasks, see the HPE IDOL Speech Server Reference.
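The task names follow a regular pattern: LangId, then a mode code (Seg, Bnd, or Cum), then an input type (Wav for audio files, as in the procedure below, Stream for streamed audio, or Lif for existing .lif files). As a minimal sketch, the following Python helper derives a task name from that pattern. The function name langid_task_name and its mode and input keys are hypothetical and are not part of HPE IDOL Speech Server.

```python
# Mode and input-type fragments taken from the task names in this section.
MODE_CODES = {"segmented": "Seg", "boundary": "Bnd", "cumulative": "Cum"}
INPUT_CODES = {"file": "Wav", "stream": "Stream", "lif": "Lif"}


def langid_task_name(mode: str, source: str) -> str:
    """Return the LangId task name for a mode and an input type.

    Hypothetical helper: mode is "segmented", "boundary" or "cumulative";
    source is "file" (audio file), "stream" (streamed audio) or "lif"
    (existing .lif files).
    """
    try:
        return "LangId" + MODE_CODES[mode.lower()] + INPUT_CODES[source.lower()]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r} or input type {source!r}") from None


print(langid_task_name("boundary", "stream"))  # LangIdBndStream
```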
Use the following procedure to identify the languages in an audio file.
To identify languages in an audio file
Create a list that contains the file names (including file extensions) of the classifiers to use.
For more information about HPE IDOL Speech Server's list manager, see Create and Manage Lists.
Send an AddTask action to HPE IDOL Speech Server, and set the following parameters:
Type | The task name. Set to LangIdSegWav for segmented mode, LangIdBndWav for boundary mode, or LangIdCumWav for cumulative mode.
File | The audio file to process. To restrict processing to a section of the audio file, set the start and end times in the …
Out | The file to write the language identification results to.
Note: If you want to change the audio sample rate, or if you want to use your own custom classifiers, you must also set the ClassList parameter. You might also need to specify the ClassPath parameter, depending on the location of the classifier files. See the HPE IDOL Speech Server Reference for more information.
Note: If you use the base classifier pack, set the languages that you want to identify in the LangList configuration parameter in the langid module.
If you set Type to LangIdBndWav, you must also set the OutB parameter:
OutB | The file to write the boundary point information to.
For example:
http://localhost:13000/action=AddTask&Type=LangIdSegWav&File=C:\Data\Speech.wav&ClassList=ListManager\OptClassSet&ClassPath=C:\LangID\&Out=SpeechLang1.ctm
This action uses port 13000 to instruct HPE IDOL Speech Server, which is located on the local machine, to identify languages in the Speech.wav file using the language classifiers specified in the OptClassSet list, and to write the identification results to the SpeechLang1.ctm file.
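If you send AddTask actions from a script rather than a browser, the request URL can be assembled programmatically. The following is a minimal Python sketch, not an official client: the function name addtask_url is hypothetical, the parameter names are the ones used in the example above, and the values are percent-encoded with the standard urllib.parse.urlencode function. It also enforces the rule that LangIdBndWav needs an OutB parameter.

```python
from urllib.parse import urlencode


def addtask_url(host: str, port: int, params: dict) -> str:
    """Build an AddTask request URL in the style of the example above.

    urlencode percent-encodes the values, so Windows paths such as
    C:\\Data\\Speech.wav are transmitted safely.
    """
    # Boundary mode writes a second results file, so OutB is required.
    if params.get("Type") == "LangIdBndWav" and "OutB" not in params:
        raise ValueError("LangIdBndWav requires the OutB parameter")
    query = urlencode({"action": "AddTask", **params})
    return f"http://{host}:{port}/{query}"


url = addtask_url("localhost", 13000, {
    "Type": "LangIdSegWav",
    "File": r"C:\Data\Speech.wav",
    "ClassList": r"ListManager\OptClassSet",
    "ClassPath": "C:\\LangID\\",
    "Out": "SpeechLang1.ctm",
})
print(url)
# To send it from Python: urllib.request.urlopen(url).read() returns the response body.
```

Encoding the values is a defensive choice: characters such as backslashes and colons in Windows paths are then sent unambiguously, whatever client you use.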
This action returns a token. You can use the token to:
HPE IDOL Speech Server displays the results in XML format in your web browser. You can also open the .ctm file from the configured HPE IDOL Speech Server temporary directory (or from another location, if you specified a path in the Out parameter).
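Results are retrieved with the GetResults action, which this section also uses later with Label=OutB to fetch the boundary file. The following Python sketch builds such a request. It assumes that the token returned by AddTask is passed in a parameter named Token; that name is an assumption, so check the HPE IDOL Speech Server Reference before relying on it. The function name getresults_url is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def getresults_url(host: str, port: int, token: str,
                   label: Optional[str] = None) -> str:
    """Build a GetResults request URL for a task token.

    ASSUMPTION: the token is passed in a parameter named Token.
    Pass label="OutB" to fetch the boundary file written by LangIdBndWav.
    """
    params = {"action": "GetResults", "Token": token}
    if label is not None:
        params["Label"] = label
    return f"http://{host}:{port}/{urlencode(params)}"


print(getresults_url("localhost", 13000, "EXAMPLE_TOKEN"))
print(getresults_url("localhost", 13000, "EXAMPLE_TOKEN", label="OutB"))
```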
The following is an example of the .ctm output produced by the LangIdSegWav task.
1 L1 0.00 30.58 English 1.000 1.252
1 L2 0.00 30.58 German 0.686 1.252
1 L3 0.00 30.58 French 0.550 1.252
1 L1 30.58 28.30 German 1.000 1.306
1 L2 30.58 28.30 English 0.562 1.306
1 L3 30.58 28.30 Italian 0.517 1.306
1 L1 58.88 31.12 English 1.000 1.295
1 L2 58.88 31.12 French 0.680 1.295
1 L3 58.88 31.12 German 0.511 1.295
From left to right, the columns in the .ctm file contain:
- The value 1.
- The rank of the result (L1 is the top result, L2 the next best, and so on).
- The start time of the segment, in seconds.
- The duration of the segment, in seconds.
- The identified language.
- The score (a value from 0.0 to 1.0 if …; otherwise a log score is reported).
- The confidence score (1.0 and above; the higher the score, the more confident the system is that L1 is the correct answer).
The example shows a 90-second file being identified in segments of approximately 30 seconds each. For the first segment, English is identified as the most likely language (L1), followed by German (L2) and French (L3). For the next segment, German is ranked first. For the final segment, English is ranked first again. In SEGMENTED mode, it is common to see different results for each segment, because the language might change throughout the file. CUMULATIVE mode assesses the most dominant language across the whole file, so you would not expect to see these changes.
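To post-process the .ctm output, you can parse each line into its seven fields. The following Python sketch is illustrative only: the class LidResult and the functions parse_lid_ctm and top_language_per_segment are hypothetical names. It assumes the column order described above, with the fourth column being the segment duration, which is consistent with the example (the start times 0.00, 30.58, and 58.88 plus their durations add up to 90 seconds).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LidResult:
    rank: int          # 1 for L1, 2 for L2, ...
    start: float       # segment start time, in seconds
    duration: float    # segment duration, in seconds
    language: str
    score: float
    confidence: float


def parse_lid_ctm(text: str) -> List[LidResult]:
    """Parse language identification .ctm lines (seven whitespace-separated fields)."""
    results = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        _, rank, start, duration, language, score, confidence = fields
        results.append(LidResult(int(rank.lstrip("L")), float(start), float(duration),
                                 language, float(score), float(confidence)))
    return results


def top_language_per_segment(results: List[LidResult]) -> Dict[float, str]:
    """Map each segment start time to its top-ranked (L1) language."""
    return {r.start: r.language for r in results if r.rank == 1}


sample = """1 L1 0.00 30.58 English 1.000 1.252
1 L2 0.00 30.58 German 0.686 1.252
1 L1 30.58 28.30 German 1.000 1.306
1 L1 58.88 31.12 English 1.000 1.295"""
print(top_language_per_segment(parse_lid_ctm(sample)))
```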
The following example shows some of the same information displayed in XML format.
<lid_transcript>
  <lid_record>
    <start>0.000</start>
    <end>30.580</end>
    <label>English</label>
    <score>1.000</score>
    <confidence>1.252</confidence>
    <rank>1</rank>
  </lid_record>
  <lid_record>
    <start>0.000</start>
    <end>30.580</end>
    <label>German</label>
    <score>0.686</score>
    <confidence>1.252</confidence>
    <rank>2</rank>
  </lid_record>
</lid_transcript>
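The XML form can be read with Python's standard xml.etree.ElementTree module. In this minimal sketch (parse_lid_xml is a hypothetical name), root.iter finds lid_record elements at any depth, so the same code works whether the lid_transcript element is the document root or is nested inside a larger response. It assumes that every record contains start, end, rank, label, and score elements, as in the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def parse_lid_xml(xml_text: str) -> List[Tuple[float, float, int, str, float]]:
    """Return (start, end, rank, label, score) for each lid_record element."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("lid_record"):
        records.append((
            float(rec.findtext("start")),
            float(rec.findtext("end")),
            int(rec.findtext("rank")),
            rec.findtext("label"),
            float(rec.findtext("score")),
        ))
    return records


sample_xml = """<lid_transcript>
  <lid_record><start>0.000</start><end>30.580</end><label>English</label>
    <score>1.000</score><confidence>1.252</confidence><rank>1</rank></lid_record>
  <lid_record><start>0.000</start><end>30.580</end><label>German</label>
    <score>0.686</score><confidence>1.252</confidence><rank>2</rank></lid_record>
</lid_transcript>"""
print(parse_lid_xml(sample_xml))
```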
This output format is common to the SEGMENTED and CUMULATIVE modes. For the BOUNDARY mode, the output is a little different. The main results file, specified by the Out parameter, is unchanged in format. However, the time points occur whenever a language change is detected, instead of after a fixed time period. The BOUNDARY mode also produces an extra results file, specified by the OutB parameter. To retrieve this file, send a GetResults action that includes the parameter Label=OutB. The file provides information on the boundary change points only. For example:
1 X 21.97 21.97 English_French 1.000
1 X 61.82 61.82 French_English 1.000
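From left to right, the columns in this file contain:
- The value 1.
- The value X.
- The time of the boundary point, in seconds (in the example, this value appears in both the third and fourth columns).
- The language change, in the form FromLanguage_ToLanguage (for example, English_French).
- The value 1.0.
The following example shows the same results displayed in XML format: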
<lib_transcript>
  <lib_record>
    <time>21.970</time>
    <from>English</from>
    <to>French</to>
  </lib_record>
  <lib_record>
    <time>61.820</time>
    <from>French</from>
    <to>English</to>
  </lib_record>
</lib_transcript>
The XML shows that the language changed from English to French after 21.970 seconds, and then from French back into English after 61.820 seconds.
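Because the boundary results list only the change points, you need the total file duration to turn them into complete language spans. The following Python sketch (language_spans is a hypothetical name) assumes that the caller supplies the file duration, which the boundary results do not contain, and that the records can be ordered by time. It builds one (start, end, language) span before each change point and a final span up to the end of the file.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def language_spans(lib_xml: str, file_duration: float) -> List[Tuple[float, float, str]]:
    """Turn lib_record boundary points into (start, end, language) spans."""
    root = ET.fromstring(lib_xml)
    changes = sorted(
        (float(r.findtext("time")), r.findtext("from"), r.findtext("to"))
        for r in root.iter("lib_record")
    )
    if not changes:
        return []  # no boundary detected; the language is not recorded in this file
    spans = []
    start = 0.0
    for change_time, from_lang, _ in changes:
        spans.append((start, change_time, from_lang))  # language before this change
        start = change_time
    spans.append((start, file_duration, changes[-1][2]))  # language after the last change
    return spans


boundary_xml = """<lib_transcript>
  <lib_record><time>21.970</time><from>English</from><to>French</to></lib_record>
  <lib_record><time>61.820</time><from>French</from><to>English</to></lib_record>
</lib_transcript>"""
# 90.0 is an illustrative duration; use the real length of your file.
print(language_spans(boundary_xml, 90.0))
```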