HPE recommends iterative alignment if the alignment quality is poor or if large sections of audio have not been aligned. This situation can arise when aligning very long audio. In iterative alignments, alignment occurs over two or more steps:
1. Use TranscriptCheck to produce approximate timings for the transcript.
2. (Optional) Perform alignment at the word level.
   a. Use the TranscriptAlign task with the MatchType configuration parameter value set to words to align the audio. Retrieve the alignment output in .ctm file format.
   b. Convert the .ctm file to use it as the transcript text.
3. Perform alignment at the prons level, either by using the TranscriptCheck approximate transcript time output file, or the modified output from the optional transcript alignment at word level.
Converting the .ctm file involves normalization and ensuring that there is only one word on each line, for example:
Article
one
All
human
beings
are
born
free
You can optionally follow words with a pair of numbers that specify the earliest start time and latest end time in seconds at which the word can appear in the aligned output, for example:
Article  0.000  1.000
one      0.000  1.000
All      0.000  1.000
human    0.500  1.500
beings   0.500  1.500
are      1.000  2.000
born     1.000  2.000
free     1.000  2.000
This example indicates that the word Article must appear between 0.000 and 1.000 seconds in the aligned output, human must appear between 0.500 and 1.500 seconds, and so on.
HPE IDOL Speech Server cannot perform this conversion automatically. HPE recommends that you subtract a small amount of time from the word start positions, and add the same amount to the word end positions, generated by the initial alignment. This widening allows the second alignment stage to make small adjustments to the word start and end points.
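Because the conversion must be done outside the server, a small script can handle it. The following Python sketch is one possible approach, not an HPE-supplied tool. It assumes that the .ctm file uses NIST-style columns (file, channel, start time, duration, word); if your .ctm output has a different layout, adjust the column indexes. The function names and the 0.5 second default margin are illustrative choices, and text normalization is not covered here.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, float, float]  # (word, start seconds, end seconds)


def parse_ctm(lines: Iterable[str], start_col: int = 2,
              dur_col: int = 3, word_col: int = 4) -> List[Entry]:
    """Read word timings from CTM lines.

    ASSUMPTION: NIST-style columns (file, channel, start, duration, word).
    Change the column indexes if your .ctm file is laid out differently.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):  # skip blanks and CTM comments
            continue
        fields = line.split()
        start = float(fields[start_col])
        end = start + float(fields[dur_col])
        entries.append((fields[word_col], start, end))
    return entries


def pad_windows(entries: Iterable[Entry], margin: float = 0.5) -> List[Entry]:
    """Widen each word's window by `margin` seconds on both sides.

    The margin value is a hypothetical default; start times are clamped at 0.
    """
    return [(word, max(0.0, start - margin), end + margin)
            for word, start, end in entries]


def format_transcript(entries: Iterable[Entry]) -> str:
    """Emit one word per line, followed by earliest start and latest end."""
    return "\n".join(f"{word} {start:.3f} {end:.3f}"
                     for word, start, end in entries)


if __name__ == "__main__":
    sample = [
        "audio 1 0.50 0.25 Article",
        "audio 1 0.75 0.25 one",
    ]
    print(format_transcript(pad_windows(parse_ctm(sample), margin=0.25)))
```

A larger margin gives the second alignment stage more freedom to move word boundaries, while a smaller margin keeps it closer to the initial alignment.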