Recombine Word Fragments

Certain speech-to-text language packs, such as Hebrew (HBIL), are based on a vocabulary that has been broken down into its smallest parts to keep the vocabulary size manageable. When you use one of these language packs to perform speech-to-text on audio, the results can contain word fragments that must be joined back together in the final output. The postproc module can recombine these fragments into complete words. If adjacent words in the results file include hyphens to indicate a word break, the module treats them as prefixes or suffixes and joins them to the stem of the word. For example, the postproc module could receive the following word sequence (shown in CTM format):

1 A 0.000 0.351 a 0.513
1 A 0.351 0.194 pre- 0.325
1 A 0.545 0.419 exist 0.457
1 A 0.964 0.140 -ing 0.621
1 A 1.104 0.855 condition 0.369

The module would combine the prefix “pre-”, the stem “exist”, and the suffix “-ing”, as shown in the following example:

1 A 0.000 0.351 a 0.513
1 A 0.351 0.753 preexisting 0.457
1 A 1.104 0.855 condition 0.369
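
The following Python sketch illustrates the recombination shown above. It is not the postproc module's implementation. The names CtmWord, parse_ctm, format_ctm, and recombine are hypothetical, and the merge rules are assumptions read off the example: the combined entry starts where the first fragment starts, its duration is the sum of the fragment durations (0.194 + 0.419 + 0.140 = 0.753), and it keeps the confidence score of the stem.

```python
from typing import List, NamedTuple


class CtmWord(NamedTuple):
    """One CTM line: file, channel, start, duration, word, confidence."""
    file_id: str
    channel: str
    start: float
    duration: float
    text: str
    confidence: float


def parse_ctm(line: str) -> CtmWord:
    file_id, channel, start, duration, text, confidence = line.split()
    return CtmWord(file_id, channel, float(start), float(duration),
                   text, float(confidence))


def format_ctm(word: CtmWord) -> str:
    return (f"{word.file_id} {word.channel} {word.start:.3f} "
            f"{word.duration:.3f} {word.text} {word.confidence:.3f}")


def recombine(words: List[CtmWord]) -> List[CtmWord]:
    """Join prefixes (trailing hyphen) and suffixes (leading hyphen) to a stem."""
    merged: List[CtmWord] = []
    i = 0
    while i < len(words):
        j = i
        while j < len(words) and words[j].text.endswith("-"):
            j += 1  # step over any prefixes to reach the stem
        if j == len(words) or words[j].text.startswith("-"):
            # No stem follows the prefixes: leave these fragments untouched.
            merged.extend(words[i:j + 1])
            i = j + 1
            continue
        k = j + 1
        while k < len(words) and words[k].text.startswith("-"):
            k += 1  # absorb any suffixes that follow the stem
        group = words[i:k]
        if len(group) == 1:
            merged.append(group[0])
        else:
            merged.append(CtmWord(
                group[0].file_id,
                group[0].channel,
                group[0].start,                              # start of first fragment
                round(sum(w.duration for w in group), 3),    # summed durations (assumption)
                "".join(w.text.strip("-") for w in group),
                words[j].confidence,                         # stem's confidence (assumption)
            ))
        i = k
    return merged


lines = [
    "1 A 0.000 0.351 a 0.513",
    "1 A 0.351 0.194 pre- 0.325",
    "1 A 0.545 0.419 exist 0.457",
    "1 A 0.964 0.140 -ing 0.621",
    "1 A 1.104 0.855 condition 0.369",
]
for word in recombine([parse_ctm(line) for line in lines]):
    print(format_ctm(word))
```

Run on the five input lines, the sketch prints the three-line result shown above.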

Speech-to-text results can contain errors, which can produce word fragments that would combine to form invalid words. To avoid producing invalid words, you can supply the postproc module with a list of all valid words for a language (this list file is provided in the language pack). The module then combines word fragments only if they form words that are in the list. Any other word fragments are left uncombined.

To specify valid words for a language

If you do not specify a valid words list, the module combines word fragments wherever indicated by hyphens, without attempting to validate the resulting words.
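
The two behaviors described above (with and without a valid words list) can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's actual logic. The function name recombine_tokens is hypothetical, the valid words are passed as an in-memory set rather than read from the language pack's list file, and a group of fragments is treated as all-or-nothing: if the fully joined word is not in the list, every fragment in the group is left as it was.

```python
from typing import List, Optional, Set


def recombine_tokens(tokens: List[str],
                     valid_words: Optional[Set[str]] = None) -> List[str]:
    """Join hyphen-marked fragments, optionally validating the joined word.

    valid_words=None mimics running without a valid words list: fragments
    are joined wherever hyphens indicate, with no validation.
    """
    result: List[str] = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j].endswith("-"):
            j += 1  # step over prefixes
        if j == len(tokens) or tokens[j].startswith("-"):
            # No stem available: keep the fragments unchanged.
            result.extend(tokens[i:j + 1])
            i = j + 1
            continue
        k = j + 1
        while k < len(tokens) and tokens[k].startswith("-"):
            k += 1  # absorb suffixes
        group = tokens[i:k]
        candidate = "".join(t.strip("-") for t in group)
        if len(group) > 1 and (valid_words is None or candidate in valid_words):
            result.append(candidate)
        else:
            # Single word, or joined word not in the list (all-or-nothing assumption).
            result.extend(group)
        i = k
    return result


tokens = ["a", "pre-", "exist", "-ing", "condition"]
print(recombine_tokens(tokens))                    # no list: always join
print(recombine_tokens(tokens, {"preexisting"}))   # joined word is valid
print(recombine_tokens(tokens, {"existing"}))      # joined word not listed
```

The first two calls return the joined word "preexisting" between "a" and "condition". The third returns the tokens unchanged, because "preexisting" is not in the supplied set.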

