Training Text for Custom Language Models

To produce an effective custom language model, you must build it using text that resembles the data that you want to process. For example, if you intend to apply the speech-to-text task to news monitoring, you would train the language model using recent news articles gathered from a wide range of sources.

The standard IDOL Speech Server English language model is constructed from text that contains many billions of words and covers a wide range of topics. Such wide coverage significantly reduces the amount of text required to build the custom language model. In deployment, the standard language model and the custom language model are used together, interpolated with an appropriate weight.
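Interpolation in this sense means combining the probability estimates of the two models as a weighted sum. The following sketch illustrates the general technique with linear interpolation; the probabilities, weight, and function name are illustrative placeholders, and the actual mechanism IDOL Speech Server uses internally may differ.

```python
def interpolate(p_custom, p_standard, weight):
    """Combine two language-model probabilities for the same word
    using a linear interpolation weight in the range [0, 1]."""
    return weight * p_custom + (1.0 - weight) * p_standard

# Hypothetical probabilities for a topic-specific word given some context:
p_custom = 0.012    # custom (topic-specific) language model
p_standard = 0.001  # standard broad-coverage language model

# With weight=0.5, both models contribute equally; raising the weight
# biases recognition toward the custom model's vocabulary and phrasing.
p = interpolate(p_custom, p_standard, weight=0.5)
```

A higher weight favors the custom model, which helps with in-topic vocabulary but can hurt accuracy on general speech, so the weight is typically tuned on representative audio.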

The amount of text required to build a custom language model can vary from a few thousand words to several hundred thousand words, depending on the topic. Generally, the more text that is used to build the custom language model, the more accurate the model is. However, the gains in accuracy start tapering off beyond a certain number of words. The number of words depends on the size of the topic; for a typical topic (for example, technical support), the tapering might begin around 100,000 words.

