Build the Language Model

After you have selected and prepared the training text files, you can build the custom language model.

To build the language model

  1. Create a list that contains the file names (including file extensions) of all training text files. You do not have to include the file paths because you can use the DataPath parameter to specify the directory path in the next step.

    For more information about IDOL Speech Server's list manager, see Create and Manage Lists.

  2. Send an AddTask action to IDOL Speech Server, and set the following parameters:

    Type The task name. Set to LanguageModelBuild.
    ContentDatabase The IDOL Content component database to retrieve text data from. This parameter has an effect only if you set ContentHost. If you set ContentHost but do not set this parameter, IDOL Speech Server retrieves text from all databases.
    ContentHost The host name or IP address of the IDOL Content component that you want to retrieve training text data from.
    ContentPort The ACI port of the IDOL Content component that you want to retrieve training text data from. This parameter has an effect only if you set ContentHost. By default, IDOL Speech Server uses port 9100 to contact the IDOL Content component.
    ContentTextTag The IDOL document fields that contain the text that you want to use to train the language model. Separate multiple field names with spaces, commas, or plus symbols (+). By default, IDOL Speech Server uses the content of the DRECONTENT document field as training text.
    DataList The list that specifies the training text files.

    DataPath The path to the directory that contains the files specified in the DataList parameter.
    KeepList The path to a file that contains a list of words that the language model must contain. For more information on the format of the file, see the IDOL Speech Server Reference.
    Lang The language pack to use as a base (for example, ENUK-tel).
    NewLanguageModel The name to give the custom language model that is generated. You must include the file extension (.tlm) in the parameter.
    NewLmInfoFile The output Language Model Information file name. If you set this parameter, you must include the file extension (.lmi) in the parameter.

    If you do not set this parameter, the file has the same name as the generated language model (and is located in the same directory), but with the extension .lmi instead of .tlm.

    NOTE:

    You can use the GetResults action to retrieve the .lmi file by setting the Label parameter to lmi.

    NewDictionary The name of the dictionary to generate; usually it is the same value as NewLanguageModel. If you set this parameter, you must include the file extension (.dct.sz) in the parameter.

    If you do not set the NewDictionary parameter, IDOL Speech Server uses the file name specified in the NewLanguageModel parameter, but with the extension .dct.sz instead of .tlm.

    DoSmoothing If you are using a custom language model for a transcript alignment task, set DoSmoothing to False. Otherwise, you can use the default value of True.

    If the training text files contain Japanese, Korean, Mandarin, or Taiwanese Mandarin text, also set the DoSegment parameter:

    DoSegment Set to True to enable text segmentation.

For example:

http://localhost:15000/action=AddTask&Type=LanguageModelBuild&DataList=ListManager/Langmodel&DataPath=C:\LanguageModelFiles&Lang=ENUK-tel&NewLanguageModel=mymodel.tlm

This action uses the training text specified in the Langmodel list and the ENUK-tel language pack to build a new language model and dictionary file, both named mymodel. This action also calculates a recommended interpolation weight at the end of the language model building process.

NOTE:

The interpolation weight is only a suggestion; you can choose to set other weights.
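The example action above can also be assembled programmatically. The following is a minimal sketch in Python, not part of the product: the helper name build_add_task_url is hypothetical, and percent-encoding the parameter values is an assumption based on general URL practice rather than something this section states. The host, port, and parameter values are taken from the example above.

```python
from urllib.parse import quote


def build_add_task_url(host, port, params):
    """Assemble an AddTask action URL in the same shape as the example above.

    Hypothetical helper for illustration; it only builds the string and
    does not contact the server.
    """
    # The section states that NewLanguageModel must include the .tlm extension.
    model = params.get("NewLanguageModel", "")
    if model and not model.endswith(".tlm"):
        raise ValueError("NewLanguageModel must include the .tlm extension")

    # Assumption: values are percent-encoded so that characters such as
    # ':' and '\' in a Windows path are transmitted safely.
    query = "&".join(
        f"{key}={quote(str(value), safe='/')}" for key, value in params.items()
    )
    return f"http://{host}:{port}/action=AddTask&{query}"


params = {
    "Type": "LanguageModelBuild",
    "DataList": "ListManager/Langmodel",
    "DataPath": r"C:\LanguageModelFiles",
    "Lang": "ENUK-tel",
    "NewLanguageModel": "mymodel.tlm",
}
url = build_add_task_url("localhost", 15000, params)
print(url)
```

Keeping the parameters in a dictionary makes it straightforward to add optional settings such as DoSmoothing or DoSegment without editing the URL by hand.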

The new language models are placed in the custom language models folder that is specified by the CustomLmDir parameter in the IDOL Speech Server configuration file.
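The default output file names described for the NewLmInfoFile and NewDictionary parameters follow a simple rule: replace the .tlm extension of the NewLanguageModel value with .lmi or .dct.sz. The sketch below illustrates that rule only; the function name default_output_names is hypothetical, and the code does not query the server or resolve the CustomLmDir directory.

```python
def default_output_names(new_language_model):
    """Return the file names used when NewLmInfoFile and NewDictionary are not set.

    Illustrative only: mirrors the naming rule described in the parameter list.
    """
    suffix = ".tlm"
    if not new_language_model.endswith(suffix):
        raise ValueError("NewLanguageModel must include the .tlm extension")

    # Strip the .tlm extension and reuse the remaining name for both outputs.
    stem = new_language_model[: -len(suffix)]
    return {
        "NewLmInfoFile": stem + ".lmi",
        "NewDictionary": stem + ".dct.sz",
    }


names = default_output_names("mymodel.tlm")
print(names)
```

For the example action, this yields mymodel.lmi and mymodel.dct.sz alongside mymodel.tlm.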

This action returns a token. You can use the token to:

