Advanced Tokenization

This section describes some exceptions to the usual language tokenization. You might need to consider these options if you have documents in Asian languages, or if the standard IDOL Server configuration options do not provide adequate support for your custom scenarios.

Sentence Breaking

In most writing systems, spaces and punctuation characters mark the boundaries between one word token and the next. IDOL Server uses these boundaries to divide text into searchable tokens.

In Chinese, Japanese, Korean, and Thai, the text contains no spaces between words, and each character typically occupies multiple bytes in common encodings. Because of this, IDOL Server cannot easily split the text into searchable tokens.
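For example, naive whitespace tokenization works for an English phrase but returns a Chinese phrase as a single, unsearchable token. The following Python sketch (illustrative only; it is not part of IDOL Server) demonstrates the problem:

    # Naive whitespace tokenization: adequate for English, but it cannot
    # split Chinese text, which contains no spaces between words.
    def whitespace_tokenize(text):
        return text.split()

    print(whitespace_tokenize("knowledge management software"))
    # ['knowledge', 'management', 'software']

    print(whitespace_tokenize("知识管理软件"))  # the same phrase in Chinese
    # ['知识管理软件'] -- a single token, because there are no spaces to split on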

For these languages, IDOL Server uses additional rules and methods to divide the text into tokens. The IDOL Server installation includes these methods in sentence breaking libraries, each of which is specific to a particular language. You can also install and use a sentence breaking library that uses the tokenization module written by Basis Technology.

You can also use stemming files with the sentence breaking libraries to create user-defined dictionaries. These dictionaries define how the sentence breaking library handles new words (such as new company names) and words that are specific to your use case (such as product names or jargon). You can also create user dictionaries with the Basis Sentence Breaking Libraries. For more information, refer to the Basis Sentence Breaking Technical Note.
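Conceptually, a sentence breaking library segments text by matching it against a dictionary, and a user dictionary extends that dictionary with your own entries. The following Python sketch illustrates the idea with a greedy longest-match segmenter; it is a conceptual example only, and the dictionaries shown are not in the IDOL stemming-file or Basis user-dictionary format:

    # Conceptual sketch: dictionary-based segmentation with a user dictionary.
    BASE_DICT = {"知识", "管理", "软件"}   # "knowledge", "management", "software"
    USER_DICT = {"知识管理软件"}           # for example, a product name you add

    def greedy_segment(text, dictionary):
        # Greedy longest-match segmentation: at each position, take the
        # longest dictionary entry; fall back to a single character.
        tokens, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in dictionary or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens

    print(greedy_segment("知识管理软件", BASE_DICT))
    # ['知识', '管理', '软件']
    print(greedy_segment("知识管理软件", BASE_DICT | USER_DICT))
    # ['知识管理软件'] -- the user-defined entry is kept as one token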

Custom Tokenization

In rare cases, you might want to use a different tokenization system from the one that IDOL Server uses as standard, for example if the default stemming rules are not appropriate for your documents.

In these cases, you can create a custom sentence breaking library that defines the language-specific rules for custom tokenization in your language. For more information, refer to the IDOL Sentence Breaking API Technical Note.

N-Gram Tokenization

N-Gram tokenization is another approach that you can use for non-tokenized languages such as Chinese, Japanese, Korean, and Thai. With N-Grams, IDOL Server splits all text into tokens that contain N characters, where N is an integer value that you can configure.
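The following Python sketch shows the technique (it is not IDOL Server's implementation): a sliding window produces overlapping N-character tokens, which is the usual form of N-Gram tokenization for unsegmented text:

    def ngram_tokenize(text, n=2):
        # Produce overlapping tokens of n characters with a sliding window.
        if len(text) < n:
            return [text] if text else []
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(ngram_tokenize("知识管理软件", n=2))
    # ['知识', '识管', '管理', '理软', '软件']

Because every N-character window becomes a token, a query tokenized in the same way always matches the index, at the cost of a larger index and some false matches across word boundaries.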

This method is primarily for simple tokenization of Asian languages.

You can also use this option to provide consistent tokenization between multiple languages (Cross-Lingual Search). For example, if you want to search for both Chinese and Japanese text in a single query, you must ensure that tokenization is consistent for both languages. Sentence breaking libraries are language-specific, so you must use N-Gram tokenization instead.

When you have documents that contain text from multiple languages, you can use the NGramMultibyteOnly and NGramOrientalOnly configuration parameters. These parameters limit N-Gram tokenization to multibyte characters only, or to Chinese, Japanese, and Korean characters only.
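The following Python sketch illustrates the effect of this kind of restriction conceptually (it is not IDOL Server's implementation, and it approximates the affected characters with the CJK Unified Ideographs block): N-Grams apply only to runs of CJK characters, while the surrounding text is tokenized on whitespace as usual:

    import re

    # Runs of CJK Unified Ideographs (an approximation for illustration).
    CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

    def mixed_tokenize(text, n=2):
        tokens, pos = [], 0
        for match in CJK_RUN.finditer(text):
            tokens.extend(text[pos:match.start()].split())  # non-CJK: whitespace
            run = match.group()
            tokens.extend(run[i:i + n] for i in range(max(len(run) - n + 1, 1)))
            pos = match.end()
        tokens.extend(text[pos:].split())
        return tokens

    print(mixed_tokenize("IDOL Server 知识管理 software"))
    # ['IDOL', 'Server', '知识', '识管', '管理', 'software']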

 
