Order of Language Processes

This section describes the order in which IDOL Server performs its language processing steps during indexing and at query time.

At Index Time

During indexing, IDOL Server performs its language-dependent processing in the following order:

  1. Automatic Language Detection (ALD) and content enrichment such as sentiment analysis. These methods process the original text.

  2. Sentence breaking and NGram tokenization. These libraries are used primarily for Asian languages, where word boundaries are not marked by spaces or punctuation.

    NOTE:

    Depending on the library and configuration, sentence breaking can result in the text being tokenized, normalized, and even stemmed, which can affect the processing steps that follow.

  3. Tokenization. This step converts the original text into sequences of tokens against which queries are matched. Tokenization classifies each character as a text character, a separator character, or a non-separator character, depending on rules configured per language. The rules are applied in the following order of precedence, so a character matches a rule nearer the top of the table in preference to one nearer the bottom:

    Rule                 Example Value   Effect
    HyphenChars          .               joe.smith tokenized as joe, smith and joesmith
    NumberPunctuation    .               3.14 tokenized as 3.14, not 3 and 14
    TangibleCharacters   +               c++ tokenized as c++, not c
    SoftSeparators       1234567890      123 tokenized as 1, 2, 3, not 123
    AugmentSeparators    .               joe.smith tokenized as joe, smith

  4. Normalization and transliteration. This process maps characters into a canonical form, so that a search for any one form of a term matches text that contains any of its other forms. This step uses built-in single-language or cross-linguistic rules.

  5. Stop word detection. This process identifies words that are deemed to play no useful part in retrieval and can be ignored at both index and query time. Stop words are configured per language or cross-linguistically in sets of words called stop lists. Stop lists should contain words as they appear after tokenization and transliteration but before stemming:

    Good Stop Word   Bad Stop Word   Reason
    dont             don't           Stop word detection occurs after tokenization.
    cafe             café            Stop word detection occurs after transliteration.
    site + sites     site            Stop word detection occurs before stemming.

  6. Stemming. This step reduces words to their linguistic root, to enable searches to match all forms of a term. This can be done using per-language rules or using generic cross-language rules. New stem rules should contain words that have been tokenized and transliterated:

    Good Stemming Rule   Bad Stemming Rule   Reason
    formulae → formula   formulæ → formula   Stemming occurs after transliteration.

  7. Decomposition. This step breaks up single compounded words into multiple parts, to enable searches to match subparts of the word. Decomposition rules are particularly useful in languages such as German and Hungarian. New decomposition rules should contain compound words that are tokenized, transliterated and stemmed:

    Good Decomposition Rule   Bad Decomposition Rule                              Reason
    mousetrap → mouse trap    mousetrap → mouse trap + mousetraps → mouse traps   Decomposition occurs after stemming.
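
The ordering of steps 3 to 7 explains why stop lists, stemming rules, and decomposition rules must be written in the form that a term has when it reaches that step. The following Python sketch is not IDOL Server code and does not use any IDOL API. It is a toy pipeline under stated assumptions: the rule sets, the function names, and the treatment of the apostrophe as a character that is dropped without splitting the token are all invented for illustration, and it emits only the parts of a decomposed word.

```python
import unicodedata

# All values below are hypothetical examples, not IDOL Server defaults.
TANGIBLE_CHARS = {"+"}            # kept inside tokens, so c++ stays c++
DROPPED_CHARS = {"'"}             # assumed: removed without splitting a token
EXTRA_TRANSLITERATION = {"æ": "ae"}
STOP_LIST = {"dont", "cafe", "site", "sites"}
STEM_RULES = {"formulae": "formula", "mousetraps": "mousetrap"}
DECOMPOSITION_RULES = {"mousetrap": ["mouse", "trap"]}


def tokenize(text):
    """Step 3 (simplified): split on anything that is not a text character."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in TANGIBLE_CHARS:
            current.append(ch)
        elif ch in DROPPED_CHARS:
            continue  # dropped, but the token carries on
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def transliterate(token):
    """Step 4 (simplified): lower-case, expand ligatures, strip accents."""
    token = "".join(EXTRA_TRANSLITERATION.get(ch, ch) for ch in token.lower())
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def process(text):
    """Run steps 3 to 7 in the documented order and return the final terms."""
    terms = []
    for token in tokenize(text):
        term = transliterate(token)
        if term in STOP_LIST:                 # step 5: sees unstemmed terms
            continue
        term = STEM_RULES.get(term, term)     # step 6: sees transliterated terms
        terms.extend(DECOMPOSITION_RULES.get(term, [term]))  # step 7: sees stems
    return terms


if __name__ == "__main__":
    print(process("Don't café sites formulæ Mousetraps c++"))
```

In this sketch a stop list entry of don't or café could never match, because the term has already become dont or cafe by the time the stop list is consulted. Likewise, a decomposition rule for mousetraps is redundant, because stemming has already reduced the term to mousetrap.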

For Queries

When handling queries, IDOL Server performs the same sequence of operations as it does during indexing, with a few small modifications:

