This section describes the order in which IDOL Server performs the various language processing steps during indexing and at query time.
During indexing, IDOL Server performs its language-dependent processing in the following order:
Automatic Language Detection (ALD) and content enrichment such as sentiment analysis. These methods process the original text.
Sentence breaking and NGram tokenization. These libraries are used primarily for Asian languages, where gaps between words are not defined by spaces and punctuation marks.
Depending on the library and configuration, sentence breaking can result in the text being tokenized, normalized and even stemmed, which can affect the subsequent steps.
Tokenization. This step converts the original text into sequences of tokens against which queries are matched. Tokenization classifies characters as text characters, separator characters or non-separator characters, depending on configured per-language rules. The rules are applied in the following order of precedence, meaning that a character matches rules at the top of the table in preference to those at the bottom:
Rule | Example Value | Effect |
---|---|---|
HyphenChars | . | joe.smith tokenized as joe, smith and joesmith |
NumberPunctuation | . | 3.14 tokenized as 3.14, not 3 and 14 |
TangibleCharacters | + | c++ tokenized as c++, not c |
SoftSeparators | 1234567890 | 123 tokenized as 1, 2, 3, not 123 |
AugmentSeparators | . | joe.smith tokenized as joe, smith |
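The precedence order can be illustrated with a small lookup. This is a minimal sketch, not IDOL Server's tokenizer: the rule names come from the table above, but the character sets, the `RULES_BY_PRECEDENCE` list and the `first_matching_rule` function are hypothetical, and the sketch ignores context such as whether a character sits between digits.

```python
# Hypothetical per-language configuration, listed in order of precedence.
# Rule names are taken from the table; the character sets are illustrative only.
RULES_BY_PRECEDENCE = [
    ("HyphenChars", "."),
    ("NumberPunctuation", ".,"),
    ("TangibleCharacters", "+"),
    ("SoftSeparators", "1234567890"),
    ("AugmentSeparators", "./"),
]


def first_matching_rule(char, rules=RULES_BY_PRECEDENCE):
    """Return the name of the highest-precedence rule that lists char, or None."""
    for name, chars in rules:
        if char in chars:
            return name
    return None


for char in [".", ",", "/", "a"]:
    print(repr(char), "->", first_matching_rule(char))
```

In this sketch a period is listed under three rules, but only `HyphenChars` applies to it because that rule comes first.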
Normalization and transliteration. This process maps characters into a canonical form so that a search for different forms matches a term containing any of these forms. This happens using built-in single language or cross-linguistic rules.
Stop word detection. This process identifies words that are deemed to play no useful part in retrieval and can be ignored at both index and query time. Stop words are configured per-language or cross-linguistically in sets of words called stop lists. Stop lists should contain words that have been tokenized and transliterated but not stemmed:
Good Stop Word | Bad Stop Word | Reason |
---|---|---|
dont | don't | Stop word detection occurs after tokenization. |
cafe | café | Stop word detection occurs after transliteration. |
site + sites | site | Stop word detection occurs before stemming. |
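One way to prepare stop list entries is to pass each candidate word through a rough approximation of tokenization and transliteration, and to stop before stemming. The sketch below assumes that tokenization drops apostrophes and that transliteration strips accents; the `to_stop_list_form` function is hypothetical and stands in for IDOL Server's built-in per-language rules, which may behave differently.

```python
import unicodedata


def to_stop_list_form(word):
    """Approximate tokenization and transliteration, without stemming."""
    # Tokenization stand-in: drop straight and curly apostrophes ("don't" -> "dont").
    word = word.replace("'", "").replace("\u2019", "")
    # Transliteration stand-in: decompose accented letters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


for candidate in ["don't", "café", "sites"]:
    print(candidate, "->", to_stop_list_form(candidate))
```

Because no stemming is applied, `sites` stays as `sites`, which is why both `site` and `sites` need their own entries.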
Stemming. This step reduces words to their linguistic root, to enable searches to match all forms of a term. This can be done using per-language rules or using generic cross-language rules. New stem rules should contain words that have been tokenized and transliterated:
Good Stemming Rule | Bad Stemming Rule | Reason |
---|---|---|
formulae → formula | formulæ → formula | Stemming occurs after transliteration. |
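A proposed stemming rule can be checked in the same spirit before it is added. This sketch assumes a tiny, hypothetical transliteration table (`LIGATURES`) and a hypothetical `check_stem_rule` helper. Only the æ → ae mapping is implied by the table above; the other entry is illustrative.

```python
# Hypothetical transliteration table; IDOL Server's built-in rules are not reproduced here.
LIGATURES = {"æ": "ae", "œ": "oe"}


def transliterate(word):
    """Lower-case the word and expand the ligatures listed above."""
    return "".join(LIGATURES.get(ch, ch) for ch in word.lower())


def check_stem_rule(source, target):
    """Return a list of problems with a proposed rule source -> target."""
    problems = []
    for label, word in (("source", source), ("target", target)):
        expected = transliterate(word)
        if word != expected:
            problems.append(f"{label} {word!r} should be written as {expected!r}")
    return problems


print(check_stem_rule("formulae", "formula"))
print(check_stem_rule("formulæ", "formula"))
```

The first rule passes with no problems reported, while the second is flagged because its source word still contains a ligature that transliteration would already have expanded.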
Decomposition. This step breaks up single compounded words into multiple parts, to enable searches to match subparts of the word. Decomposition rules are particularly useful in languages such as German and Hungarian. New decomposition rules should contain compound words that are tokenized, transliterated and stemmed:
Good Decomposition Rule | Bad Decomposition Rule | Reason |
---|---|---|
mousetrap → mouse trap | mousetrap → mouse trap + mousetraps → mouse traps | Decomposition occurs after stemming. |
When handling queries, IDOL Server performs the same sequence of operations as described above for indexing, with a few small modifications:
Query Manipulation (QMS) including the QMS synonym functionality. QMS manipulates the original unprocessed (and therefore non-stemmed) query text. You should take this process into account when configuring QMS synonym rules.
Synonym Content Engines. Unlike QMS synonym searches, these are applied after all the language processing. Therefore, the synonym documents do not need to include non-stemmed text forms.
Decomposition. Decomposition occurs only at indexing time, not for queries.
From the previous example, searching for mousetrap returns only results about mousetraps. Searching for mouse returns results about mice as well as mousetraps.
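The effect of decomposing at index time only can be sketched with toy stand-ins for the real processing. Everything here is hypothetical: the stemmer, the `IRREGULAR_STEMS` and `DECOMPOSITION_RULES` tables, and the assumption that the index keeps the compound term alongside its parts.

```python
# Toy stand-ins for the real language processing; all rules here are hypothetical.
IRREGULAR_STEMS = {"mice": "mouse"}
DECOMPOSITION_RULES = {"mousetrap": ["mouse", "trap"]}


def stem(token):
    """Very rough English stemmer: irregular forms first, then a trailing 's'."""
    if token in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def process(text, decompose):
    """Split on whitespace, stem each token, and optionally decompose compounds."""
    terms = set()
    for token in text.lower().split():
        stemmed = stem(token)
        terms.add(stemmed)
        if decompose:
            # Assumption: the compound is kept in the index alongside its parts.
            terms.update(DECOMPOSITION_RULES.get(stemmed, []))
    return terms


def index_terms(text):
    return process(text, decompose=True)   # decomposition applies when indexing


def query_terms(text):
    return process(text, decompose=False)  # but not when querying


def matches(query, document):
    return bool(query_terms(query) & index_terms(document))


docs = ["cheap mousetraps", "field mice"]
for query in ["mousetrap", "mouse"]:
    print(query, "->", [d for d in docs if matches(query, d)])
```

With these toy rules, the query `mousetrap` matches only the mousetrap document, while `mouse` matches both documents.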