Order of Language Processes

At Index Time

During the index process, IDOL performs its language-dependent processing in the following order:

Automatic Language Detection (ALD) and content enrichment such as sentiment analysis. These methods process the original text.

Sentence breaking and NGram tokenization. These libraries are used primarily for Asian languages, where gaps between words are not defined by spaces and punctuation marks.

NOTE:

Sentence breaking can, depending on the library and configuration, result in the text being tokenized, normalized and even stemmed, which can affect the subsequent processing below.

Tokenization. This step converts the original text into the sequences of tokens that queries are matched against. Tokenization classifies each character as a text character, a separator character or a non-separator character, depending on the rules configured for each language. The rules are applied in the following order of precedence, so a character matches a rule nearer the top of the table in preference to one nearer the bottom:

Rule	Example Value	Effect
`HyphenChars`	.	joe.smith tokenized as joe, smith and joesmith
`NumberPunctuation`	.	3.14 tokenized as 3.14, not 3 and 14
`TangibleCharacters`	+	c++ tokenized as c++, not c
`SoftSeparators`	1234567890	123 tokenized as 1, 2, 3, not 123
`AugmentSeparators`	.	joe.smith tokenized as joe, smith
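
The order of precedence can be pictured with a small Python sketch. This is a toy model, not IDOL's implementation: the `classify` function and the `config` dictionary are hypothetical names, and the model ignores context (for example, it does not check whether a `NumberPunctuation` character sits between digits). It only shows that a character listed in several rules is claimed by the rule highest in the table.

```python
# Toy model of rule precedence. Rule names come from the table above;
# everything else is an illustrative assumption.
PRECEDENCE = [
    "HyphenChars",
    "NumberPunctuation",
    "TangibleCharacters",
    "SoftSeparators",
    "AugmentSeparators",
]


def classify(ch, config):
    """Return the highest-precedence rule that lists ch, or None."""
    for rule in PRECEDENCE:
        if ch in config.get(rule, ""):
            return rule
    return None


# "." is configured in two rules; HyphenChars is higher in the table.
config = {
    "HyphenChars": ".",
    "TangibleCharacters": "+",
    "AugmentSeparators": ".",
}

print(classify(".", config))  # HyphenChars
print(classify("+", config))  # TangibleCharacters
print(classify("x", config))  # None
```

With this configuration, the `AugmentSeparators` entry for `.` never takes effect, because `HyphenChars` claims the character first.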

Normalization and transliteration. This process maps characters into a canonical form, so that a search for any one form of a term matches every other form of that term. It uses built-in single-language or cross-linguistic rules.

Stop word detection. This process identifies words that are deemed to play no useful part in retrieval and that can be ignored at both index and query time. Stop words are configured per language or cross-linguistically in sets of words called stop lists. Stop lists should contain words that have been tokenized and transliterated but not stemmed:

Good Stop Word	Bad Stop Word	Reason
dont	don't	Stop word detection occurs after tokenization.
cafe	café	Stop word detection occurs after transliteration.
site + sites	site	Stop word detection occurs before stemming.
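
The following Python sketch illustrates why the order matters for stop lists. It is a simplified model under stated assumptions, not IDOL code: the toy tokenizer just drops apostrophes and splits on whitespace, the toy transliteration strips accents with the standard `unicodedata` module, and all function names are hypothetical.

```python
import unicodedata


def tokenize(text):
    # Toy tokenizer (assumption): lowercase, drop apostrophes, split on spaces.
    return text.lower().replace("'", "").split()


def transliterate(token):
    # Toy transliteration (assumption): strip combining accents.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def remove_stop_words(text, stoplist):
    # Stop words are compared after tokenization and transliteration,
    # and before any stemming takes place.
    tokens = [transliterate(t) for t in tokenize(text)]
    return [t for t in tokens if t not in stoplist]


text = "Don't visit the café sites"

# Entries in processed form match and are removed.
print(remove_stop_words(text, {"dont", "cafe", "site", "sites"}))
# ['visit', 'the']

# Entries in raw or stemmed form never match anything.
print(remove_stop_words(text, {"don't", "café", "site"}))
# ['dont', 'visit', 'the', 'cafe', 'sites']
```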

Stemming. This step reduces words to their linguistic root, to enable searches to match all forms of a term. This can be done using per-language rules or using generic cross-language rules. New stem rules should contain words that have been tokenized and transliterated:

Good Stemming Rule	Bad Stemming Rule	Reason
formulae → formula	formulæ → formula	Stemming occurs after transliteration.
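
As a rough illustration, the sketch below looks up a stemming rule only after transliteration. The transliteration table (`æ` to `ae`) and the `stem` function are hypothetical stand-ins for IDOL's built-in processing, chosen to mirror the table above.

```python
# Hypothetical transliteration table: maps the ligature to plain letters.
TRANSLITERATION = str.maketrans({"æ": "ae"})


def stem(token, rules):
    # Transliterate first, then look up the rule on the transliterated form.
    normalized = token.translate(TRANSLITERATION)
    return rules.get(normalized, normalized)


# A rule written in transliterated form applies.
print(stem("formulæ", {"formulae": "formula"}))  # formula

# A rule written with the original character is never reached.
print(stem("formulæ", {"formulæ": "formula"}))  # formulae
```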

Decomposition. This step breaks up single compounded words into multiple parts, to enable searches to match subparts of the word. Decomposition rules are particularly useful in languages such as German and Hungarian. New decomposition rules should contain compound words that are tokenized, transliterated and stemmed:

Good Decomposition Rule	Bad Decomposition Rule	Reason
mousetrap → mouse trap	mousetrap → mouse trap + mousetraps → mouse traps	Decomposition occurs after stemming.
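
A minimal Python sketch of this ordering follows. The stem table and the rule table are hypothetical toy data, not IDOL configuration. Because the plural is stemmed before the rule lookup, a single rule for the stemmed compound covers both forms.

```python
# Hypothetical toy data for illustration only.
STEMS = {"mousetraps": "mousetrap"}
DECOMPOSITION_RULES = {"mousetrap": ["mouse", "trap"]}


def decompose(token):
    # Stemming runs first, so the rule lookup sees the stemmed form.
    stemmed = STEMS.get(token, token)
    return DECOMPOSITION_RULES.get(stemmed, [stemmed])


print(decompose("mousetrap"))   # ['mouse', 'trap']
print(decompose("mousetraps"))  # ['mouse', 'trap']
print(decompose("mouse"))       # ['mouse']
```

A separate rule for mousetraps would never be consulted in this model, which is why the table lists it as a bad rule.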

For Queries

When handling queries, IDOL Server performs the same sequence of operations as at index time, with the following differences:

Query Manipulation (QMS) including the QMS synonym functionality. QMS manipulates the original unprocessed (and therefore non-stemmed) query text. You should take this process into account when configuring QMS synonym rules.

Synonym Content Engines. Unlike QMS synonym searches, these are applied after all the language processing. Therefore, the synonym documents do not need to include non-stemmed text forms.

Decomposition. Decomposition occurs only at indexing time, not for queries.

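In the previous mousetrap example, searching for mousetrap returns only results about mousetraps, because the query term is not decomposed. Searching for mouse returns results about mice as well as mousetraps, because the compound was decomposed at index time.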
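
The asymmetry between index time and query time can be sketched with a toy inverted index in Python. This is an illustrative model only: the stem table, the rule table and the function names are hypothetical, and the sketch assumes that the index stores the whole compound as well as its parts.

```python
from collections import defaultdict

# Hypothetical toy data for illustration only.
STEMS = {"mice": "mouse", "mousetraps": "mousetrap"}
DECOMPOSITION_RULES = {"mousetrap": ["mouse", "trap"]}


def process(text):
    # Processing shared by indexing and querying: tokenize, then stem.
    return [STEMS.get(t, t) for t in text.lower().split()]


def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in process(text):
            index[term].add(doc_id)
            # Decomposition happens at index time only (assumption: the
            # compound and its parts are both stored).
            for part in DECOMPOSITION_RULES.get(term, []):
                index[part].add(doc_id)
    return index


def search(index, query):
    # Query terms are stemmed but not decomposed.
    results = set()
    for term in process(query):
        results |= index.get(term, set())
    return results


docs = {"traps": "humane mousetraps", "mice": "field mice"}
index = build_index(docs)

print(sorted(search(index, "mousetrap")))  # ['traps']
print(sorted(search(index, "mouse")))      # ['mice', 'traps']
```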
