After IDOL Server has processed the tokens to remove any that you do not want to index, it indexes the remaining tokens. During this process, it modifies the tokens to make it easier to return the correct documents. This involves the following processes:
Main Topic: Stemming
In many languages, a set of related words has a common root. For example, help, helping, helpful, and helped all stem from the common root help. You can reduce words to their common root without losing meaning.
The IDOL stemming algorithm reduces all words to their stem, and indexes the stem. This process allows you to search for a word, and return documents with related concepts that do not specifically include that word.
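The idea can be sketched in a few lines of Python. This is only an illustration: the suffix list and the names toy_stem, build_index, and search are assumptions made for this example, and the actual IDOL stemming algorithm is language-specific and considerably more sophisticated.

```python
# Toy suffix list, chosen only to cover the example words above.
# These are NOT the rules that IDOL Server uses.
SUFFIXES = ("ing", "ful", "ed")


def toy_stem(word):
    """Reduce a word to a crude stem by stripping one known suffix."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "red" survive.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def build_index(documents):
    """Map each stem to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, term):
    """Stem the query term the same way, then look it up in the index."""
    return index.get(toy_stem(term), set())


docs = {1: "she helped me", 2: "a helpful guide", 3: "nothing here"}
index = build_index(docs)
matches = search(index, "helping")
```

Because both the indexed tokens and the query term are reduced to the same stem, a search for helping finds documents 1 and 2 even though neither contains that exact word.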
Character Normalization
In some languages, there is more than one way to represent a character. For example:
The Roman alphabet has uppercase and lowercase forms of all letters.
The Japanese katakana script can have full-width or half-width characters.
The Chinese language has two scripts, usually known as Chinese Traditional and Chinese Simplified.
IDOL Server uses canonicalization to ensure that it treats all forms of a character equally. It automatically converts each character to an internationally recognized canonical form, so that retrieval matches all versions of that character.
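As a rough illustration of canonicalization, the following Python sketch folds case and width variants into a single form by using the standard unicodedata module. The function name normalize_token is an assumption for this example, and the canonical form that IDOL Server uses internally might differ. Unicode normalization does not map Chinese Traditional to Chinese Simplified; that conversion needs a separate mapping table, which is not shown here.

```python
import unicodedata


def normalize_token(token):
    """Fold case and compatibility variants into one comparable form."""
    # NFKC replaces compatibility variants (for example, half-width katakana)
    # with their standard equivalents; casefold() removes case distinctions.
    return unicodedata.normalize("NFKC", token).casefold()


same_case = normalize_token("HELP") == normalize_token("help")
same_width = normalize_token("ｶﾀｶﾅ") == normalize_token("カタカナ")
```

If the same function is applied to both indexed text and query text, a term matches whichever form the document used.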
Character normalization is controlled by the same configuration settings as Transliteration.
Transliteration
Transliteration is like Character Normalization in that it maps sets of characters to a standard form, so that a search for any one form matches documents that contain any of the other forms. An important example is the removal of accents from letters, so that a search for cafe matches documents containing café, and the reverse. Similarly, the German letter ß is transliterated to ss.
Transliteration schemes differ by language. For example, in German the letter ö transliterates to oe, whereas in Swedish it transliterates to o. Several languages that use non-Roman scripts can also be transliterated to Roman script. For example, in Russian, IDOL transliterates Владимир to Vladimir. For details of the transliteration schemes used, see Transliteration Tables.
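The following Python sketch shows how a language-specific table can be combined with generic accent removal. The rule tables, the language keys, and the function name transliterate are assumptions for illustration only; they cover just the examples above and are not the Transliteration Tables that IDOL Server uses. Script conversion such as Russian to Roman is not attempted.

```python
import unicodedata

# Hypothetical rule tables covering only the examples in the text.
GENERIC_RULES = {"ß": "ss"}
LANGUAGE_RULES = {
    "german": {"ö": "oe", "ß": "ss"},
}


def transliterate(text, language="generic"):
    """Apply language-specific replacements, then strip remaining accents."""
    rules = LANGUAGE_RULES.get(language, GENERIC_RULES)
    replaced = "".join(rules.get(ch, ch) for ch in text)
    # NFD splits an accented letter into a base letter plus combining marks,
    # which are then dropped.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = transliterate("café")
german = transliterate("schön", "german")
swedish = transliterate("schön", "swedish")
```

In this sketch the German table takes priority for ö, while any language without its own entry falls back to plain accent removal, which is why the same input produces different results for German and Swedish.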
When you have multiple languages in your server, Micro Focus recommends setting GenericTransliteration to True. See Cross-Lingual Search.
Transliteration affects only the characters used to represent a word. IDOL Server does not translate terms from one language to another.
Turn Transliteration Off
Several languages have unusual linguistic behavior, such that even if you set Transliteration to False for that language, some characters are still transliterated. To prevent all transliteration, set:
[LanguageTypes]
GenericTransliteration=True
and then set Transliteration to False in the individual language sections.