Cross-Lingual Search

By default, a search returns only documents in the same language as the search text. If you want a search to return documents in multiple languages, you must ensure that IDOL Server indexes text in a consistent way for all languages.

Before you index documents, you must configure IDOL Server to tokenize and process all languages in the same way.

The following table lists the concepts that you must make consistent for all languages, and the method that you can use to achieve this consistency for each one.

Concept	Method
Character definitions	Define the same separator and non-separator characters for all languages. See Also: `AugmentSeparators`, `DiminishSeparators`, `HyphenChars`. TIP: You might want to add the apostrophe (`'`) to either `AugmentSeparators` or `DiminishSeparators`, to ensure that IDOL Server treats it the same way for all languages. For example, by default the term `l'accord` is combined as `laccord` in English, while in French it is separated into the terms `l` and `accord`.
Stemming	Use the generic stemming algorithm, which incorporates stemming rules from all configured languages. See Also: `GenericStemming`, `Stemming`.
Stop lists	Use the international stop list. This stop list includes only words that are stop words in all languages where that word exists. It does not include words that are stop words in one language but useful terms in another. NOTE: You can still modify the international stop list for your requirements.
Transliteration	Use generic transliteration so that characters are modified consistently for all languages. See Also: `GenericTransliteration`.
Sentence breaking and N-gram tokenization	Use N-gram tokenization to allow consistent tokenization for all Asian languages. The IDOL Server configuration also allows you to use N-gram tokenization for only the Asian languages that otherwise require sentence breaking (Chinese, Japanese, Korean, and Thai). See Also: `NGram`, `NGramMultibyteOnly`, `NGramOrientalOnly`.
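
The apostrophe example in the Character definitions row can be illustrated with a small Python sketch. This is a toy model only, not IDOL Server's tokenizer: the `tokenize` function and its `separators` argument are hypothetical names introduced here, and the sketch assumes that a punctuation character that is not a separator is simply dropped, so the text on either side of it is joined.

```python
def tokenize(text, separators):
    """Toy tokenizer (illustrative only, not IDOL Server's implementation).

    Whitespace and any character in `separators` end the current term.
    Letters and digits are kept. Any other character is dropped, which
    joins the characters on either side of it into one term.
    """
    tokens = []
    current = []
    for ch in text.lower():
        if ch.isspace() or ch in separators:
            # A separator closes the term that is being built, if any.
            if current:
                tokens.append("".join(current))
                current = []
        elif ch.isalnum():
            current.append(ch)
        # Non-separator punctuation falls through and is discarded.
    if current:
        tokens.append("".join(current))
    return tokens


# Apostrophe is NOT a separator: the two parts are joined.
print(tokenize("l'accord", set()))

# Apostrophe IS a separator: the term is split in two.
print(tokenize("l'accord", {"'"}))
```

With an empty separator set the sketch produces the single term `laccord`; with the apostrophe in the separator set it produces `l` and `accord`. If different languages used different separator sets, the same text would be indexed as different terms, and a query processed in one language would not match a document indexed in the other. Applying one character definition to all languages removes that mismatch.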
