Cross-Lingual Search

By default, a search returns only documents in the same language as the search text. If you want a search to return documents in multiple languages, you must ensure that IDOL Server indexes text in a consistent way for all languages.

Before you index documents, you must configure IDOL Server to tokenize and process all languages in the same way.

The following table lists the concepts that you must make consistent for all languages, and the method that you can use to achieve this consistency for each one.

Concept Method
Character definitions

Define the same separator and non-separator characters for all languages.

See Also: AugmentSeparators, DiminishSeparators, HyphenChars.

TIP:

You might want to add the apostrophe (') to either AugmentSeparators or DiminishSeparators, to ensure that IDOL Server treats it the same way for all languages. For example, by default the term l'accord is combined into the single term laccord in English, while in French it is separated into the terms l and accord.

Stemming

Use the generic stemming algorithm, which incorporates stemming rules from all configured languages.

See Also: GenericStemming, Stemming.

Stop lists

Use the international stop list. This stop list includes only words that are stop words in every language where the word exists. It does not include words that are stop words in one language but useful words in another.

NOTE:

You can still modify the international stop list for your requirements.

Transliteration

Use generic transliteration so that characters are modified consistently for all languages.

See Also: GenericTransliteration.

Sentence breaking and N-Gram Tokenization

Use NGram tokenization to allow consistent tokenization for all Asian languages. The IDOL Server configuration also allows you to use NGram tokenization for only the Asian languages that otherwise require sentence breaking (Chinese, Japanese, Korean, and Thai).

See Also: NGram, NGramMultibyteOnly, NGramOrientalOnly.
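The apostrophe example in the character definitions row can be illustrated with a small sketch. This is a toy tokenizer written for illustration only, not IDOL Server's tokenization; the function name tokenize and the apostrophe_is_separator flag are hypothetical. It shows why the same text produces index terms that cannot match each other when two languages disagree on whether a character is a separator.

```python
def tokenize(text: str, apostrophe_is_separator: bool) -> list[str]:
    """Toy tokenizer (hypothetical, not IDOL's implementation).

    Splits on whitespace after handling the apostrophe in one of two ways.
    """
    if apostrophe_is_separator:
        # Separator: the apostrophe splits the term (l'accord -> l, accord).
        text = text.replace("'", " ")
    else:
        # Non-separator: the apostrophe is dropped and the parts are joined
        # (l'accord -> laccord).
        text = text.replace("'", "")
    return [term.lower() for term in text.split()]


split_terms = tokenize("l'accord", apostrophe_is_separator=True)
joined_terms = tokenize("l'accord", apostrophe_is_separator=False)

print(split_terms)   # ['l', 'accord']
print(joined_terms)  # ['laccord']

# No term is shared, so text processed one way cannot match text indexed the other way.
print(set(split_terms) & set(joined_terms))  # set()
```

Using the same separator rule for every language removes this mismatch, which is the purpose of defining the characters consistently.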

NOTE:

Even if your data contains documents in multiple languages, it is not always desirable to use these cross-lingual settings. For example, it is impossible to obtain the same precision of stemming with the GenericStemming algorithm as with the standard, language-specific algorithms. It might even be preferable to index all documents in the single most common language, to ensure that documents in that language are handled optimally.

AnyLanguage Queries

All textual queries must have the language and encoding defined as part of the query. IDOL Server uses the encoding to interpret the text, and the language to apply configured language settings such as stemming, character handling and transliteration. You specify the two values in the Language Type for the query.

By default, queries return only documents in the same language as the query. You can override this behavior by setting the AnyLanguage action parameter. However, because the query text has already been processed using the settings of one particular language, there is no guarantee that the query matches documents in other languages that contain a particular query term.

For example, in an English query for Reading, the term is stemmed to READ. The query matches all English documents that contain this term. The same query against non-English documents matches only those documents in which a term has also been stemmed to READ, which might not include documents that originally contained the word Reading.
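The Reading example can be sketched as follows. This is a minimal illustration under stated assumptions, not IDOL Server's behavior: the host and port (localhost:9000), the language type name englishUTF8, and the value true for AnyLanguage are assumed for illustration, and the stemmers are deliberately simplistic stand-ins for real language-specific algorithms. The first part builds a query string that sets AnyLanguage; the second part simulates why the stemmed query term can still miss a non-English document.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, text: str, language_type: str, any_language: bool) -> str:
    """Build a Query action URL. Host, port, and parameter values are assumptions."""
    params = {
        "Text": text,
        "LanguageType": language_type,
        "AnyLanguage": "true" if any_language else "false",
    }
    return f"{base_url}/action=Query&{urlencode(params)}"


url = build_query_url("http://localhost:9000", "Reading", "englishUTF8", True)
print(url)


# Toy stemmers (hypothetical): the English one strips a trailing ING,
# the other leaves terms unchanged.
def stem_english(term: str) -> str:
    term = term.upper()
    return term[:-3] if term.endswith("ING") and len(term) > 5 else term


def stem_unchanged(term: str) -> str:
    return term.upper()


STEMMERS = {"english": stem_english, "german": stem_unchanged}

# Each document is indexed with the stemmer for its own language.
docs = {
    "en-1": ("english", "Reading improves vocabulary"),
    "de-1": ("german", "Der Zug nach Reading"),
}
index = {
    doc_id: {STEMMERS[lang](term) for term in text.split()}
    for doc_id, (lang, text) in docs.items()
}

# The query is processed once, as English, before it is matched anywhere.
query_term = stem_english("Reading")
matches = sorted(doc_id for doc_id, terms in index.items() if query_term in terms)

print(query_term)  # READ
print(matches)     # ['en-1']
```

In this sketch the second document contains the word Reading, but it was indexed as READING, so the English-stemmed term READ does not match it even though the search covers all languages.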

 

