By default, a search returns only documents in the same language as the search text. If you want a search to return documents in multiple languages, you must ensure that IDOL Server indexes text in a consistent way for all languages.
Before you index documents, you must configure IDOL Server to tokenize and process all languages in the same way.
The following table lists the concepts that you must make consistent for all languages, and the method that you can use to achieve consistency for each one.
Define the same separator and non-separator characters for all languages.
You might want to add the apostrophe (
Use the generic stemming algorithm, which incorporates stemming rules from all configured languages.
Use the international stop list. This stop list includes only words that are stop words in all languages where that word exists. It does not include words that are a stop word in one language but that are a useful word in another.
You can still modify the international stop list for your requirements.
Use generic transliteration so that characters are modified consistently for all languages.
Sentence breaking and N-Gram Tokenization
Use NGram tokenization to allow consistent tokenization for all Asian languages. The IDOL Server configuration also allows you to use NGram tokenization for only the Asian languages that otherwise require sentence breaking (Chinese, Japanese, Korean, and Thai).
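The rule described above for the international stop list (a word is included only if it is a stop word in every language where it exists) can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not how IDOL Server builds or stores its stop list: the function name, the vocabularies, and the stop lists are all hypothetical sample data.

```python
def international_stop_list(vocabularies, stop_lists):
    """Return words that are stop words in every language that contains them.

    vocabularies: dict mapping language name -> set of all words in that language
    stop_lists:   dict mapping language name -> set of stop words in that language
    (Both structures are hypothetical, for illustration only.)
    """
    candidates = set().union(*stop_lists.values())
    result = set()
    for word in candidates:
        # Languages in which this word exists at all.
        languages = [lang for lang, vocab in vocabularies.items() if word in vocab]
        # Keep the word only if it is a stop word in each of those languages.
        if languages and all(word in stop_lists[lang] for lang in languages):
            result.add(word)
    return result


# Toy data: "die" is a German article but a meaningful English word.
vocabularies = {
    "english": {"the", "die", "reading"},
    "german": {"die", "der", "lesen"},
}
stop_lists = {
    "english": {"the"},
    "german": {"die", "der"},
}

international = sorted(international_stop_list(vocabularies, stop_lists))
print(international)
```

In this sample, "die" is left out because it is a stop word in German but a useful word in English, while "the" and "der" are kept because each exists in only one language and is a stop word there.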
Even if your data contains documents in multiple languages, it is not always desirable to use these cross-lingual settings. For example, it is impossible to obtain the same precision of stemming with the GenericStemming algorithm as with the standard, language-specific algorithms. It might even be preferable to index all documents in the single most common language, to ensure that documents in that language are handled optimally.
All textual queries must have the language and encoding defined as part of the query. IDOL Server uses the encoding to interpret the text, and the language to apply configured language settings such as stemming, character handling and transliteration. You specify the two values in the Language Type for the query.
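As a rough sketch of what this looks like in practice, the following Python function assembles a query URL that carries the language type, and optionally the AnyLanguage parameter discussed in the next paragraph. Several details are assumptions rather than facts taken from this topic: the host and port, the URL layout, the `Text` and `LanguageType` parameter spellings, and the `englishUTF8` value, which stands in for whatever language type your configuration defines. The function only builds a string and does not contact a server.

```python
from urllib.parse import urlencode


def build_query_url(text, language_type, any_language=False,
                    base_url="http://localhost:9000/"):
    """Build a query URL that states the language type of the query text.

    base_url and the parameter names are assumptions for illustration;
    check them against your own IDOL Server setup.
    """
    params = {
        "action": "Query",
        "Text": text,
        # One value that names both the language and the encoding.
        "LanguageType": language_type,
    }
    if any_language:
        # Ask for matches in documents of any language, not only the query language.
        params["AnyLanguage"] = "true"
    return base_url + "?" + urlencode(params)


url = build_query_url("Reading", "englishUTF8", any_language=True)
print(url)
```

Keeping the language type as an explicit argument makes it hard to send a query without one, which matches the requirement that every textual query defines its language and encoding.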
By default, queries return only documents in the same language as the query. You can override this behavior by setting the AnyLanguage action parameter. However, because the query text has already been processed using the settings for one particular language, there is no guarantee that this method matches documents in other languages that contain a particular query term.
In an English query for Reading, the term is stemmed to READ. The query matches all English documents that contain this term. The same query against non-English documents matches only those documents in which a term has also been stemmed to READ, which might not include documents that originally contained Reading.
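The Reading example can be mimicked with two toy stemmers in Python. This is a minimal sketch, not IDOL Server's stemming: `stem_english` strips a couple of suffixes, and `stem_other` represents a hypothetical non-English language whose rules leave the word unchanged. All names and sample documents are invented for illustration.

```python
def stem_english(term):
    """Toy English stemmer: uppercase, then strip a trailing ING or S."""
    term = term.upper()
    for suffix in ("ING", "S"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term


def stem_other(term):
    """Toy stemmer for a hypothetical language with no ING rule."""
    return term.upper()


def index_document(text, stemmer):
    """Return the set of stemmed terms that would be indexed for a document."""
    return {stemmer(word) for word in text.split()}


# Each document is indexed with the stemmer for its own language.
english_doc = index_document("Reading books", stem_english)
other_doc = index_document("Club de Reading", stem_other)

# The query text is processed once, with the English settings.
query_term = stem_english("Reading")

print(query_term, query_term in english_doc, query_term in other_doc)
```

The English document matches because both sides were reduced to the same stem. The other document still holds the unstemmed term, so the already-stemmed query term does not find it, even though the original word appears in the text.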