Stop Lists

Stop lists are sets of words (called stop words) that IDOL ignores, both when it processes index fields at index time and when it processes queries. The primary reason to use stop lists is to reduce index size and make queries faster by ignoring words that add little or no meaning to the retrieval process.

In most languages, the 100 most common terms (in English, for example, the, a, of, and in) make up around 50% of all text. In many cases, you can halve the storage space by setting these words as stop words.
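To check how much of your own text the most common terms account for, you can measure their coverage directly. The following is a minimal sketch in Python, not part of IDOL: the function name top_term_coverage, the simple tokenizer, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re

def top_term_coverage(text: str, n: int) -> float:
    """Return the fraction of all tokens accounted for by the n most common terms."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

sample = "the cat sat on the mat and the dog sat in the sun"
# "the" occurs 4 times and "sat" twice, out of 13 tokens.
print(f"{top_term_coverage(sample, 2):.2f}")  # 0.46
```

On a realistic corpus, running this with n=100 gives a rough idea of how much index space a stop list of that size might save.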

Stop lists are set per-language, and standard stop lists are provided with IDOL for all major languages.

Modify Stop Lists

You can, and often should, edit stop lists to suit the situation. The primary considerations in doing so are:

NOTE: Whenever you change the stop list, you must reindex the data for the change to take effect.

An exception is when you only want to ignore certain terms in queries. In this case, you can simply change the stop list without reindexing, although this does not reduce the index size.
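The difference between the two cases can be illustrated with a toy inverted index. This is a conceptual sketch in Python only and does not reflect how IDOL stores or processes terms; build_index, search, and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs, stop_words):
    """Map each non-stop term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return index

def search(index, query, stop_words):
    """Return IDs of documents that contain every non-stop query term."""
    terms = [t for t in query.lower().split() if t not in stop_words]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "the quick fox", 2: "a quick reply", 3: "the reply"}
index = build_index(docs, stop_words={"the", "a"})

# Add "quick" to the stop list without reindexing: queries now ignore it,
# but its postings are still stored, so the index is no smaller.
new_stop_words = {"the", "a", "quick"}
print(search(index, "quick reply", new_stop_words))  # {2, 3}
print("quick" in index)  # True

# Rebuilding the index applies the new stop list at index time as well.
reindexed = build_index(docs, new_stop_words)
print("quick" in reindexed)  # False
```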

Allow Retrieval of Stop Words

In some instances, you might need to match stop words in some queries but not in others. The most common use case is to allow stop words to match as part of a phrase, but not otherwise. See StopWordIndex.

Find Candidates for Stop Words

If you have already indexed some data, it is simple to determine the most common terms. Setting any of these terms as a stop word could reduce index size and help to keep queries fast.

IDOL Admin provides a simple user interface for exploring terms by number of occurrences. In general, the terms with the highest number of occurrences are good candidates for stop words.

You can also send the following action directly to IDOL Server to return the 100 most common terms, sorted by total occurrence count:

action=TermGetAll&MaxTerms=100&Type=TrueOccs
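Actions such as this one are typically sent to the server as an HTTP request. The following Python sketch builds the request URL; the host name localhost and port 9000 are placeholders that you must replace with the values for your own IDOL Server, and the helper term_get_all_url is hypothetical.

```python
from urllib.parse import urlencode

def term_get_all_url(host: str, port: int, max_terms: int = 100) -> str:
    """Build the URL for a TermGetAll action sorted by total occurrence count."""
    params = urlencode({"MaxTerms": max_terms, "Type": "TrueOccs"})
    return f"http://{host}:{port}/action=TermGetAll&{params}"

# Placeholder host and port (assumptions); adjust for your installation.
print(term_get_all_url("localhost", 9000))
# http://localhost:9000/action=TermGetAll&MaxTerms=100&Type=TrueOccs
```

You can paste the resulting URL into a browser or fetch it with any HTTP client to inspect the terms that the server returns.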

Cross-Lingual Stop Lists

For cross-lingual search, you must use a single stop list that contains the most common terms for the relevant languages. See Cross-Lingual Search.

 
