Stop lists are sets of words (called stop words) that are ignored both when fields are processed at index time and, subsequently, when queries are processed. The primary reason to use stop lists is to reduce index size and make queries faster by ignoring words that add little or no meaning to the retrieval process.
In most languages, the 100 most common terms (in English, for example, the, a, of, and in) make up around 50% of all text. In many cases, you can halve the storage space if you set these words as stop words.
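To check how much of your own text the most common terms account for, you can measure it directly. The following is a minimal Python sketch, not part of IDOL: the function name top_term_coverage and the simple regular-expression tokenizer are assumptions made for illustration, and IDOL's own tokenization will differ.

```python
import re
from collections import Counter


def top_term_coverage(text: str, n: int = 100) -> float:
    """Return the fraction of all word tokens covered by the n most common terms.

    Illustrative only: lowercases the text and splits on runs of letters and
    apostrophes, which is much simpler than a real indexing tokenizer.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the occurrence counts of the n most frequent terms.
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(f"Top 2 terms cover {top_term_coverage(sample, n=2):.0%} of tokens")
```

Running this over a representative sample of your documents gives a rough estimate of the index space that a stop list of that size could save.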
Stop lists are set per-language, and standard stop lists are provided with IDOL for all major languages.
You can, and often should, edit stop lists to suit the situation. The primary considerations in doing so are:
Do any of the words in the stop list have relevance that would be useful in querying?
You might remove a word from the standard stop list if it is also an acronym used in your organization. For example:
If you need to search for IT (Information Technology), you must remove the word it from the stop list before indexing.
However, if you remove it from the stop list, a search for it might be comparatively slow and require a sizeable amount of system memory, as it is still an extremely common term.
Are there any additional words that occur commonly in your data that could also be added?
The following list shows some examples of terms that you might want to add as stop words, but which are not supplied in the default stop word lists:
Social Media. Acronyms, common abbreviations, and URL components might occur with higher frequency in social media than in other data sources. For example, you might want to add LOL, RT, http, www, and so on.
E-mails. The company name and e-mail address domain are very likely to occur in e-mail messages, and you might want to add these elements as stop words. If you want to be able to search for e-mail addresses, you can use Eduction to extract the e-mail addresses from documents and add them as a document field, which you can then search for with a FieldText search.
Wikipedia. For a Wikipedia data set, terms such as wikipedia and stub are likely to occur at high frequency, and add little value to content searches.
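The two considerations above (removing words you need to search for, and adding words that are common in your data) can be sketched in a few lines of Python. This is a toy model, not IDOL's implementation: the function names, the five-word standard list, and the whitespace tokenizer are hypothetical, and matching is assumed to be case-insensitive, as the IT example implies.

```python
def customize_stop_list(base, keep_searchable=(), extra=()):
    """Return a lowercased stop list with some words removed and others added.

    base            -- the starting stop list (for example, a standard list)
    keep_searchable -- words to remove so that they remain searchable
    extra           -- additional data-specific words to ignore
    """
    stop_list = {word.lower() for word in base}
    stop_list -= {word.lower() for word in keep_searchable}
    stop_list |= {word.lower() for word in extra}
    return stop_list


def strip_stop_words(text, stop_list):
    """Drop stop words from text, using a naive whitespace tokenizer."""
    return [term for term in text.lower().split() if term not in stop_list]


if __name__ == "__main__":
    standard = {"the", "a", "of", "in", "it"}  # hypothetical extract of a standard list
    custom = customize_stop_list(
        standard,
        keep_searchable=["IT"],                # acronym used in the organization
        extra=["LOL", "RT", "http", "www"],    # common in social media data
    )
    print(strip_stop_words("the IT team", standard))
    print(strip_stop_words("RT the IT team", custom))
```

With the standard list, the acronym IT is discarded along with the pronoun. With the customized list it survives, while the social media terms are dropped.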
Whenever you change the stop list, you must reindex the data for the change to take effect.
An exception to this is if you just want to ignore certain terms as part of queries. In this case you can simply change the stop list, although this does not have the benefit of reducing the index size.
In some instances it might be necessary to match stop words in certain queries but not all. The most common use case for this is to allow stop words to match as part of a phrase, but not otherwise. See StopWordIndex.
If you have already indexed some data then it is simple to determine the most common terms. Setting any of these as a stop word could reduce index size and help prevent slower queries.
The IDOL Admin interface provides a simple administrative user interface for exploring terms by the number of occurrences. In general, the terms with the highest number of occurrences are good candidates for stop words.
You can also send the following action directly to IDOL Server to return the 100 most common terms, sorted by total occurrence count:
action=TermGetAll&MaxTerms=100&Type=TrueOccs
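If you want to send this action from a script, the following Python sketch shows one way to build and send the request over HTTP. The host name and port are placeholders that you must replace with the values for your own IDOL Server, the function names are hypothetical, and the response is returned as raw text because its structure is not described here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def term_get_all_url(host: str, port: int, max_terms: int = 100) -> str:
    """Build the TermGetAll request URL for the given server.

    host and port are placeholders for your own IDOL Server ACI port.
    """
    params = {"action": "TermGetAll", "MaxTerms": max_terms, "Type": "TrueOccs"}
    return f"http://{host}:{port}/{urlencode(params)}"


def fetch_common_terms(host: str, port: int, max_terms: int = 100,
                       timeout: float = 10.0) -> str:
    """Send the action and return the raw response body as text."""
    with urlopen(term_get_all_url(host, port, max_terms), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the URL only; call fetch_common_terms() against a running server.
    print(term_get_all_url("localhost", 9000))
```

Keeping the URL construction separate from the network call makes it easy to check the request before you send it.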
For cross-lingual search, you must use a single stop list that contains the most common terms for the relevant languages. See Cross-Lingual Search.