Stemming is the process of reducing a word to its linguistic root. The purpose of this reduction is to find a base term, so that a search can be expanded to include all forms of the term. For example, you generally want a search for the word elections to match a document that contains the word election. As long as the two terms stem to the same form, a search for either term matches documents that contain the other.
Stemming is defined for individual languages because the stemming rules depend on how the language expands the root term. You can turn stemming on and off independently for different languages. For a list of the languages that have a default stemming algorithm, see Stemming Languages.
During indexing, IDOL Server stems each term, and stores the stem as well as the unstemmed term. During querying, IDOL Server stems the query term, and matches it against the stored stems in the index.
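The following Python sketch illustrates the index-time and query-time flow described above. It is not IDOL Server code: the `toy_stem` function, the `ToyIndex` class, and its suffix rules are hypothetical stand-ins chosen only to make the example runnable.

```python
from collections import defaultdict


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    term = term.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


class ToyIndex:
    """Stores both the unstemmed term and its stem for every indexed term."""

    def __init__(self):
        self.by_term = defaultdict(set)  # unstemmed term -> document IDs
        self.by_stem = defaultdict(set)  # stem -> document IDs

    def add(self, doc_id: str, text: str) -> None:
        for term in text.lower().split():
            self.by_term[term].add(doc_id)
            self.by_stem[toy_stem(term)].add(doc_id)

    def search(self, query_term: str) -> set:
        # Stem the query term, then match it against the stored stems.
        return set(self.by_stem.get(toy_stem(query_term), set()))


index = ToyIndex()
index.add("doc1", "the election results")
index.add("doc2", "local elections held")
matches = index.search("elections")  # both documents share the stem "election"
```

Because the same stemming function runs at index time and at query time, the two sides always agree on the stored stem, whatever that stem looks like.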
IDOL Server default stemming rules apply only to alphabetical terms.
Not all implementations require stemming. For example, legal search often requires that the search returns only exact matches. In this case, you can turn stemming off when you set up IDOL Server.
Alternatively, if you enable Advanced Search, you can search for stemmed or unstemmed (exact) terms. In advanced search, you use quotation marks (") to search for an exact term. For example, with advanced search:
A search for election matches both election and elections.
A search for "elections" matches only the exact form (elections).
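The difference between quoted and unquoted terms can be sketched in Python as follows. This is an illustration of the behavior only, under the assumption of a trivial stemmer; `strip_plural` and `term_matches` are hypothetical names, not part of IDOL Server.

```python
def strip_plural(term: str) -> str:
    """Hypothetical one-rule stemmer: removes a single trailing "s"."""
    term = term.lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def term_matches(query: str, document_terms: list) -> bool:
    """Return True if the query term matches any term in the document."""
    terms = [t.lower() for t in document_terms]
    query = query.strip()
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        # Quoted term: compare against the unstemmed terms only.
        return query[1:-1].lower() in terms
    # Unquoted term: compare stems.
    query_stem = strip_plural(query)
    return any(strip_plural(t) == query_stem for t in terms)


unquoted_hits = [
    term_matches("election", ["election"]),
    term_matches("election", ["elections"]),
]
quoted_hits = [
    term_matches('"elections"', ["election"]),
    term_matches('"elections"', ["elections"]),
]
```

Exact matching is only possible because the unstemmed term is kept alongside the stem.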
In some cases, you might find that the stem of a word does not follow obviously from the word. For example, if you found that the word computer stems to the nonexistent term comput, you might think there is a problem with the stemming algorithm. No stemming algorithm is perfect. However, this type of stemming does not generally cause any serious problems.
Practically, there is no reason for the stem to bear any linguistic resemblance to the original term. The aim of the stemming algorithm is to ensure that all terms with the same linguistic root stem to the same value, and that all terms with different linguistic roots stem to different values.
For example, help, helping, helped, helps, helpful, and helpfully should all stem to the same root. The stemming algorithms use the linguistically related root where possible. In this case, ideally, it should be help. However, if they all have the stem xyzzy, the result is the same, as long as no other term uses this stem.
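To make this point concrete, the following Python sketch compares how different stem assignments group a small vocabulary. The stem tables are hypothetical and written by hand for illustration; they are not the output of any real stemming algorithm.

```python
from collections import defaultdict

HELP_FORMS = ["help", "helping", "helped", "helps", "helpful", "helpfully"]
COMPUTE_FORMS = ["computer", "computing"]

# Stems that resemble the original words.
readable_stems = {
    **{t: "help" for t in HELP_FORMS},
    **{t: "comput" for t in COMPUTE_FORMS},
}

# Arbitrary stems that bear no resemblance to the original words.
opaque_stems = {
    **{t: "xyzzy" for t in HELP_FORMS},
    **{t: "plugh" for t in COMPUTE_FORMS},
}

# A faulty assignment: unrelated words share one stem.
colliding_stems = {t: "xyzzy" for t in HELP_FORMS + COMPUTE_FORMS}


def equivalence_classes(stem_table: dict) -> set:
    """Group terms by stem and return the groups, ignoring the stem values."""
    groups = defaultdict(set)
    for term, stem in stem_table.items():
        groups[stem].add(term)
    return {frozenset(group) for group in groups.values()}


same_grouping = equivalence_classes(readable_stems) == equivalence_classes(opaque_stems)
collision_differs = equivalence_classes(colliding_stems) != equivalence_classes(readable_stems)
```

Search behavior depends only on which terms end up in the same group, so the readable and opaque tables are interchangeable, while the colliding table would make a search for computer match documents about help.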
Occasionally you might find words that do not stem in the way you want. For example, you might want the word mice to stem to mouse. In this case, you can easily create a list of user-defined stemming rules, by using a stemming file. You can also use this method to create stemming rules for terms that are not normally stemmed, including alphanumeric terms.
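Conceptually, user-defined rules act as overrides that are consulted before the default algorithm. The Python sketch below shows that idea only. The `CUSTOM_STEMS` dictionary is an assumption made for illustration and does not reflect the format of an IDOL Server stemming file, and `default_stem` is a hypothetical stand-in for a default algorithm that handles alphabetical terms only.

```python
# Hypothetical user-defined rules: term -> stem.
CUSTOM_STEMS = {
    "mice": "mouse",  # irregular plural the default rules miss
    "mp3s": "mp3",    # alphanumeric term the default rules skip
}


def default_stem(term: str) -> str:
    """Hypothetical default stemmer: alphabetical terms only, strips one "s"."""
    term = term.lower()
    if not term.isalpha():
        return term  # alphanumeric terms are left unstemmed
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def stem(term: str) -> str:
    """Apply a user-defined rule if one exists, else the default stemmer."""
    key = term.lower()
    return CUSTOM_STEMS.get(key, default_stem(key))


mice_and_mouse_agree = stem("mice") == stem("mouse")
```

With the override in place, mice and mouse share a stem, so a search for either term matches documents that contain the other.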
Alternatively, if you have an existing stemming algorithm that you need to use, then you can use a sentence breaking library.
If you change your stemming settings, you must reindex all your data for the changes to take effect.
Stemming considerations are important when dealing with cross-language search.