Troubleshoot Categorization

In most cases, the default settings for categorization give very good results. However, for some data sets you might be able to improve your results by considering the following factors.

Training

Building Categories

The build settings available for categories can strongly affect the final list of terms and weights (TNWs) produced for a category. You can set the following options by using the CategorySetDetails action with the Fields and Values parameters.
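As a rough illustration of the mechanism, the following Python sketch assembles a CategorySetDetails request from a set of option names and values. Only the action name and the Fields and Values parameters come from this page. The host, the port, the Category parameter used to identify the category, and the comma-separated pairing of Fields with Values are assumptions made for the example, so check them against the action reference for your version before relying on them.

```python
from urllib.parse import urlencode


def build_set_details_url(host, port, category, options):
    """Build a CategorySetDetails request URL (illustrative sketch only).

    `options` maps option names (for example "MaxTerms") to their values.
    The `Category` parameter name and the comma-separated Fields/Values
    layout are assumptions, not confirmed by this page.
    """
    if not options:
        raise ValueError("at least one option is required")

    # Keep names and values in the same order so they pair up positionally.
    fields = ",".join(options.keys())
    values = ",".join(str(v) for v in options.values())

    query = urlencode(
        {
            "action": "CategorySetDetails",
            "Category": category,  # hypothetical category identifier
            "Fields": fields,
            "Values": values,
        },
        safe=",",  # leave the list separators unescaped
    )
    return f"http://{host}:{port}/{query}"


# Hypothetical host, port, and category ID.
print(build_set_details_url("localhost", 9020, 42,
                            {"MaxTerms": 200, "MinTermLength": 2}))
```

Building the Fields and Values strings from a single mapping keeps each option next to its value, which avoids the two lists drifting out of step when several options are changed at once.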

Option Description
WeightingAlgorithm

The algorithm to use for weighting. This setting affects how IDOL Server derives the list of terms, according to the properties of the terms.

For example:

For option 3, the proper names algorithm, IDOL discards any terms that are not proper names.

For option 1, language categorization, IDOL also normalizes the list of weights.

In most instances, the default algorithm is best, and performs well for the common case of data sets in which there are both large and small training categories.

MinTermLength

The minimum length that a stemmed term must have to be included in the TNW list. The default value is 3, which is sufficient in most cases. However, if your data set includes many two-letter acronyms or words, you might want to reduce this value.

MaxTerms

The maximum number of terms that a category can have in the final TNW list. If more terms are generated, only the highest weighted terms are used.

  • For most data sets (with more than 100 categories, and more than 20 documents per category), a value of 150-200 is best.

  • For small data sets (a low number of well-trained categories), a value of 50-150 might be best, but increasing the value to 200 is generally fine.

MinTermWeight

The minimum weight that a term must have for IDOL Server to include it in the category’s final TNW list. In general, you should use the MaxTerms option to determine the point at which terms are discarded, so in most cases you should leave this value as zero. However, if a few categories have an unusually large number of important terms, raising this value might help to improve your results. For this situation, typical values are between 400 and 1,000.
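To see how MinTermLength, MinTermWeight, and MaxTerms interact, the following Python sketch applies the three limits to a small list of weighted terms. It is a conceptual illustration only and not IDOL Server's implementation: the function name, the sample terms and weights, and the order in which the limits are applied are all assumptions made for the example.

```python
def select_terms(weighted_terms, max_terms, min_term_weight=0, min_term_length=3):
    """Illustrative filter over (term, weight) pairs.

    Drops terms that are too short or too lightly weighted, then keeps
    only the `max_terms` highest-weighted survivors. The ordering of
    these steps is an assumption, not documented IDOL behavior.
    """
    kept = [
        (term, weight)
        for term, weight in weighted_terms
        if len(term) >= min_term_length and weight >= min_term_weight
    ]
    # Highest weights first, so truncation discards the weakest terms.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_terms]


# Hypothetical stemmed terms and weights.
terms = [("bank", 900), ("uk", 1200), ("loan", 450), ("rate", 300)]

# Default minimum length of 3 removes "uk"; MaxTerms then keeps the top two.
print(select_terms(terms, max_terms=2))

# Allowing two-letter terms and setting a weight floor of 400 removes "rate".
print(select_terms(terms, max_terms=10, min_term_weight=400, min_term_length=2))
```

With the weight floor left at zero, the cutoff is controlled entirely by the term limit, which mirrors the recommendation to rely on MaxTerms in most cases.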
