Troubleshoot Categorization

Training

Ensure that you have good-quality training documents. IDOL Server trains on the content of the index fields of the training documents, so pay particular attention to the content of those fields. For example, if the index fields contain a lot of useless metadata from Web pages, you are unlikely to get meaningful results. Each document should contain at least 50 words of text, and the more words the better. If categorization results are disappointing, check the quality of the training data first.
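The 50-word guideline can be checked with a short script before training. The following is a minimal sketch, assuming the index-field text of each training document is available as a plain string keyed by a document reference. The function name and the whitespace-based word count are illustrative assumptions, not part of IDOL Server.

```python
MIN_WORDS = 50  # guideline from the text above: at least 50 words per document


def find_short_documents(documents, min_words=MIN_WORDS):
    """Return the references of documents with fewer than min_words words.

    `documents` maps a document reference to its index-field text. Words are
    counted by splitting on whitespace, which only approximates how a search
    engine tokenizes text.
    """
    return sorted(
        ref for ref, text in documents.items()
        if len(text.split()) < min_words
    )


docs = {
    "doc-a": "word " * 60,          # 60 words: acceptable
    "doc-b": "too short to train",  # 4 words: flagged
}
print(find_short_documents(docs))
```

Documents that the check flags are candidates for replacement or for removal from the training set.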

Training documents should not be too large. If the training data consists of only a few very large documents, much of the content is wasted and the results might be misleading, because IDOL Server assigns greater importance to occurrences of a word in different documents than to multiple occurrences in the same document.
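The difference between the two kinds of occurrence can be made concrete with a toy count. This sketch does not reproduce IDOL Server's weighting, which this section does not describe. It only contrasts the total number of occurrences of a term with the number of documents that contain it, using made-up sample text.

```python
from collections import Counter


def occurrence_stats(documents):
    """Return two Counters: total occurrences and document counts per term."""
    total = Counter()
    spread = Counter()
    for text in documents:
        terms = text.lower().split()
        total.update(terms)        # every occurrence counts
        spread.update(set(terms))  # each document counts once per term
    return total, spread


docs = [
    "merger merger merger merger merger",  # one document repeating a term
    "football match report",
    "football transfer news",
    "football league table",
]
total, spread = occurrence_stats(docs)
print(total["merger"], spread["merger"])      # many occurrences, one document
print(total["football"], spread["football"])  # fewer occurrences, three documents
```

In this sample, "merger" occurs more often overall but in a single document, whereas "football" is spread across three documents. That spread is the kind of evidence the paragraph above says is given greater importance.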

Training data must be typical of the type of documents that you want to categorize. For example, if you want to categorize Web pages, your training data should consist of similar Web pages.

The best categories are clearly defined and distinct from other categories. Consider merging categories that have only a small number of training documents (fewer than 20), because these might be too fine-grained.
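Categories that fall below the 20-document guideline can be listed with a few lines of code. This is a minimal sketch, assuming you already have a count of training documents for each category. The function name and the sample category names are hypothetical.

```python
MIN_TRAINING_DOCS = 20  # guideline from the text above


def merge_candidates(training_counts, minimum=MIN_TRAINING_DOCS):
    """Return the names of categories with fewer than `minimum` documents."""
    return sorted(
        name for name, count in training_counts.items()
        if count < minimum
    )


counts = {"Football": 140, "Rugby League": 8, "Rugby Union": 11, "Tennis": 35}
print(merge_candidates(counts))
```

Here the two rugby categories are flagged, which suggests that a single merged category might categorize more reliably.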

Ensure that your training data has been correctly categorized. Consider the entire content of the document, not just the title or the first few lines.

Generally, the more categories you have, the more training data you need to distinguish between them. The categories that are the least distinct from others require more training data than categories with no overlap with others.

Avoid using a list of words as training. IDOL Server is able to calculate the most appropriate concepts from training documents, and does not rely on a few human-chosen terms.

Categorization depends on the contents of the whole index, and not just the training documents. If you have two IDOL Server indexes that contain different documents, they will categorize differently, even if you use the same training documents.

The weights that training gives to terms depend on how often those terms occur in the whole index, not just in the training documents. For categorization, you obtain the best results if the index contains the complete set of training documents and nothing else.
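The effect of the surrounding index on term weights can be illustrated with a generic inverse document frequency (IDF) calculation. IDOL Server's actual weighting formula is not given in this section, so the smoothed IDF below is a stand-in chosen for illustration, and the two sample indexes are made up.

```python
import math


def inverse_document_frequency(term, index_documents):
    """Generic smoothed IDF: terms that are rarer in the index score higher."""
    containing = sum(
        1 for text in index_documents if term in text.lower().split()
    )
    return math.log((1 + len(index_documents)) / (1 + containing)) + 1


sports_index = ["football results", "football fixtures", "football transfers"]
mixed_index = ["football results", "interest rates", "election polls"]

# The same term scores differently depending on what else the index contains.
print(inverse_document_frequency("football", sports_index))
print(inverse_document_frequency("football", mixed_index))
```

The term "football" is common in the first index and rare in the second, so it scores lower in the first. The same reasoning explains why identical training documents can produce different categories on different indexes.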

Avoid using URLs to train your categories, because the linked file is likely to contain extra content, such as adverts, which are not useful. It is best to use the documents in the data index, which generally have been scrubbed of extraneous content and are in a useful format.

Check that IDOL Server can access your training documents; some might require security details.

Use the CategorySetAllowedTerms and CategorySetDisallowedTerms actions. These actions allow you to define what terms can and cannot be used to derive the TNWs (terms and weights) for a category from its training. With CategorySetAllowedTerms, only the terms in the allowed list can be used in TNWs. With CategorySetDisallowedTerms, no term in the disallowed list can be used in the TNWs. With these actions, you can allow or disallow terms that are specific to your industry. You should only use these actions to alter terms as a last resort; because IDOL Server is examining a large quantity of document data, certain terms might be more relevant than is apparent.
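The semantics of the two lists can be sketched as a simple filter over candidate terms. This example does not call IDOL Server and does not show the parameters of either action. The function, the sample terms, and the weights are hypothetical, and applying both lists in one call is a convenience of the sketch rather than documented behavior.

```python
def apply_term_lists(candidate_terms, allowed=None, disallowed=None):
    """Filter (term, weight) pairs the way the two lists are described.

    allowed:    if given, only terms in this set are kept.
    disallowed: if given, terms in this set are dropped.
    """
    result = []
    for term, weight in candidate_terms:
        if allowed is not None and term not in allowed:
            continue
        if disallowed is not None and term in disallowed:
            continue
        result.append((term, weight))
    return result


candidates = [("stent", 900), ("patient", 700), ("cookie", 650), ("login", 400)]

# Allowed list: only the listed terms can appear in the result.
print(apply_term_lists(candidates, allowed={"stent", "patient"}))

# Disallowed list: the listed terms can never appear in the result.
print(apply_term_lists(candidates, disallowed={"cookie", "login"}))
```

Both calls leave the same two terms in this sample. The difference is that an allowed list also blocks any term you did not anticipate, which is one reason to treat it as a last resort.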

Building Categories

The available build settings for categories can have a large effect on the final list of TNWs produced for a category. You can set the following options using the CategorySetDetails action, with the Fields and Values parameters.

Option	Description
`WeightingAlgorithm`	The algorithm to use for weighting. This setting affects how IDOL Server derives the list of terms, according to the properties of the terms. For example: For option `3`, the proper names algorithm, IDOL discards any terms that are not proper names. For option `1`, language categorization, IDOL also normalizes the list of weights. In most instances, the default algorithm is best, and performs well for the common case of data sets in which there are both large and small training categories.
`MinTermLength`	The minimum length that a stemmed term must have to be included in the TNW list. The default value is 3, which is sufficient in most cases. However, if your data set includes many two-letter acronyms or words, you might want to reduce this value.
`MaxTerms`	The maximum number of terms that a category can have in the final TNW list. If more terms are generated, only the highest weighted terms are used. For most data sets (with more than 100 categories, and more than 20 documents per category), a value of 150-200 is best. For small data sets (a low number of well-trained categories), a value of 50-150 might be best, but increasing the value to 200 is generally fine.
`MinTermWeight`	The minimum weight that a term must have for IDOL Server to include it in the category’s final TNW list. In general, you should use the `MaxTerms` option to determine the point at which terms are discarded, so in most cases you should leave this value as zero. However, if a few categories have an unusually large number of important terms, this value might help to improve your results. For this situation, typical values are between 400 and 1,000.
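A request that sets several of these options might be assembled as follows. This is a sketch under stated assumptions: the text above names only the CategorySetDetails action and its Fields and Values parameters, so the `Category` parameter that identifies the category, the comma-separated pairing of fields with values, and the sample category ID are assumptions to check against the action reference for your IDOL Server version.

```python
from urllib.parse import urlencode


def category_set_details_query(category, options):
    """Build a query string that sets build options on one category.

    `options` maps option names (for example MaxTerms) to values. Fields and
    Values are sent as comma-separated lists in matching order (assumed).
    """
    params = {
        "action": "CategorySetDetails",
        "Category": category,  # assumed name of the category identifier
        "Fields": ",".join(options),
        "Values": ",".join(str(value) for value in options.values()),
    }
    # Keep commas unescaped so the lists stay readable in the URL.
    return urlencode(params, safe=",")


query = category_set_details_query(
    "12345", {"MaxTerms": 150, "MinTermLength": 2}
)
print(query)
```

The resulting string would be appended to the host and port of your IDOL Server. Build the category again afterwards so that the new settings are reflected in its TNW list.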
