In most cases, the default settings for categorization give very good results. However, for some data sets you might be able to improve your results by considering the following factors.
Ensure that you have good quality training documents. IDOL Server uses the content of the index fields of training documents as training, so pay particular attention to the content of those fields. For example, if the index fields contain a lot of useless metadata from Web pages, you are unlikely to get meaningful results. Each document should contain at least 50 words of text, and the more words the better. Checking the quality of training data should be the first priority if categorization results are disappointing.
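As a rough pre-check of the 50-word guideline, you can count the words in the text that you plan to index for each training document. The following is a minimal sketch that runs outside IDOL Server. The function names and the `docs` mapping (document reference to index-field text) are hypothetical, and the word count is only an approximation of how IDOL Server tokenizes text.

```python
import re

MIN_WORDS = 50  # guideline from the text above


def count_words(text: str) -> int:
    """Count runs of word characters, as a rough proxy for the number of words."""
    return len(re.findall(r"\w+", text))


def short_documents(docs: dict, min_words: int = MIN_WORDS) -> list:
    """Return the references of documents whose text has fewer than min_words words."""
    return sorted(ref for ref, text in docs.items() if count_words(text) < min_words)
```

Documents that this check returns are candidates for replacement or for a closer look at what the index fields actually contain.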
Training documents should not be too large. If the training data consists of only a few very large documents, much of the text contributes little and the results might be misleading, because IDOL Server assigns greater importance to a word that occurs in several different documents than to a word that occurs many times in the same document.
Training data must be typical of the type of documents that you want to categorize. For example, if you want to categorize Web pages, then your training documents should be similar Web pages.
The best categories are clearly defined and distinct from other categories. You might want to consider merging categories that have only a small number of training documents (fewer than 20), because these might be too fine-grained.
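To find candidates for merging, you can count the training documents assigned to each category. The following is a minimal sketch that assumes you can export one category label per training document. The `small_categories` function and its input format are hypothetical and are not part of IDOL Server.

```python
from collections import Counter


def small_categories(labels, threshold=20):
    """Return {category: document count} for categories with fewer than threshold documents."""
    counts = Counter(labels)
    return {category: n for category, n in counts.items() if n < threshold}
```

For example, `small_categories(["Finance"] * 25 + ["Leasing"] * 5)` reports only the `Leasing` category, which you might then consider merging into a broader category.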
Ensure that your training data has been correctly categorized. Consider the entire content of the document, not just the title or the first few lines.
Generally, the more categories you have, the more training data you need to distinguish between them. Categories that overlap with others require more training data than categories that are clearly distinct.
Avoid using a list of words as training data. IDOL Server is able to calculate the most appropriate concepts from training documents, and does not rely on a few human-chosen terms.
Categorization depends on the contents of the whole index, and not just the training documents. If you have two IDOL Server indexes that contain different documents, they will categorize differently, even if you use the same training documents.
The weights that training gives to terms depend on how often those terms occur in the index, not just on the training documents. For categorization, you obtain the best results if the index contains the complete set of training documents and nothing else.
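The effect of the index on term weights can be illustrated with a generic inverse-document-frequency calculation. This sketch is not IDOL Server's weighting algorithm. It is a simplified stand-in, with hypothetical function names, that shows why the same training documents can produce different weights against different indexes. Each document is represented as a list of terms, and a term counts once per training document that contains it.

```python
import math


def idf(term, index_docs):
    """Smoothed inverse document frequency of a term across the whole index."""
    n = len(index_docs)
    df = sum(1 for doc in index_docs if term in doc)
    return math.log((n + 1) / (df + 1))


def term_weights(training_docs, index_docs):
    """Sum the index-wide IDF of each term once per training document that contains it."""
    weights = {}
    for doc in training_docs:
        for term in set(doc):  # repeated occurrences within one document count once
            weights[term] = weights.get(term, 0.0) + idf(term, index_docs)
    return weights


training = [["bank", "loan"], ["bank", "rate"]]
# Index A: "bank" appears in every document, so it carries no distinguishing weight.
index_a = training + [["river", "bank"], ["bank", "holiday"]]
# Index B: "bank" appears only in the training documents, so it is distinctive.
index_b = training + [["river", "fish"], ["tree", "leaf"]]

weights_a = term_weights(training, index_a)
weights_b = term_weights(training, index_b)
```

In this toy model, `bank` receives a weight of zero against index A and a positive weight against index B, even though the training documents are identical.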
Avoid using URLs to train your categories, because the linked file is likely to contain extra content, such as adverts, that is not useful. It is best to use the documents in the data index, which generally have been scrubbed of extraneous content and are in a useful format.
Check that IDOL Server can access your training documents; some might require security details.
Use the CategorySetAllowedTerms and CategorySetDisallowedTerms actions. These actions allow you to define what terms can and cannot be used to derive the TNWs (terms and weights) for a category from its training. With CategorySetAllowedTerms, only the terms in the allowed list can be used in TNWs. With CategorySetDisallowedTerms, no term in the disallowed list can be used in the TNWs. With these actions, you can allow or disallow terms that are specific to your industry. You should only use these actions to alter terms as a last resort; because IDOL Server is examining a large quantity of document data, certain terms might be more relevant than is apparent.
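The difference between the two lists can be modeled as a filter on candidate terms. The following sketch only illustrates the semantics described above. It does not call IDOL Server, and the `filter_terms` function and its arguments are hypothetical.

```python
def filter_terms(candidates, allowed=None, disallowed=None):
    """Filter a {term: weight} mapping.

    allowed: if given, only terms in this set are kept (allowed-list behavior).
    disallowed: if given, terms in this set are removed (disallowed-list behavior).
    """
    result = {}
    for term, weight in candidates.items():
        if allowed is not None and term not in allowed:
            continue
        if disallowed is not None and term in disallowed:
            continue
        result[term] = weight
    return result
```

An allowed list is the more restrictive of the two, because every term that is not listed is excluded. A disallowed list removes only the terms that you name and leaves the rest of the derived terms unchanged.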
The available build settings for categories can have a large effect on the final list of TNWs produced for a category. You can set the following options using the CategorySetDetails action, with the Fields and Values parameters.
Option | Description |
---|---|
WeightingAlgorithm | The algorithm to use for weighting. This setting affects how IDOL Server derives the list of terms, according to the properties of the terms. In most instances, the default algorithm is best, and performs well for the common case of data sets in which there are both large and small training categories. |
MinTermLength | The minimum length that a stemmed term must be for it to be included in the TNW list. The default value is 3, which is sufficient in most cases. However, if your data set includes many two-letter acronyms or words, you might want to reduce this value. |
MaxTerms | The maximum number of terms that a category can have in the final TNW list. If more terms are generated, only the highest weighted terms are used. |
MinTermWeight | The minimum weight that a term must have for IDOL Server to include it in the category’s final TNW list. In general, you should use the MaxTerms option to determine the point at which terms are discarded, so in most cases you should leave this value as zero. However, if a few categories have an unusually large number of important terms, this value might help to improve your results. For this situation, typical values are between 400 and 1,000. |
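The way that MinTermLength, MinTermWeight, and MaxTerms interact can be sketched as a filter over a candidate list of terms and weights. This is a simplified model under stated assumptions, not IDOL Server's implementation: the `apply_build_settings` function is hypothetical, the order in which the filters are applied is assumed, and `max_terms=None` (no limit) is a placeholder rather than the product default.

```python
def apply_build_settings(tnws, min_term_length=3, max_terms=None, min_term_weight=0):
    """Return (term, weight) pairs that survive the build settings, highest weight first.

    tnws: {stemmed term: weight} candidates derived from training.
    """
    # Drop terms that are too short or too lightly weighted.
    kept = [
        (term, weight)
        for term, weight in tnws.items()
        if len(term) >= min_term_length and weight >= min_term_weight
    ]
    # Keep only the highest weighted terms when a maximum is set.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    if max_terms is not None:
        kept = kept[:max_terms]
    return kept


candidates = {"ai": 900, "robot": 800, "sensor": 500, "arm": 300}
default_list = apply_build_settings(candidates)
short_terms_list = apply_build_settings(candidates, min_term_length=2, max_terms=2)
```

With the default minimum length of 3, the two-letter term `ai` is dropped despite having the highest weight. Reducing `min_term_length` to 2 keeps it, which mirrors the advice above for data sets that contain many two-letter acronyms.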