In categorization, you assign documents to categories according to what each document and the category have in common.
A category defines a subject of interest. You can train categories with a set of documents, or by constructing Boolean expressions or text that describes the subject.
You can use categories to return documents that match the category, or to suggest the categories that a document or piece of text might match.
Categorization automatically identifies the ideas that documents contain, and classifies these documents according to their content. You can tag the documents according to the category, and you can use the tags as a trigger for a further workflow process, such as approval or examination.
Categories provide an intuitive way to navigate documents, to filter out unimportant information from a large volume of content, and to focus only on what matters.
There are three kinds of categorization: conceptual categorization, binary categorization, and simple categorization. The following section describes conceptual categorization.
Before you can use categories, you must train and build them.
You can either create categories manually, or automatically create categories from your documents.
Manual Categories. You can manually create a category, and add training to define the kinds of documents that you want to assign to the category. Creating categories manually is most useful if you want your categories to follow a specific structure.
Clusters. You can use clustering with snapshots and import the resulting clusters as categories. Automatic categorization from clusters is useful if you want to be able to refine results by categories, but do not need a specific set of categories.
Automatic clustering also lets you analyze your content and generate a visualization that shows the clusters of content.
Category training describes the kinds of documents that the category should find. Training can include:
examples of the documents that the category should find (Document Training).
Boolean and FieldText expressions (Boolean Training) that result documents must match (for example, a document must contain a certain value in a certain field). This type of training applies only to conceptual categorization.
a list of words and concepts (Plain Text Training) that result documents must match, at least in part. For example, Olympics sport.
Micro Focus recommends that you train categories with many example documents.
When IDOL Server builds a category, it analyzes the training and produces a list of the most important terms in the training, with a list of weights to indicate how important each term is. It gives terms a higher weight when they occur more often in the different training buffers (relative to how often they occur in the content index).
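The weighting idea described above can be illustrated with a minimal sketch. This is not IDOL Server's actual algorithm, which the text does not specify. It assumes a simple ratio of how often a term appears across the training documents to how often it appears in the content index. The function name term_weights, its parameters, and the add-one smoothing are all hypothetical.

```python
from collections import Counter


def term_weights(training_docs, index_doc_freq, index_size):
    """Rank terms by training frequency relative to index frequency.

    training_docs:  list of training texts (hypothetical stand-in for
                    the training buffers)
    index_doc_freq: dict mapping term -> number of indexed documents
                    that contain the term (assumed to be available)
    index_size:     total number of documents in the content index
    """
    # Count how many training documents contain each term.
    train_df = Counter()
    for doc in training_docs:
        train_df.update(set(doc.lower().split()))

    weights = {}
    for term, df in train_df.items():
        train_rate = df / len(training_docs)
        # Add-one smoothing so terms missing from the index do not
        # cause a division by zero (an assumption of this sketch).
        index_rate = (index_doc_freq.get(term, 0) + 1) / (index_size + 1)
        weights[term] = train_rate / index_rate

    # Most important terms first.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


training = [
    "olympics sport medal",
    "olympics sport the",
    "the olympics athlete",
]
index_counts = {"the": 900, "sport": 50, "olympics": 10,
                "medal": 20, "athlete": 20}
ranked = term_weights(training, index_counts, index_size=1000)
```

In this toy data, a term that appears in every training document but rarely in the index ranks highest, while a common word that appears almost everywhere in the index ranks lowest.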
You can restrict how often terms must appear in the training, specify terms that must be explicitly included or discarded, and apply stop lists. In addition, you can apply an attenuation factor, which determines how quickly a term decreases in importance if it is not relevant.
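As a rough illustration of these restrictions, the following sketch filters a set of weighted terms. It does not use real IDOL Server settings: the function apply_term_restrictions and all of its parameter names are hypothetical. The attenuation factor is not modeled, because its formula is not given here.

```python
def apply_term_restrictions(weights, term_counts, min_occurrences=2,
                            must_include=(), must_exclude=(), stop_list=()):
    """Filter weighted terms using hypothetical restriction settings.

    weights:         dict mapping term -> weight
    term_counts:     dict mapping term -> occurrences in the training
    min_occurrences: minimum occurrences needed for a term to be kept
    must_include:    terms kept regardless of the occurrence threshold
    must_exclude:    terms always discarded
    stop_list:       common words that are always discarded
    """
    # In this sketch, exclusion takes priority over inclusion.
    blocked = set(must_exclude) | set(stop_list)
    forced = set(must_include)

    kept = {}
    for term, weight in weights.items():
        if term in blocked:
            continue
        if term in forced or term_counts.get(term, 0) >= min_occurrences:
            kept[term] = weight
    return kept


example_weights = {"olympics": 91.0, "medal": 15.9, "athlete": 15.9,
                   "sport": 13.1, "the": 0.7}
example_counts = {"olympics": 3, "medal": 1, "athlete": 1,
                  "sport": 2, "the": 2}
filtered = apply_term_restrictions(
    example_weights, example_counts,
    min_occurrences=2,
    must_include={"medal"},
    must_exclude={"sport"},
    stop_list={"the"},
)
```

Here "olympics" passes the occurrence threshold, "medal" is kept only because it is explicitly included, and "athlete", "sport", and "the" are dropped by the threshold, the exclusion list, and the stop list respectively.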
For details, see the settings given in the Troubleshoot Categorization section.