Document Tag Clustering

Workflow

To cluster documents as part of a post-index process, use the DRETAGDOCCLUSTERS index action. This index action can cluster every document in your index, or a subset defined by a range of document IDs. You also specify the field that IDOL must use to create cluster titles (the TagSourceField), which it uses to populate the tag field (the TagField). When the clustering is complete, each document has a field that indicates the cluster that it belongs to.
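
As a rough illustration of the workflow above, the following Python sketch assembles a DRETAGDOCCLUSTERS request URL without sending it. TagField and TagSourceField are named in this document. The MinID and MaxID parameter names, the host, the index port 9001, and the field names CLUSTER and DRETITLE are assumptions made for the example, so check them against the index action reference for your IDOL version.

```python
from urllib.parse import urlencode


def build_tag_clusters_url(host, index_port, tag_field, tag_source_field,
                           min_id=None, max_id=None):
    """Build (but do not send) a DRETAGDOCCLUSTERS index action URL."""
    params = {
        "TagField": tag_field,               # field that receives the cluster title
        "TagSourceField": tag_source_field,  # field that supplies the cluster title
    }
    # MinID and MaxID are assumed names for the document ID range parameters.
    if min_id is not None:
        params["MinID"] = min_id
    if max_id is not None:
        params["MaxID"] = max_id
    return f"http://{host}:{index_port}/DRETAGDOCCLUSTERS?{urlencode(params)}"


# Hypothetical host, port, and field names, for illustration only.
url = build_tag_clusters_url("localhost", 9001, "CLUSTER", "DRETITLE",
                             min_id=1, max_id=5000)
print(url)
```

Omitting the ID range in this sketch leaves the range parameters out of the URL, which corresponds to clustering every document in the index.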

Algorithmic Outline

For every document in the specified ID range, IDOL Server sends a Suggest action. It selects the most relevant suggested item that has a TagSourceField, and adds the value of that TagSourceField to the document for which it sent the Suggest action. If there is no suggested item, IDOL Server tags the document with its own TagSourceField value. If a document already has a TagField, IDOL Server skips it and moves on to the next document.
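
The outline above can be sketched as a small in-memory simulation. This is a literal reading of the outline, not IDOL Server's actual implementation: the `suggest` callable is a hypothetical stand-in for the Suggest action that returns document IDs ordered from most to least relevant, and the field names CLUSTER and TITLE are placeholders.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]


def tag_doc_clusters(docs: Dict[int, Document],
                     min_id: int,
                     max_id: int,
                     suggest: Callable[[int], List[int]],
                     tag_field: str = "CLUSTER",
                     tag_source_field: str = "TITLE") -> None:
    """Tag each untagged document in [min_id, max_id], modifying docs in place."""
    for doc_id in range(min_id, max_id + 1):
        doc = docs.get(doc_id)
        if doc is None or tag_field in doc:
            continue  # missing ID, or already tagged: move on to the next document

        tag = None
        # suggest() is assumed to return IDs with the most relevant item first.
        for suggested_id in suggest(doc_id):
            candidate = docs.get(suggested_id, {})
            if tag_source_field in candidate:
                tag = candidate[tag_source_field]
                break

        if tag is None:
            # No usable suggestion: fall back to the document's own source field.
            tag = doc.get(tag_source_field)

        if tag is not None:
            doc[tag_field] = tag


# Toy data: document 3 is already tagged, document 4 has no suggestions.
docs = {
    1: {"TITLE": "A"},
    2: {"TITLE": "B"},
    3: {"TITLE": "C", "CLUSTER": "X"},
    4: {"TITLE": "D"},
}
suggestions = {1: [2], 2: [1], 3: [1], 4: []}
tag_doc_clusters(docs, 1, 4, lambda doc_id: suggestions.get(doc_id, []))
```

In the toy data, document 3 keeps its existing tag and document 4 falls back to its own title, which mirrors the last two rules of the outline.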

Things to Consider

Consider the following points when you are deciding whether to use document tag clustering:

This method is useful for duplicate content detection. Using a CheckSumField clusters documents with identical content together, and allows you to easily retrieve such documents. You can then perform further actions, such as moving or deleting the documents. No other approach to clustering makes this a simple process.

The results of this operation are saved permanently, and you can easily retrieve them. Other approaches require regeneration of the results if you need to use them again.

You can cluster documents flexibly, from the entire index to a specified subset.

You can use the RelevanceField option to see how well each document fits into its cluster. Some approaches do not offer this transparency.

Some clusters might contain only one document; documents with no suggested documents are tagged with their own TagSourceField, which means that outlier documents might form very small clusters. There is no filtering mechanism to discard these clusters.

The title of each cluster is the value of the TagSourceField in the first document to be assigned to the cluster, so it might not be representative of the cluster as a whole.

The algorithm becomes slower as the document range is increased. You might need to experiment to find the best compromise between range size and speed for your end use.

This method uses an index thread, which can decrease server performance for other operations.

You can run this index action only once for a particular set of documents and TagField. After documents have been tagged, you cannot replace the value of the TagField with this process. If you want to perform the operation again, you must either use a different field, or remove the TagField from the documents, which might affect other processes that rely on it. Other clustering processes allow simple repetition of the operation, so they can dynamically reflect the index.

If results are not satisfactory, you cannot easily adjust them without removing the TagField and adjusting the index action parameters.

The only indication of the significance of a cluster is the number of documents that it contains. If you require further indications, consider using a different approach.

This process clusters documents only within a single IDOL Server Content component. In a distributed environment, the DRETAGDOCCLUSTERS index action creates clusters individually within each child server, and there is no straightforward method to compare or merge clusters from different child servers. If you need to cluster all the documents in a distributed environment, consider using a different clustering approach.
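
Because the index action does not discard very small clusters, and cluster size is the only indication of significance, one option is to group and filter the tagged documents on the client side after you retrieve them. The following Python sketch assumes that you have already retrieved the documents into a dictionary that maps document ID to fields. The function names and the CLUSTER field name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_tag(docs: Dict[int, Dict[str, str]],
                 tag_field: str = "CLUSTER") -> Dict[str, List[int]]:
    """Map each cluster title to the IDs of the documents tagged with it."""
    clusters = defaultdict(list)
    for doc_id, fields in docs.items():
        tag = fields.get(tag_field)
        if tag is not None:  # ignore documents that were never tagged
            clusters[tag].append(doc_id)
    return dict(clusters)


def drop_small_clusters(clusters: Dict[str, List[int]],
                        min_size: int = 2) -> Dict[str, List[int]]:
    """Keep only clusters that contain at least min_size documents."""
    return {tag: ids for tag, ids in clusters.items() if len(ids) >= min_size}


# Toy data: two documents share a cluster, one is a singleton, one is untagged.
tagged_docs = {
    10: {"CLUSTER": "Quarterly report"},
    11: {"CLUSTER": "Quarterly report"},
    12: {"CLUSTER": "Holiday schedule"},
    13: {},
}
clusters = group_by_tag(tagged_docs)
significant = drop_small_clusters(clusters)
```

When the clusters come from a CheckSumField run, the same grouping lists each set of documents with identical content, which you can then review before you move or delete them.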
