IDOL can identify clusters by using representations of what is present in the set of documents at a certain point in time. This representation is known as a snapshot. The aim of this type of clustering is to identify trends in the data, rather than to include every document from the total set.
Snapshots contain details of potential clusters, known as seeds. A seed is made up of a document and the best-scoring results from a Suggest action. The number and quality of suggested documents must exceed specified levels.
With scheduling, you can take snapshots at regular intervals to keep track of the set of documents as it changes over time. You can also use snapshots to create visual representations of the state of your set at a particular time, or over a particular time frame.
HPE recommends that you perform clustering operations with the Cluster actions in the Category component. To create a set of clusters, run the following actions:
ClusterSnapshot. This action takes a snapshot of your data. You can also add a query if you want to find clusters in a certain set of documents.
ClusterCluster. This action identifies clusters in the specified snapshot and saves them to disk for later use.
ClusterResults. This action returns the clusters as an XML response. It returns details for each cluster, including the title, document details, and an importance score (see Algorithmic Outline).
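The three actions above run in sequence, each one reading the job that the previous one wrote. The following Python sketch only builds the request URLs for that sequence and sends nothing. The host, port, URL layout, and job names are hypothetical, and the TargetJobName and SourceJobName parameter names are assumptions to check against the Category component action reference.

```python
from urllib.parse import urlencode

# Hypothetical host and ACI port of the Category component (assumption).
BASE_URL = "http://localhost:9020/"


def action_url(action, **params):
    """Build a request URL of the form <base>?action=Name&Key=Value."""
    query = urlencode({"action": action, **params})
    return f"{BASE_URL}?{query}"


def cluster_pipeline(snapshot_job, cluster_job):
    """Return the three request URLs in the order they would be run.

    The job-name parameters are assumed, not taken from the text above:
    each step is taken to write a named job that the next step reads.
    """
    return [
        # 1. Take a snapshot of the data.
        action_url("ClusterSnapshot", TargetJobName=snapshot_job),
        # 2. Identify clusters in that snapshot and save them to disk.
        action_url("ClusterCluster", SourceJobName=snapshot_job,
                   TargetJobName=cluster_job),
        # 3. Return the saved clusters as an XML response.
        action_url("ClusterResults", SourceJobName=cluster_job),
    ]


if __name__ == "__main__":
    for url in cluster_pipeline("SNAP_news", "CLUST_news"):
        print(url)
```

Keeping the URL construction separate from any HTTP call makes it easy to review the chain before scheduling it or sending it to a server.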
Consider the following points when you are deciding whether to use clustering from snapshots:
You can schedule the various Cluster functions to run repeatedly at particular times.
IDOL uses a background processing thread for most of the functions, so the action threads remain available. All other clustering approaches require the use of an action or index thread, potentially for a relatively long time.
IDOL automatically produces the visual representations of clusters. For other clustering approaches, you must use the data returned to create your own visualization.
IDOL saves the cluster results to disk so that you can reuse them for various purposes without regenerating the results. Other clustering approaches generate results on request and do not store them.
You can also configure IDOL to remove old cluster results after a specified expiration time, to ensure that old or irrelevant data is not present.
Cluster snapshots consider your whole index, or a user-specified subsection. This allows you to identify index-wide trends, which other approaches might miss. Other clustering approaches consider only a part of the index.
You can take a snapshot for any point in the past, allowing you to identify clusters from that time period, even if historical data has only recently been added to the index. Other approaches can consider only the current state of the index.
You can optionally create your own visual representations of cluster data; you can easily view and retrieve all the cluster data. You can also enhance the existing representations, for example by overlaying cluster data onto a spectrograph to produce an intuitive visualization of the index.
IDOL gives clusters a score according to their importance, so you can easily see the most valuable trends, and decide whether to consider clusters with lower scores. Other approaches do not produce scores, which might imply that each cluster is equally important.
You can automatically filter out weak clusters (clusters with very few documents, or low importance scores). With this option, you can remove irrelevant clusters from your results. Other approaches, such as query result clustering, might produce many clusters with only a few documents.
This approach is slow compared to other clustering approaches, which is why it runs in a background process. The cluster extraction process is generally quite quick, but snapshot creation can take several minutes for large indexes.
It is not always easy to diagnose the cause when clustering produces poor results. You might need to experiment to achieve the best results.
You must chain together several actions to produce clusters, or visual representations. This multi-stage process is not the most user-friendly method of clustering documents.
Additionally, you must examine the log files or use the ScheduleGetResult action repeatedly to find out when each function has finished. Other approaches present cluster information immediately upon completion.
This method is highly configurable, which gives a large scope for confusion or poor results.
The default values for clustering parameters do not take your index into account, so it is easy to obtain strange or poor results if your index has unusual properties, for example if it is small. You might need to experiment to find the best configuration.
HPE does not recommend this process if you need to perform clustering very frequently, because it is slow. In this case, you might want to use a lighter weight approach, such as dynamic clustering.
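One of the points above notes that you must use the ScheduleGetResult action repeatedly to find out when each function has finished. A minimal polling sketch follows. The helper name wait_until_finished and the check_finished callable are hypothetical: check_finished stands in for whatever code sends ScheduleGetResult and decides from the response whether the job is complete, because the response format is not described here.

```python
import time


def wait_until_finished(check_finished, interval_seconds=5.0,
                        max_attempts=60, sleep=time.sleep):
    """Poll check_finished() until it returns True.

    check_finished is a caller-supplied (hypothetical) function that would
    send a ScheduleGetResult request and report whether the job is done.
    Returns the number of attempts made, or raises TimeoutError if the job
    has not finished after max_attempts checks.
    """
    for attempt in range(1, max_attempts + 1):
        if check_finished():
            return attempt
        # Wait between checks, but not after the final failed check.
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"job not finished after {max_attempts} checks")


if __name__ == "__main__":
    # Simulated job that reports completion on the third check.
    responses = iter([False, False, True])
    attempts = wait_until_finished(lambda: next(responses),
                                   interval_seconds=0.0)
    print(f"finished after {attempts} checks")
```

Passing the sleep function as a parameter keeps the loop testable without real delays. A polling interval of several seconds or more is a reasonable starting point, given that snapshot creation can take several minutes for large indexes.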