You can create clusters immediately from the results of a Query or Suggest action. This method covers all documents in the result set.
To cluster the results of a Query or Suggest action, add the Cluster action parameter. In addition to the ordinary query results, IDOL Server then returns clustering information in each result. For example:
<autn:hit>
  <autn:reference>http://en.wikipedia.org/wiki/Fauna of Australia</autn:reference>
  <autn:id>24152</autn:id>
  <autn:section>0</autn:section>
  <autn:weight>90.48</autn:weight>
  <autn:cluster>4</autn:cluster>
  <autn:clustertitle>supercontinent Gondwana, Cretaceous, fauna, MYA</autn:clustertitle>
  <autn:links>FAUNA,FLORA</autn:links>
  <autn:database>Default</autn:database>
  <autn:title>Fauna of Australia</autn:title>
</autn:hit>
Every result is tagged with the ID and title of the cluster it is assigned to. There is no restriction on the number of clusters that can be created, or on how many documents make up a cluster. For example, you might have a cluster with only one document.
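The following Python sketch shows one way a client might request clustered results and then group the returned hits by cluster. It is illustrative only: the host and port, the `true` value for the Cluster parameter, the namespace URI bound to the `autn:` prefix, and the `<autnresponse>` wrapper element are assumptions to check against your own server's responses, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import quote, urlencode

# Assumed namespace URI for the autn: prefix; confirm it against the
# root element of a real response from your server.
AUTN_NS = "http://schemas.autonomy.com/aci/"


def build_cluster_query_url(base_url, text):
    """Build a Query action URL with the Cluster parameter added.

    The value "true" is an assumption; the source only says to add the
    Cluster action parameter.
    """
    params = {"action": "Query", "Text": text, "Cluster": "true"}
    return f"{base_url}/{urlencode(params, quote_via=quote)}"


def group_hits_by_cluster(response_xml):
    """Return {cluster_id: {"title": ..., "references": [...]}}."""
    root = ET.fromstring(response_xml)
    clusters = defaultdict(lambda: {"title": None, "references": []})
    for hit in root.iter(f"{{{AUTN_NS}}}hit"):
        cluster_id = int(hit.findtext(f"{{{AUTN_NS}}}cluster"))
        entry = clusters[cluster_id]
        entry["title"] = hit.findtext(f"{{{AUTN_NS}}}clustertitle")
        entry["references"].append(hit.findtext(f"{{{AUTN_NS}}}reference"))
    return dict(clusters)


# The hit shown above, wrapped in an assumed root element so that the
# autn: prefix is declared and the fragment parses on its own.
SAMPLE_RESPONSE = f"""
<autnresponse xmlns:autn="{AUTN_NS}">
  <autn:hit>
    <autn:reference>http://en.wikipedia.org/wiki/Fauna of Australia</autn:reference>
    <autn:id>24152</autn:id>
    <autn:cluster>4</autn:cluster>
    <autn:clustertitle>supercontinent Gondwana, Cretaceous, fauna, MYA</autn:clustertitle>
    <autn:title>Fauna of Australia</autn:title>
  </autn:hit>
</autnresponse>
"""

if __name__ == "__main__":
    # localhost:9000 is a placeholder, not a documented default.
    print(build_cluster_query_url("http://localhost:9000", "fauna of australia"))
    for cluster_id, info in group_hits_by_cluster(SAMPLE_RESPONSE).items():
        print(cluster_id, info["title"], len(info["references"]))
```

Grouping on the client side also gives you the document count for each cluster, which is the only indication of cluster significance that this method provides.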
You can use two configuration parameters to control clustering behavior:
ClusterThreshold. The minimum percentage relevance that two documents must have to be in the same cluster. The default value is 50.
ClusterTitleLength. The maximum number of terms and phrases that can be returned in a cluster title. The default value is 4.
In query result clustering, IDOL Server selects the most relevant document in the query result set as the basis for the first cluster. It then compares the remaining results to this document, and adds them to the cluster if the relevance to the first document exceeds the configured ClusterThreshold. IDOL Server then applies this process to the remaining unclustered documents, and continues until all results are assigned to a cluster.
For example, suppose a query returns 10 results, numbered 0 to 9, and ClusterThreshold has its default value of 50. Document 0 is the basis of the first cluster. When compared to this basis document, documents 1 and 2 have a relevance score higher than 50, so they are added to the cluster. The first cluster therefore contains documents 0, 1, and 2.
This process continues for the remaining documents. Document 3 is the basis for the second cluster, and documents 4, 5, 6, and 7 have high enough relevance scores and are added to this cluster. Document 8 is the basis of the third cluster, and document 9 is similar to it. No results remain, so the process ends.
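The process described above can be sketched in Python as a simple greedy loop. This is a reading of the description, not IDOL Server's actual implementation: the `relevance` callback stands in for whatever document-to-document relevance measure the server uses, and the toy scores (80 and 20) are invented purely to reproduce the worked example.

```python
def cluster_results(num_results, relevance, threshold=50.0):
    """Greedily cluster results 0..num_results-1.

    Results are assumed to be ordered by query relevance, so the first
    unclustered result is always the basis of the next cluster.
    `relevance(basis, doc)` is a hypothetical callback that returns a
    percentage relevance between two documents.
    """
    unclustered = list(range(num_results))
    clusters = []
    while unclustered:
        basis = unclustered[0]
        members = [basis]
        remaining = []
        for doc in unclustered[1:]:
            # A document joins the cluster only if its relevance to the
            # basis document exceeds the threshold.
            if relevance(basis, doc) > threshold:
                members.append(doc)
            else:
                remaining.append(doc)
        clusters.append(members)
        unclustered = remaining
    return clusters


# Invented grouping that mirrors the worked example above.
GROUP_OF = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2}


def toy_relevance(a, b):
    """Return 80 for documents in the same toy group, otherwise 20."""
    return 80.0 if GROUP_OF[a] == GROUP_OF[b] else 20.0


if __name__ == "__main__":
    print(cluster_results(10, toy_relevance))
    # [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9]]
```

Each pass compares the basis document against every remaining result, so in the worst case, when every cluster holds a single document, the number of comparisons grows quadratically with the size of the result set.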
IDOL Server creates a title for each cluster according to the best terms and phrases contained in the cluster documents.
Consider the following points when you are deciding whether to use clustering from query results:
Results return immediately on completion. Some other approaches require a separate action to retrieve the clustering results.
Cluster results reflect the state of the index at the time the query is run, and the results reflect any new content. Other approaches might produce results that are already irrelevant when new content is indexed, because they run the clustering operation on the state of the index at a fixed point in the past. This approach is well-suited to indexes that are constantly being updated.
You can easily adjust the results, using two simple configuration parameters.
The process slows down markedly as the result set size increases. You might need to experiment to find the best compromise between the number of results needed and the response speed.
The process uses an action thread, which can reduce server performance for other operations.
The results of query result clustering operations are not saved to disk. You must generate the results again to reuse them, which is not time-efficient, especially for large result sets.
This method might produce many small clusters, which do not provide much information about the index. This is especially true for large result sets.
Only result data is returned. If you want to produce a visualization, you must create it yourself.
The only indication of a cluster's significance is the number of documents it contains. If you require further indications, consider a different clustering approach.
The clusters returned are indicative only of the result set, not the whole index. To find index-wide trends, use a different approach.
This process clusters documents only in a single IDOL Server Content component. If you want to cluster documents in a distributed environment, consider using a different approach.