Algorithmic Outline

When clustering from snapshots, IDOL first creates a snapshot of seeds, and then uses those seeds to form the clusters.

Create Seeds

The basic unit of clustering is a seed, around which clusters form.

To form a seed, a query is run on your index, and documents are picked at random from the results. For each of these documents, a Suggest action is run. If enough suggested items are returned, with a high enough similarity, the document and its suggested items are used as a seed.

You can configure how many suggested documents are required, and the minimum level of similarity, with the SeedSize and SeedBindLevel configuration parameters. A high SeedSize means that a larger number of suggested items are required to form a seed. A high SeedBindLevel means that the documents in the seed must be very similar to each other.
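The seed test described above can be sketched in a few lines of Python. This is only an illustration of the logic, not IDOL's implementation: the function name try_form_seed, the suggest callback, and the (document ID, similarity) result shape are all hypothetical, and the sketch assumes that SeedSize counts suggested items only and that SeedBindLevel is a minimum similarity on the same scale as the scores.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical result shape: (document_id, similarity) pairs from a suggest call.
Suggestion = Tuple[str, float]


def try_form_seed(
    doc_id: str,
    suggest: Callable[[str], List[Suggestion]],
    seed_size: int,
    seed_bind_level: float,
) -> Optional[List[str]]:
    """Return the seed's document IDs, or None if the candidate is rejected."""
    # Keep only suggestions that meet the minimum similarity (assumed meaning
    # of SeedBindLevel).
    close = [sid for sid, score in suggest(doc_id) if score >= seed_bind_level]
    # Reject the candidate when too few similar documents come back (assumed
    # meaning of SeedSize).
    if len(close) < seed_size:
        return None
    # The document and its suggested items together make up the seed.
    return [doc_id] + close
```

With a stand-in suggest function that returns scores of 90, 75, and 40, a SeedBindLevel of 70 keeps two suggestions, so the candidate becomes a seed when SeedSize is 2 and is rejected when SeedSize is 3.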

IDOL continues forming seeds until it has created enough, or until it exceeds the maximum number of attempts. You can configure the number of clusters that you want, which determines the number of seeds IDOL forms. When it has created all the seeds, IDOL writes the seed details to a file, in a binary format. This file is known as a snapshot.
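The stopping rule can be illustrated with a small loop. This is a sketch under stated assumptions rather than IDOL's actual code: collect_seeds, try_seed, wanted, and max_attempts are hypothetical names, the candidate list stands in for the query results, and writing the binary snapshot file is left out because its format is not described here.

```python
import random
from typing import Callable, List, Optional, Sequence


def collect_seeds(
    candidates: Sequence[str],
    try_seed: Callable[[str], Optional[List[str]]],
    wanted: int,
    max_attempts: int,
    rng: random.Random,
) -> List[List[str]]:
    """Form seeds until enough exist or the attempt budget is used up."""
    seeds: List[List[str]] = []
    if not candidates:
        return seeds  # nothing to pick from
    attempts = 0
    while len(seeds) < wanted and attempts < max_attempts:
        attempts += 1
        # Pick a document at random from the query results.
        doc_id = rng.choice(candidates)
        # try_seed returns the seed's documents, or None when rejected.
        seed = try_seed(doc_id)
        if seed is not None:
            seeds.append(seed)
    return seeds
```

In this simplified version the same document can be picked more than once, and the loop stops when it reaches the attempt limit; a real implementation might handle both cases differently.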

Use the Snapshots

You can use snapshots in a number of ways. Their primary purpose is to identify clusters.

IDOL loads the seed details from the snapshot file, and then merges seeds with other seeds that are sufficiently similar. You can configure how similar the seeds must be, by using the BindLevel configuration parameter. A high value of the parameter means that the seeds must be very similar for them to be merged.
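The merge step can be pictured as a simple greedy loop. The sketch below is illustrative only. It uses Jaccard overlap of document IDs as a stand-in similarity measure, which is an assumption: how IDOL actually compares seeds is not described here. The names jaccard and merge_seeds are hypothetical, and bind_level is expressed as a fraction between 0 and 1 for this example.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity: shared documents divided by all documents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_seeds(seeds: List[Set[str]], bind_level: float) -> List[Set[str]]:
    """Merge seeds pairwise until no remaining pair is similar enough."""
    clusters = [set(s) for s in seeds]  # copy so the input is not modified
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i], clusters[j]) >= bind_level:
                    # Fold the second seed into the first and start over,
                    # because the merged seed may now match other seeds.
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Raising bind_level in this sketch produces fewer merges and therefore more, smaller clusters, which matches the behavior described for a high BindLevel.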

When IDOL has merged as many seeds as possible, the merged seeds form clusters. IDOL then removes any duplicate document references from the clusters, and returns the configured number of the best clusters.

Clusters are ranked by the What’s Hot score. This score measures the similarity of the documents in the cluster, and the narrowness of the concepts it represents. A higher What’s Hot score indicates a more significant cluster.
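The final selection, removing duplicate references and then keeping the highest-ranked clusters, might look like the following sketch. The function best_clusters and its parameters are hypothetical. The score callback stands in for the What’s Hot score, whose formula is not given here, and the sketch assumes that duplicates are removed within each cluster.

```python
from typing import Callable, List


def best_clusters(
    clusters: List[List[str]],
    score: Callable[[List[str]], float],
    count: int,
) -> List[List[str]]:
    """Drop repeated document references, then keep the top-scoring clusters."""
    # dict.fromkeys removes repeats while keeping first-seen order.
    deduped = [list(dict.fromkeys(c)) for c in clusters]
    # Highest score first; keep only the configured number of clusters.
    return sorted(deduped, key=score, reverse=True)[:count]
```

Passing len as the score, for example, ranks clusters by size, which is only a placeholder for a real significance measure.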

IDOL generates a title for the cluster, which is usually derived from the titles of documents in the cluster. However, you can use other document fields to generate titles (for example, summary fields).

