Deduplicate Indexed Content

The DREDUPLICATE index action performs duplicate detection on a set of documents in the index. While the KillDuplicates parameter in the DREADD index action detects duplicates during the index process, DREDUPLICATE acts on sets of documents after they have been indexed. In addition, while the KillDuplicates parameter primarily prevents duplicate documents from being indexed, DREDUPLICATE can delete duplicate documents, tag them with a prescribed value, or move them to a different database.

All DREDUPLICATE index actions must include a ReferenceField parameter. The process detects duplicates when two documents have the same value in this field. You can also add secondary filters to compare the values of an additional field.

For example, you might want to check whether the content of a document has changed when it is seen for the second time. In this case, the ReferenceField might be the document URL, and the secondary filter might be the document content field.

IDOL Server processes the documents in increasing docid order. It keeps the latest version of a set of duplicates, and it moves, deletes, or tags the matching documents with a lower docid. Therefore, for any set of duplicates, only the version of the document that was indexed last is not deleted, moved, or tagged.
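The ordering rule above can be illustrated with a small Python model. This is only a conceptual sketch, not IDOL Server's implementation: the function name find_superseded, the in-memory dictionary of documents, and the choice to skip documents that lack the reference field are all assumptions made for illustration.

```python
def find_superseded(docs, reference_field):
    """Return the docids that would be deleted, moved, or tagged.

    docs maps docid (int) to a dict of field values. Documents are
    visited in increasing docid order; within each group of duplicates,
    every document except the one with the highest docid is reported.
    """
    latest = {}      # reference value -> highest docid seen so far
    superseded = []  # docids that lose out to a later duplicate
    for docid in sorted(docs):
        value = docs[docid].get(reference_field)
        if value is None:
            # Assumption: a document without the field is never a duplicate.
            continue
        if value in latest:
            superseded.append(latest[value])
        latest[value] = docid
    return superseded


docs = {
    1: {"DREREFERENCE": "http://example.com/a"},
    2: {"DREREFERENCE": "http://example.com/b"},
    3: {"DREREFERENCE": "http://example.com/a"},
    5: {"DREREFERENCE": "http://example.com/a"},
}
print(find_superseded(docs, "DREREFERENCE"))  # [1, 3]
```

In this model, documents 1, 3, and 5 share a reference, so only document 5 (the one indexed last) is left untouched.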

TIP:

If you are using the wizard on the Index tab on the Console page in the Control section of IDOL Admin to submit data for IDOL Server to index, you can specify how IDOL should treat duplicate documents in the Kill Duplicates page of the wizard.

Example Usage

The following example moves all duplicate documents in the News and Blog databases to the Duplicates database:

DREDUPLICATE?ReferenceField=DREREFERENCE&DuplicateAction=database&Database=Duplicates&DatabaseMatch=News+Blog

The following example tags all duplicate documents in the News database by setting a field DuplicateFound with the value 1:

DREDUPLICATE?ReferenceField=DREREFERENCE&DuplicateAction=tag&TagField=DuplicateFound&TagValue=1&DatabaseMatch=News
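If you build these index actions programmatically, a sketch such as the following might help. It only assembles the action string from the parameters shown in the examples above; the helper name build_dreduplicate is hypothetical, and the host and index port that you send the action to depend on your deployment and are not shown here.

```python
from urllib.parse import urlencode


def build_dreduplicate(reference_field, duplicate_action, **extra):
    """Assemble a DREDUPLICATE index action string.

    reference_field and duplicate_action are emitted first, followed by
    any additional parameters in the order they are passed.
    """
    params = {
        "ReferenceField": reference_field,
        "DuplicateAction": duplicate_action,
    }
    params.update(extra)
    # safe="+" keeps the plus sign that separates database names,
    # as in DatabaseMatch=News+Blog.
    return "DREDUPLICATE?" + urlencode(params, safe="+")


move_action = build_dreduplicate(
    "DREREFERENCE", "database",
    Database="Duplicates", DatabaseMatch="News+Blog",
)
tag_action = build_dreduplicate(
    "DREREFERENCE", "tag",
    TagField="DuplicateFound", TagValue="1", DatabaseMatch="News",
)
print(move_action)
print(tag_action)
```

The two printed strings match the two example actions above.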
 

 

