The DREDUPLICATE index action performs duplicate detection on a set of documents in the index. While the KillDuplicates parameter in the DREADD index action detects duplicates during the index process, DREDUPLICATE acts on sets of documents after they have been indexed. In addition, while the KillDuplicates parameter primarily prevents duplicate documents from being indexed, DREDUPLICATE can delete duplicate documents, tag them with a prescribed value, or move them to a different database.
All DREDUPLICATE index actions must include a ReferenceField parameter. The process detects duplicates when two documents have the same value in this field. You can also add secondary filters to compare the values of an additional field.
For example, you might check whether the content of a document has changed when it is seen for the second time. In this case, the ReferenceField might be the document URL, and the secondary filter might be the document content field.
IDOL Server processes the documents in increasing docid order. It keeps the latest version of a set of duplicates, and it moves, deletes, or tags the matching documents that have a lower docid. Therefore, for any set of duplicates, only the version of the document that was indexed last is not deleted, moved, or tagged.
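The keep-latest rule described above can be illustrated with a short Python sketch. This is not how IDOL Server is implemented; it only models the documented behavior under the assumption that each document is a dictionary with a docid key and a key for the reference field. The function name split_duplicates and the sample data are hypothetical, and secondary filters are not modeled.

```python
def split_duplicates(docs, reference_field):
    """Model of the documented rule: walk documents in increasing docid
    order and, for each reference value, keep only the latest document.

    Returns (kept, superseded). The superseded documents are the ones a
    DREDUPLICATE action would delete, move, or tag.
    Illustrative only; not the IDOL Server implementation.
    """
    latest = {}       # reference value -> document with the highest docid so far
    superseded = []   # earlier documents replaced by a later duplicate
    for doc in sorted(docs, key=lambda d: d["docid"]):
        key = doc[reference_field]
        if key in latest:
            # A later document has the same reference value, so the
            # earlier one is treated as the duplicate.
            superseded.append(latest[key])
        latest[key] = doc
    kept = sorted(latest.values(), key=lambda d: d["docid"])
    return kept, superseded


# Hypothetical sample data: docids 1 and 3 share a reference value.
sample = [
    {"docid": 1, "DREREFERENCE": "http://example.com/a"},
    {"docid": 2, "DREREFERENCE": "http://example.com/b"},
    {"docid": 3, "DREREFERENCE": "http://example.com/a"},
]
kept, superseded = split_duplicates(sample, "DREREFERENCE")
print([d["docid"] for d in kept])        # [2, 3]
print([d["docid"] for d in superseded])  # [1]
```

In this sample, the document with docid 1 is the one that would be acted on, because the document with docid 3 has the same reference value and was indexed later.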
If you are using the wizard on the Index tab on the Console page in the Control section of IDOL Admin to submit data for IDOL Server to index, you can specify how IDOL Server should treat duplicate documents on the Kill Duplicates page of the wizard.
The following example moves all duplicate documents in the News and Blog databases to the Duplicates database:
DREDUPLICATE?ReferenceField=DREREFERENCE&DuplicateAction=database&Database=Duplicates&DatabaseMatch=News+Blog
The following example tags all duplicate documents in the News database by setting a field DuplicateFound with the value 1:
DREDUPLICATE?ReferenceField=DREREFERENCE&DuplicateAction=tag&TagField=DuplicateFound&TagValue=1&DatabaseMatch=News
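If you generate these actions from a script, a small helper can assemble the query string. The following Python sketch is one possible approach, not part of IDOL Server: the function name build_dreduplicate is hypothetical, and it only builds the action text. It does not send anything, because the host and index port depend on your installation. The plus sign is left unescaped so that a value such as News+Blog keeps the form used in the examples above.

```python
from urllib.parse import urlencode


def build_dreduplicate(reference_field, duplicate_action, **params):
    """Build a DREDUPLICATE index action string (hypothetical helper).

    reference_field and duplicate_action map to the ReferenceField and
    DuplicateAction parameters; any further keyword arguments are appended
    as extra parameters in the order they are given.
    """
    query = {
        "ReferenceField": reference_field,
        "DuplicateAction": duplicate_action,
    }
    query.update(params)
    # safe="+" keeps the plus sign literal, so a list of databases such as
    # News+Blog is not rewritten as News%2BBlog.
    return "DREDUPLICATE?" + urlencode(query, safe="+")


# Reproduces the two example actions shown above.
move_action = build_dreduplicate(
    "DREREFERENCE", "database",
    Database="Duplicates", DatabaseMatch="News+Blog",
)
tag_action = build_dreduplicate(
    "DREREFERENCE", "tag",
    TagField="DuplicateFound", TagValue="1", DatabaseMatch="News",
)
print(move_action)
print(tag_action)
```

Building the string from a dictionary keeps the parameter names in one place, which makes it harder to mistype a parameter when you switch between the move and tag variants.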