Deduplication Constraints
There are some constraints on deduplication when using other IDOL parameters.
Use the Combine Operation
The IDOL Content component cannot use the same ReferenceType field for deduplication as it uses for the Combine action parameter. The Combine operation occurs at query time and clashes with deduplication. If you intend to deduplicate when indexing and use the Combine action parameter, you must set up separate ReferenceType fields for these processes.
Use Deduplication with DIH Reference-Based Indexing
You can enable the DIH for reference-based indexing. For more information, refer to the DIH Administration Guide.
If you index documents into IDOL with the DIH enabled for reference-based indexing, it might prevent deduplication of documents with different references. In this case, use only one of the following deduplication options:
- KillDuplicates=REFERENCE
- KillDuplicates=NONE
Use Deduplication with DIH Field-Based Indexing
You can use field-based indexing in the DIH to ensure correct deduplication in a distributed system. For more information on configuring the DIH for field-based indexing, refer to the DIH Administration Guide.
If you set KeepExisting to False, or use KillDuplicatesDB options, it might prevent correct deduplication. To deduplicate correctly, you can distribute data by the DeDupeHash field (MD5 hash) of the documents. In this way, DIH sends all duplicates to the same child server. Setting KillDuplicates to DeDupeHash during the indexing action then ensures accurate deduplication.
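The following Python sketch illustrates why distributing by a content hash keeps duplicates together. It is a conceptual model only, not the DIH implementation: the modulo-based `child_for` rule, the `dedupe_hash` helper, and the sample documents are all assumptions made for illustration.

```python
import hashlib


def dedupe_hash(content: str) -> str:
    # Stand-in for the DeDupeHash field: an MD5 hex digest of the content.
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def child_for(value: str, num_children: int) -> int:
    # Hypothetical distribution rule (not the actual DIH algorithm):
    # hash the field value and take it modulo the number of child servers.
    # Equal field values always map to the same child.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_children


# Two documents with identical content but different references,
# plus one unrelated document.
documents = [
    {"DREREFERENCE": "http://a.example/report", "content": "Quarterly report"},
    {"DREREFERENCE": "http://b.example/copy", "content": "Quarterly report"},
    {"DREREFERENCE": "http://c.example/other", "content": "Meeting notes"},
]
for doc in documents:
    doc["DeDupeHash"] = dedupe_hash(doc["content"])


def distribute(docs, field, num_children=4):
    # Return {child index: [references]} when distributing by the given field.
    placement = {}
    for doc in docs:
        child = child_for(doc[field], num_children)
        placement.setdefault(child, []).append(doc["DREREFERENCE"])
    return placement


by_hash = distribute(documents, "DeDupeHash")
by_reference = distribute(documents, "DREREFERENCE")
print("by DeDupeHash:  ", by_hash)
print("by DREREFERENCE:", by_reference)
```

In this model, the two documents with identical content always share a child when distributed by DeDupeHash, so that child can compare them. When distributed by DREREFERENCE, nothing guarantees that they meet on the same child.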
To use a field for deduplication, you must configure it as a ReferenceType field. You do not need to configure it as ReferenceType in the DIH configuration file.
Deduplication of content occurs for all reference fields specified in a single PropertyFieldCSVs list in the IDOL Content component configuration file. To use only the DeDupeHash field to deduplicate, and not also the DREREFERENCE field, you must set these reference fields in separate field processing sections in the IDOL Content component configuration file.