Choose a Distribution Mode

The distribution mode configuration options determine how DIH divides documents between child servers. The appropriate mode to use depends on a number of factors:

the architecture

the nature of the incoming data

requirements for indexing performance

other index considerations, in particular whether and how you need to remove duplicate documents from the index.

The following flowchart describes the decision process for choosing a distribution mode. Click an option in the diagram to find more information about the corresponding configuration options. The following sections discuss this decision process in more detail.

Deduplication Considerations

In most configurations, you use Connectors to periodically check the source repository for modified documents, and index the updates. In these cases, you use the KillDuplicates options to control how IDOL treats the updated documents. If you require that the IDOL index contains only the latest version of any document, identify a reference field whose value is common to all revisions of the same document and unique to that document, and use this field for deduplication.

For deduplication to function correctly in a distributed index, the server that contains the older version of the document must receive updates when you index a new version. There are two ways to achieve this:

Route incoming data to child servers based on the identifying reference field, so that new documents always go to the server that contains any previous version. See Advanced Distribution Modes.

This option reduces the amount of data that is sent over the network and parsed by the child servers. However, it makes adding new servers a labor-intensive process. It also requires the DIH to parse the data before sending it to the child servers, which can increase the latency in the system (that is, the time taken for documents to become available for search).

Send all documents to all child servers, with each server indexing only a portion of the data. See Simple Distribution Mode. With this option, each child server deletes any existing documents that match a document in the incoming data, because the new version is indexed somewhere in the distributed system.

This option sends more data to the child servers than the advanced distribution modes, but it does not have the restrictions about adding child servers. It also provides more flexibility about how you route the data to child servers. For example, you can assign weights for how much data to index into each child server, or specify that some servers only update old data, but do not index new data.
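The two approaches above can be illustrated with a minimal Python sketch. It is not DIH code: the MD5-modulo hashing, the dictionary-based "servers", and all function names are assumptions made for illustration only, and the source does not say how DIH actually maps references to servers. The sketch shows why both approaches leave exactly one copy of a document in the system, and why changing the number of servers can disturb reference-based routing.

```python
import hashlib


def route_by_reference(reference: str, server_count: int) -> int:
    """Pick a child server index from a stable hash of the reference.

    Assumption: MD5 modulo the server count stands in for whatever
    scheme the real system uses.
    """
    digest = hashlib.md5(reference.encode("utf-8")).hexdigest()
    return int(digest, 16) % server_count


def index_by_reference(servers, doc):
    """Reference-based routing: only one server receives the document."""
    target = servers[route_by_reference(doc["reference"], len(servers))]
    # The same reference always lands on the same server, so the new
    # version replaces any older version held there.
    target[doc["reference"]] = doc["content"]


def index_send_to_all(servers, doc, owner_index):
    """Send-to-all routing: every server sees the document.

    Each server deletes its old copy; only the designated owner
    indexes the new version.
    """
    for i, server in enumerate(servers):
        server.pop(doc["reference"], None)
        if i == owner_index:
            server[doc["reference"]] = doc["content"]


def moved_after_resize(references, old_count, new_count):
    """List references whose target server changes when servers are added."""
    return [
        r for r in references
        if route_by_reference(r, old_count) != route_by_reference(r, new_count)
    ]
```

In this sketch, any reference returned by `moved_after_resize` would have its older version stranded on the previous server after a resize, which is one way to picture why adding servers is labor-intensive when routing by reference. The send-to-all variant has no such dependency, because every server gets the chance to delete its old copy.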

In other scenarios, original documents might be re-sent to IDOL, but you always want to keep the original version, or keep field values from the previous version. For example, users might have tagged or otherwise updated the document. In this case, you must use the KeepExisting or preserve field options. Because these options require documents with matching references to always be sent to the same child server, you must use an advanced distribution mode.

Other Considerations

In the cases where you never index duplicate copies of documents, or where you need to be able to search all versions of a document, deduplication is never required at index time. In this case, you can use the following guidelines to choose a distribution mode.

If you expect to index a steady stream of documents in reasonably small, evenly-sized batches, Batch Mode might be a suitable choice.

This is the most lightweight mode. However, it might be unsuitable if you index documents infrequently, in particularly large batches, or in batches of widely differing sizes. In such cases, rotating index jobs between child servers might not result in an approximately even spread of data across the servers.
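The effect of batch sizes on the spread of data can be seen in a small Python sketch. It assumes that each index job is handed in turn to the next child server, as the description of rotating index jobs suggests; the function name and the batch sizes are hypothetical.

```python
from itertools import cycle


def rotate_batches(batch_sizes, server_count):
    """Give each index job to the next server in turn.

    Returns the total number of documents that each server ends up with.
    """
    totals = [0] * server_count
    servers = cycle(range(server_count))
    for size in batch_sizes:
        totals[next(servers)] += size
    return totals


# A steady stream of evenly sized batches spreads evenly.
print(rotate_batches([100] * 9, 3))  # [300, 300, 300]

# A few batches of widely differing sizes do not.
print(rotate_batches([5000, 10, 10, 4000, 20, 10], 3))  # [9000, 30, 20]
```

In the second call, both large batches happen to fall on the first server, so it holds almost all of the data even though every server received the same number of jobs.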

Advanced distribution modes minimize the amount of data that is sent over the network and parsed by child servers, and usually result in an approximately even spread of documents, even for a small number of large batches.

Because you do not need deduplication, you do not need to be sure that duplicate documents are always routed to a particular server, which means that you can add or remove child servers as required. The main additional overhead in this mode is parsing of data by the DIH.

NOTE:

In advanced distribution modes, you cannot dynamically add child servers.

Certain features are available only in simple distribution mode and (for version 10.1.0 of DIH and later) batch mode. These include redistribution of data between active servers when one is unavailable, indexing only into child servers that are not full, and using update-only servers.
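As a rough summary, the guidance in this topic can be condensed into a short Python helper. This is a simplified sketch, not a DIH API: the function, its parameter names, and the returned labels are all hypothetical, and a real decision should also weigh architecture and performance requirements. Here `need_flexible_servers` stands for needing features such as weighted routing, update-only servers, or adding child servers without restriction.

```python
def choose_distribution_mode(keep_existing=False,
                             deduplicate=False,
                             need_flexible_servers=False,
                             steady_even_batches=False):
    """Suggest a distribution mode from a few yes/no requirements."""
    if keep_existing:
        # KeepExisting or preserve-field options need matching references
        # to reach the same child server every time.
        return "advanced"
    if deduplicate:
        # Either route by reference (advanced) or send everything to all
        # servers (simple); simple mode allows more routing flexibility.
        return "simple" if need_flexible_servers else "advanced"
    if steady_even_batches:
        # No deduplication needed and batches are small and even.
        return "batch"
    # No deduplication, irregular batches: advanced modes send the least
    # data, unless simple-mode-only features are required.
    return "simple" if need_flexible_servers else "advanced"


print(choose_distribution_mode(deduplicate=True, need_flexible_servers=True))
```

The order of the checks mirrors the order of the discussion: the requirement to keep existing versions is tested first because it leaves only one option.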
