The Index Flush Process

Flushing is the final step in the indexing process, and it is the step that makes your document data searchable. This stage updates the on-disk inverted index with the information held in the index cache, which is built by parsing the input documents.

The flushing operation is predominantly bound by disk I/O. During the flush, the IDOL Content component completely rereads and rewrites the term dictionary file to update existing entries and insert new ones. The dictionary I/O is therefore proportional to the number of unique terms in your index. Content also completely rereads and rewrites the current term posting file.

Both of these operations require several linear reads of the original files, and a linear write to the output files.

During the flush, queries continue to run against the original dictionary and posting files. When the flush is complete, queries switch to the updated versions of these files, and IDOL deletes the original files.
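The merge-and-swap pattern described above can be illustrated with a minimal Python sketch. It is not IDOL's implementation: the tab-separated "term, count" file format and the function name flush_dictionary are assumptions made for illustration, and the sketch does a single linear read where Content performs several. It shows why the work scales with the size of the existing file (every existing line is read and rewritten) and how readers keep using the old file until the new one replaces it.

```python
import os
import tempfile


def flush_dictionary(dictionary_path, cache):
    """Merge cached term counts into a sorted 'term<TAB>count' file.

    Hypothetical format for illustration only; not the IDOL file format.
    """
    pending = sorted(cache.items())
    i = 0
    dir_name = os.path.dirname(os.path.abspath(dictionary_path))
    # Write to a temporary file so readers still see the original file.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w", encoding="utf-8") as out, \
            open(dictionary_path, encoding="utf-8") as old:
        for line in old:  # linear read of every existing entry
            term, count = line.rstrip("\n").split("\t")
            count = int(count)
            # Insert new terms that sort before the current existing term.
            while i < len(pending) and pending[i][0] < term:
                out.write(f"{pending[i][0]}\t{pending[i][1]}\n")
                i += 1
            # Update the existing entry if the cache also holds this term.
            if i < len(pending) and pending[i][0] == term:
                count += pending[i][1]
                i += 1
            out.write(f"{term}\t{count}\n")
        # Append new terms that sort after every existing term.
        for term, count in pending[i:]:
            out.write(f"{term}\t{count}\n")
    # Swap the new file into place; the original file is discarded.
    os.replace(tmp_path, dictionary_path)
```

Even if the cache holds only a handful of terms, the loop still reads and rewrites every existing entry, which is the cost that the rest of this section tries to amortize.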

Flush Locking

When multiple Content components use the same backend storage device (for example, a shared physical disk or SAN device), simultaneous flushing can lead to I/O contention, which decreases the overall I/O throughput. Contention affects indexing speed, because it slows down the flush for each Content component. It also affects query speed, because I/O saturation limits the ability of the query threads to read the index files.

To mitigate I/O contention, Micro Focus recommends that you use either a flush lock file, if the file system supports file locking, or a flush lock server. These mechanisms allow you to limit how many components can flush simultaneously, while the others wait in turn.

You configure a flush lock file by setting the FlushLockFile parameter in the [Server] section of your configuration file. You configure a flush lock server (a Redis server) in the [FlushLock] section of your configuration file. For more details, refer to the IDOL Server Reference.
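To show the general idea behind a lock file, here is a minimal Python sketch of processes taking turns by holding an exclusive advisory lock on a shared file. It is a generic illustration of file locking on a POSIX system, not a description of how Content implements FlushLockFile. The names flush_lock and run_flush are hypothetical, and this sketch allows only one holder at a time.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def flush_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    with open(lock_path, "a") as handle:
        # Blocks until no other open handle holds the lock.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_flush(lock_path, do_flush):
    """Run do_flush() only while holding the shared lock."""
    with flush_lock(lock_path):
        return do_flush()
```

Every process that shares the storage device would point at the same lock path, so that a flush begins only after the previous holder has released the lock. This works only if the file system supports file locking, which is the same caveat that applies to the flush lock file.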

Flush Frequency

The flush process is more efficient if the server flushes a large amount of data in one operation, rather than in many smaller ones. You can manage this process by setting the IndexCacheMaxSize and MaxSyncDelay parameters to large values, and setting DelayedSync to True.
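A rough back-of-the-envelope model makes the benefit concrete. The Python sketch below assumes that each flush reads and rewrites the whole dictionary file exactly once and that the file size stays constant. Both are simplifications (Content performs several reads, and the file grows), so treat the result as a lower bound. The function name and the figures are illustrative only.

```python
def total_flush_io(dictionary_bytes, flushes):
    """Approximate bytes read plus bytes written across a number of flushes.

    Assumes each flush rereads and rewrites the full file once.
    """
    return flushes * 2 * dictionary_bytes


GIB = 1024 ** 3
hourly = total_flush_io(2 * GIB, 24)   # 24 flushes a day
nightly = total_flush_io(2 * GIB, 1)   # one flush a day
ratio = hourly // nightly              # 24
```

Under these assumptions, indexing the same day of data with hourly flushes costs 24 times the dictionary I/O of a single nightly flush, regardless of how little data each flush adds.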

If you need to flush more frequently to satisfy your time-to-search requirements, it might be beneficial to set up a system with a small Content server to index new data, which you flush frequently, and a larger Content server to index the same data in bulk and flush less frequently.

An example of this setup might be:

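A DIH mirrors data to both ContentA (the small server, which you flush frequently) and ContentB (the larger server, which you flush less frequently).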

Each night, a cron job or other scheduled task sends an explicit DRESYNC index action to ContentB to force the nightly flush, and sends a DREINITIAL index action to ContentA to reset it to an empty server for the next day.
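The nightly job could be sketched in Python as follows. The host names and the index port 9001 are hypothetical placeholders, and the URL form (the action name followed by a question mark, sent to the index port) is an assumption that you should check against the IDOL Server Reference for your version. DREINITIAL deletes all data on the target server, so verify the target before you schedule anything like this.

```python
from urllib.request import urlopen


def index_action_url(host, index_port, action):
    """Build the URL for an index action sent to a component's index port."""
    return f"http://{host}:{index_port}/{action}?"


def nightly_actions(bulk_host, small_host, index_port=9001):
    """Return the two URLs for the nightly job, in the order to send them."""
    return [
        # Force the bulk server (ContentB) to flush its index cache.
        index_action_url(bulk_host, index_port, "DRESYNC"),
        # Reset the small server (ContentA) to empty for the next day.
        index_action_url(small_host, index_port, "DREINITIAL"),
    ]


def send_all(urls, timeout=30):
    """Send each index action with a plain HTTP GET request."""
    for url in urls:
        with urlopen(url, timeout=timeout) as response:
            response.read()
```

A cron entry would then run a small script that calls send_all(nightly_actions("contentb.example.com", "contenta.example.com")). Index actions are typically queued, so a successful HTTP response does not by itself show that the flush has finished.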

In this case, you must use the DAH to query both Content components. You can deduplicate the query results by using the Combine query options.