Flushing is the final step in the indexing process that makes your document data searchable. This stage updates the on-disk inverted index with the information that is held in the index cache, which has been built from parsing the input documents.
The flushing operation is predominantly bound by disk I/O. During the flush, the IDOL Content component completely rereads and rewrites the term dictionary file, to update the entries and insert new ones. The I/O is therefore proportional to the number of unique terms in your index. Content also completely rereads and rewrites the current term posting file.
Both of these operations require several linear reads of the original files, and a linear write to the output files.
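As a rough illustration of why the flush cost follows the size of the dictionary rather than the size of the update, the following toy Python model merges a sorted "on-disk" term list with pending cache entries. It is a sketch only: the data layout, the function name, and the single-pass merge are assumptions made for illustration, and do not describe the actual Content file format or algorithm.

```python
def flush_dictionary(on_disk, cache):
    """Merge sorted (term, count) entries with cached counts in one linear pass.

    Every existing entry is read and written again, even when the cache
    holds only a handful of terms.
    """
    pending = sorted(cache.items())
    merged = []
    i = j = 0
    while i < len(on_disk) or j < len(pending):
        if j == len(pending) or (i < len(on_disk) and on_disk[i][0] < pending[j][0]):
            merged.append(on_disk[i])  # unchanged entry, still rewritten
            i += 1
        elif i == len(on_disk) or pending[j][0] < on_disk[i][0]:
            merged.append(pending[j])  # new term inserted in sorted order
            j += 1
        else:
            # existing term updated with the cached count
            merged.append((on_disk[i][0], on_disk[i][1] + pending[j][1]))
            i += 1
            j += 1
    return merged


old = [("apple", 2), ("cat", 5)]
new = flush_dictionary(old, {"bird": 1, "cat": 3})
```

In this model the output always contains at least as many entries as the input, so the work done per flush grows with the number of unique terms.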
During the flush, queries continue to run against the original dictionary and posting files. When the flush is complete, the queries swap to using the updated versions of these files, and IDOL deletes the original files.
When multiple Content components use the same backend storage device (for example, a shared physical disk or SAN device), simultaneous flushing can cause I/O contention, which decreases the overall I/O throughput. Contention reduces indexing speed, because it slows down the flush for each Content component, and it reduces query speed, because I/O saturation limits how quickly the query threads can read the index files.
To mitigate I/O contention, Micro Focus recommends that you use either a flush lock file, if the file system supports file locking, or a flush lock server. These mechanisms allow you to limit how many components can flush simultaneously, while the others wait in turn.
You configure flush lock files by setting the FlushLockFile parameter in the [Server] section of your configuration file. You configure a flush lock server (Redis server) by configuring the [FlushLock] section in your configuration file. For more details, refer to the IDOL Server Reference.
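The general idea behind a lock file can be sketched in Python with an advisory file lock. This is not how Content implements FlushLockFile; it is a minimal illustration, assuming a Unix-like system where fcntl.flock is available and a hypothetical flush_lock helper, of how a shared lock file lets one process proceed while the others wait in turn.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def flush_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "a") as handle:
        # Blocks until no other process (or file handle) holds the lock.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A process would wrap its I/O-heavy work in `with flush_lock("/path/to/shared.lock"):` (the path is hypothetical). The approach works only if every process opens the same file and the file system honors the lock, which is why the file system must support file locking.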
The flush process is more efficient if the server can flush large amounts of data in one go, rather than in many smaller jobs. You can manage this process by setting the IndexCacheMaxSize and MaxSyncDelay parameters to large values, and setting DelayedSync to True.
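A back-of-the-envelope calculation shows why fewer, larger flushes are cheaper. The sketch below assumes that flushes are triggered only by the MaxSyncDelay timer (not earlier by a full index cache) and that each flush rewrites the whole term dictionary once. The function names and the 2 GB dictionary size are illustrative assumptions.

```python
SECONDS_PER_DAY = 86400


def flushes_per_day(max_sync_delay_seconds):
    """Number of timer-driven flushes in 24 hours for a given MaxSyncDelay."""
    return SECONDS_PER_DAY // max_sync_delay_seconds


def daily_dictionary_rewrite_bytes(dictionary_bytes, max_sync_delay_seconds):
    """Bytes of term dictionary rewritten per day, one full rewrite per flush."""
    return dictionary_bytes * flushes_per_day(max_sync_delay_seconds)


two_gb = 2 * 1024**3
every_minute = daily_dictionary_rewrite_bytes(two_gb, 60)
every_hour = daily_dictionary_rewrite_bytes(two_gb, 3600)
```

Under these assumptions, flushing every minute rewrites the dictionary 1440 times a day, compared with 24 times when flushing hourly, for the same amount of newly indexed data.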
If you need to flush more frequently to satisfy your time-to-search requirements, it might be beneficial to set up a system with a small Content server to index new data, which you flush frequently, and a larger Content server to index the same data in bulk and flush less frequently.
An example of this setup might be:
ContentA, the small Content server, has MaxSyncDelay set to 60 (every minute).
ContentB, the larger Content server, has MaxSyncDelay set to 86400 (every 24 hours).
A DIH mirrors data to both ContentA and ContentB.
Each night, a cron job or schedule sends an explicit DRESYNC index action to ContentB to force the nightly flush, and also sends a DREINITIAL index action to ContentA to reset it to an empty server for the next day.
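The nightly job could be a short script along the following lines. This is a sketch under stated assumptions: the host names and the index port 9101 are hypothetical, and the index actions are assumed to be sent as plain HTTP requests of the form http://host:indexport/ACTION? to each server's index port.

```python
from urllib.request import urlopen

# Hypothetical hosts and index ports; replace with your own values.
CONTENT_A = ("contenta.example.com", 9101)
CONTENT_B = ("contentb.example.com", 9101)


def index_action_url(host, port, action):
    """Build the URL for an index action sent to a server's index port."""
    return f"http://{host}:{port}/{action}?"


def nightly_actions():
    """Return the index action URLs in the order the nightly job sends them."""
    return [
        index_action_url(*CONTENT_B, "DRESYNC"),     # force the nightly flush
        index_action_url(*CONTENT_A, "DREINITIAL"),  # reset the small server
    ]


def run(send=urlopen):
    """Send each index action; `send` can be swapped out for testing."""
    for url in nightly_actions():
        with send(url, timeout=30) as response:
            response.read()
```

A cron entry would call run() once per night. The sketch does not wait for the flush on ContentB to finish before it resets ContentA; in a real deployment you might want to confirm that the flush has completed first, so that the day's data is searchable on ContentB before it disappears from ContentA.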
In this case, you must use the DAH to query both Content components. You can deduplicate the query results by using the Combine query options.
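For illustration, a query sent through the DAH might be built as follows. The host name, the port 9060, and the value Simple for Combine are assumptions made for this sketch; check the Combine options in the reference for the value that suits your data.

```python
from urllib.parse import urlencode

# Hypothetical DAH host and ACI port.
DAH_HOST = "dah.example.com"
DAH_PORT = 9060


def dah_query_url(text, combine="Simple"):
    """Build a Query action URL for the DAH that asks for combined results."""
    params = urlencode({"action": "Query", "Text": text, "Combine": combine})
    return f"http://{DAH_HOST}:{DAH_PORT}/{params}"


url = dah_query_url("flush lock")
```

Because the same document can exist in both ContentA and ContentB during the day, sending the query without a Combine option could return it twice.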