Repository Storage Mode

CAUTION:

RepositoryStorage is an advanced configuration option. Do not change this parameter value unless it has been recommended by HPE Big Data Platform Support.

The RepositoryStorage configuration parameter controls the way that the IDOL Content component stores and accesses the dynterm index, which contains indexed terms and their document occurrence information.

In IDOL 10.0 and later, repository storage mode is on by default (RepositoryStorage=True). This is the recommended setting. For most IDOL setups, you need to consider RepositoryStorage only when you upgrade from IDOL version 7.x, which used a different default index format.

In most cases, HPE recommends that you upgrade IDOL by indexing your content into a clean installation, to take advantage of the newer index format. However, if you want to upgrade your IDOL components to a recent version without reindexing, you must set RepositoryStorage to False in your configuration to remain consistent with the earlier index format. For more information, see Upgrade IDOL Server Components.

HPE strongly recommends that you use repository storage mode for your index. This storage mode offers:

reduced disk footprint.
improved data safety.
improved query performance, particularly during indexing.
comparable indexing performance for most use cases.
better indexing performance for very large numbers of terms and occurrences in your index.
sequential disk I/O during indexing. Sequential I/O is generally more efficient than random disk I/O.

The following sections provide more detail about the differences between the two formats.

Disk Footprint

Repository storage mode stores the index more efficiently, so it uses less disk space than an equivalent index in the old format.

The old index format stores term postings in blocks to allow easier updates. However, this method can result in a large amount of unused space in the index files.

Repository storage mode uses only the required amount of space for the content, with much less waste.

Term Storage

The old index format stores terms in the order that they are added. When you add a new term, Content either appends it to the term file, or fills an existing gap (for example, one left by deleted terms).

In repository storage mode, Content stores terms in a fixed order.

Data Safety

Repository storage mode is much more resilient to indexing interruption, such as power failures. During the index flush, Content writes the new term and postings files, and then switches to them. If the write process is interrupted, the index is not updated, but existing data is not affected.
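
The write-then-switch behavior described above can be illustrated with a minimal Python sketch. This is a generic pattern, not the way that Content implements its flush: the flush_atomically function and the file name in the usage example are hypothetical, and the sketch assumes that the temporary file and the target file are on the same file system.

```python
import os
import tempfile


def flush_atomically(path, data):
    """Write data to a new file, then switch to it in a single step.

    Hypothetical illustration only; it does not reflect IDOL internals.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new file next to the existing one so the switch is a rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        # Readers see either the old file or the new one, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        # The write was interrupted: discard the new file, keep the old one.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

For example, calling flush_atomically("terms.db", b"...") replaces terms.db only after the new data has been written in full. If the write fails part way through, the existing terms.db is left untouched.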

Indexing Performance

The fixed term ordering means that Content must completely rewrite the term file every time you add content to the index. For a large index with many unique terms, this might reduce the indexing speed compared to the old index format, particularly if you index a small number of documents very frequently.

You can mitigate this performance impact by using the DelayedSync and IndexCacheMaxSize configuration parameters. Together, these options control how often Content rewrites the index term files.
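
As a rough illustration of why less frequent rewrites help, the following Python sketch counts how many term entries are written when the whole term file is rewritten at each flush. It is a simplified model under stated assumptions: the terms_written function and its flush_every argument are hypothetical, and they do not represent the actual semantics of DelayedSync or IndexCacheMaxSize.

```python
def terms_written(batches, flush_every):
    """Count term entries written when every flush rewrites the full term file.

    Simplified, hypothetical model: batches is a list of term collections,
    and a flush happens after every flush_every batches.
    """
    index = set()
    pending = 0
    written = 0
    for batch in batches:
        index.update(batch)
        pending += 1
        if pending == flush_every:
            written += len(index)  # the whole term file is rewritten
            pending = 0
    if pending:
        written += len(index)  # final flush for any remaining batches
    return written


# 100 small batches, each adding 10 previously unseen terms.
batches = [[f"term{b}_{i}" for i in range(10)] for b in range(100)]
print(terms_written(batches, flush_every=1))   # 50500
print(terms_written(batches, flush_every=10))  # 5500
```

In this model, flushing after every batch writes roughly nine times as many term entries as flushing after every tenth batch, although the final index is identical.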

NOTE:

When you need to index small batches of documents frequently, and to make those documents available for query immediately, repository storage mode might be slower than the old index format.

In this scenario, you might want to use a small index in the older format to process new content quickly. You can then export the content into a main index in repository storage mode, in larger batches. Contact HPE Big Data Platform Technical Support for advice.

In repository storage mode, Content mitigates the effect of rewriting large term files by splitting the term postings information file when it gets too large. This means that the number of term occurrences does not have a significant negative impact on the indexing speed.

Query Performance

The fixed term ordering in repository storage mode generally improves lookup speed during the query process, although it might read more data from disk than the old index format.
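
One common way that a fixed ordering speeds up lookups is binary search. The following Python sketch shows the idea on an in-memory list. It is an assumption for illustration only: this page does not state which lookup method Content uses, and the find_term function is hypothetical.

```python
from bisect import bisect_left


def find_term(sorted_terms, term):
    """Return the position of term in a sorted list, or -1 if it is absent.

    Binary search examines about log2(n) entries instead of scanning all n.
    """
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1


terms = sorted(["index", "query", "term", "content"])
print(find_term(terms, "query"))    # 2
print(find_term(terms, "missing"))  # -1
```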

In repository storage mode, there is also less competition for resources between the indexing and query processes. Content does not switch from the existing term files until the new ones have been written, so the indexing and query processes use different term files.

In general, repository storage mode improves query performance, particularly during indexing. However, if you have so many term occurrences that Content splits the term postings file, the performance improvements are not as great.

Disk I/O

The impact on disk I/O is most significant at the end of the indexing process, when Content flushes the index cache to disk.

In repository storage mode, the amount of data that Content must write depends on the total number of terms and the number of postings in the last term postings file. These term files can get very large. However, the write uses sequential disk I/O, which is usually very efficient.

In the older index format, the amount of data to write is generally much smaller (it depends mainly on the number of new terms and postings). However, there is a lot more random I/O and seeking during the indexing process.

The speed of the flush depends on your system, but in most cases repository storage mode has comparable performance to the old format.

DiskHash Limitations

The old index format uses a disk hash, which determines the number of hash values that Content can store in the dynterm index. You can configure the maximum value by using the DiskHash configuration parameter. You cannot change the disk hash size after you index data.

When the number of terms in your index approaches or exceeds the disk hash size, Content must store multiple terms with the same hash. In this case, Content must use multiple disk reads to find terms in the index, which significantly decreases the indexing performance as the number of terms increases.

Repository storage mode does not have a disk hash limitation. The number of terms can increase dynamically with less impact on the indexing performance.
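
The effect of a fixed hash size can be illustrated with a simplified Python model, in which each stored entry that must be examined stands in for a disk read. This is a sketch under stated assumptions, not the on-disk layout that Content uses: the average_probes function, the SHA-256 hash, and the bucket counts are all hypothetical.

```python
import hashlib


def average_probes(num_terms, num_buckets):
    """Average entries examined per successful lookup in a fixed-size hash.

    Hypothetical model: terms that share a hash value are stored together,
    and finding one of k such terms examines (k + 1) / 2 entries on average.
    """
    counts = [0] * num_buckets
    for i in range(num_terms):
        digest = hashlib.sha256(f"term{i}".encode()).digest()
        counts[int.from_bytes(digest[:8], "big") % num_buckets] += 1
    total = sum(k * (k + 1) // 2 for k in counts)
    return total / num_terms


# Same fixed hash size, increasing number of terms.
for n in (256, 1024, 8192):
    print(n, round(average_probes(n, num_buckets=1024), 2))
```

In this model, the average cost stays close to one entry per lookup while the number of terms is well below the hash size, and grows steadily after the number of terms exceeds it.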
