IDOL Internal Storage and Indexes

Although IDOL Server does not have a database schema, it is possible to optimize the storage of certain types of field. Depending on the types of fields that you configure, you will observe different files and folders in your IDOL directory.

The IDOL Server index is divided into several subindexes, which store content from different types of field. The following sections describe each subindex, its memory requirements, and its impact on performance.

IndexCache and IndexTemp

The IndexCache, together with the IndexTemp directory, stores an intermediate form of the information used to build the IDOL dynterm index. The IndexCache structure itself is not persistent; IDOL Server converts its data into persistent form when it flushes the IndexCache to generate the dynterm structure. The IndexTemp directory contains data only for the duration of an index flush.

The IndexCache memory usage is configured by the IndexCacheMaxSize configuration parameter. Increasing the size of the index cache means that IDOL Server can process more documents before it must flush to disk.

The index cache stores the unique terms that IDOL has processed since the last flush, plus the new document IDs, positions, and other occurrence information for those terms.
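The buffering-and-flush behavior described above can be pictured with a small conceptual sketch. It is not IDOL Server's implementation: the class name, the whitespace tokenizer, and the max_postings limit (a stand-in for a memory limit such as IndexCacheMaxSize) are all assumptions made for illustration.

```python
from collections import defaultdict


class IndexCacheSketch:
    """Toy model: buffer term occurrences in memory, flush when a limit is hit."""

    def __init__(self, max_postings):
        # Hypothetical limit, counted in postings rather than bytes.
        self.max_postings = max_postings
        self.cache = defaultdict(list)       # term -> [(doc_id, position), ...]
        self.cached_postings = 0
        self.persistent = defaultdict(list)  # stand-in for the on-disk dynterm
        self.flushes = 0

    def add_document(self, doc_id, text):
        # Record each term with its document ID and position.
        for position, term in enumerate(text.lower().split()):
            self.cache[term].append((doc_id, position))
            self.cached_postings += 1
        # A larger limit lets more documents accumulate before a flush.
        if self.cached_postings >= self.max_postings:
            self.flush()

    def flush(self):
        # Merge the transient cache into the persistent structure, then empty it.
        for term, postings in self.cache.items():
            self.persistent[term].extend(postings)
        self.cache.clear()
        self.cached_postings = 0
        self.flushes += 1


cache = IndexCacheSketch(max_postings=4)
cache.add_document(1, "red fox")        # 2 postings buffered, no flush yet
cache.add_document(2, "red dog runs")   # 5 postings buffered, triggers a flush
```

In this toy model, raising max_postings reduces the number of flushes for the same documents, which mirrors the trade-off between cache memory and flush frequency.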

Dynterm and Unstemmed

The dynterm is the persistent version of the IndexCache information, storing all the indexed terms and their document occurrence information. It consists primarily of a dictionary (the unique terms and metadata information about each term) and postings (the document IDs that the terms occur in, plus information on the occurrences of those terms).

Each unique term uses a fixed amount of disk space in the dictionary file (the record_size in bytes as reported by the GetStatus action, and controlled by the TermSize configuration parameter). The disk space usage for the postings information is, depending on configuration, between 4 and 16 bytes per occurrence of a term.
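As a rough sizing aid, the figures above can be combined into a back-of-the-envelope estimate. The function below is a sketch only: the input counts are hypothetical, the record_size value must be read from your own GetStatus output, and the count of term occurrences is assumed to be the total number of postings entries.

```python
def estimate_dynterm_disk_bytes(unique_terms, record_size, term_occurrences):
    """Return a (low, high) disk estimate in bytes for dictionary plus postings."""
    # Dictionary: one fixed-size record per unique term.
    dictionary = unique_terms * record_size
    # Postings: between 4 and 16 bytes per occurrence, depending on configuration.
    postings_low = term_occurrences * 4
    postings_high = term_occurrences * 16
    return dictionary + postings_low, dictionary + postings_high


# Hypothetical inputs; substitute the values observed on your own server.
low, high = estimate_dynterm_disk_bytes(
    unique_terms=2_000_000,
    record_size=32,
    term_occurrences=500_000_000,
)
print(f"{low / 1e9:.2f} GB to {high / 1e9:.2f} GB")
```

In this example the postings dominate the total, which is typical whenever term occurrences greatly outnumber unique terms.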

The dynterm structure itself does not consume any significant amount of memory at rest.

There are a number of ways to configure and control the terms that get indexed. Minimizing the amount of noise and unwanted terms in the index improves performance. You can perform some of these improvements in components such as CFS, which process the data earlier in the indexing pipeline. In the Content component, the StopList, IndexNumbers (and related settings), and ProperNames configuration parameters are the most common options for adjusting the terms that you index. You can use the TermGetAll action to help analyze which terms are in the IDOL index. As an alternative to the action, you can view information about terms in the indexed data on the Terms tab of the Performance page in the Monitor section of IDOL Admin.

The dynterm directory also contains the unstemmed structure, which stores the full unstemmed version of terms in a memory mapped structure. Its memory and disk usage is therefore of the same order as that of the unique unstemmed terms, though the structure optimizes prefix matching, and so it stores common prefixes more efficiently. IDOL Server uses the unstemmed structure for wildcard searches, spelling correction, and fuzzy expansion.

As with the dynterm, minimizing the amount of data in the unstemmed index can improve performance. The UnstemmedMinDocOccs parameter allows you to filter out very rare terms from the unstemmed index (although for some applications, such as legal search, this might not be appropriate). UnstemmedIndexNumbers and related settings can also prevent the unstemmed index from filling up with purely numeric or alphanumeric terms.

Index fields generate data for the dynterm and unstemmed structures. You should configure fields as Index fields if they contain data that requires conceptual searching, such as the body of an e-mail. It is typically better to store a highly structured field (like a document date) as one of the optimized types if it is required for search, and to make use of a FieldText or metadata search parameter. For example, in the case of a document date, you can use the DateType or NumericDateType field properties.

For more information about the dynterm storage mode, see Repository Storage Mode.

TermCache

TermCache memory is transient, representing the memory in use by the server threads as they load the dynterm information during query processing. This transient size depends on the terms that you query for.

Nodetable

The nodetable structure and directory store both metadata information about each document section and the physical representation of each document. IDOL Server stores the metadata information (for example, date, database, or fieldcheck) in a memory mapped structure, with each section consuming 64 bytes. The physical representation is approximately equal in size to the original IDX or XML format.
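The 64 bytes per section figure gives a simple way to estimate the memory mapped metadata size. The following sketch assumes that you know, or can approximate, the average number of sections per document; the function name and the example counts are hypothetical.

```python
NODETABLE_BYTES_PER_SECTION = 64  # per-section metadata size stated above


def nodetable_metadata_bytes(documents, avg_sections_per_document=1.0):
    """Estimate the memory mapped nodetable metadata size in bytes."""
    sections = documents * avg_sections_per_document
    return int(sections * NODETABLE_BYTES_PER_SECTION)


# Hypothetical index: 10 million documents averaging 1.5 sections each.
metadata_bytes = nodetable_metadata_bytes(10_000_000, avg_sections_per_document=1.5)
print(f"{metadata_bytes / 1e6:.0f} MB of nodetable metadata")
```

This estimate covers only the metadata. The physical representation is sized separately, at roughly the size of the original IDX or XML.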

The nodetable metadata is primarily used for fast filtering checks, such as DatabaseMatch, MinDate, MaxDate, and FieldCheck.

IDOL Server primarily uses the physical representation to print results. It can also be used to perform unoptimized FieldText matching.

IDOL Server uses the nodetable data to perform a MATCH FieldText search on a field that is not configured as any other type. Micro Focus generally recommends that you do not rely on this process, because it usually means that IDOL must perform a large number of loads from disk to find the documents that you want. Equally, do not make every field that contains a number NumericType if you use it only when printing the document content.

The NodeTableStoreContent parameter and the StoredType property allow you to choose which fields to store in the index. You must store the content for some functionality, such as AQG. You can still use highlighting and summarization functionality even if the content is not stored locally in the index, by sending the data to highlight or summarize back to the server. You can use the Regenerate settings only for fields that are StoredType.

You can use the NodeTableCompression configuration parameter to compress the documents in the nodetable on disk. In this case, IDOL Server compresses data in the nodetable directory before storing it, reducing the IDOL Server disk footprint.

Numeric and NumericDate

The numeric structure gives a fast lookup of numeric value to document ID. Each numeric value stored uses approximately 16 bytes. A numeric field is normally wholly memory mapped, but you can limit the memory by using the NumericNormalMaxMem property for a field that is also NumericType. Making a field NumericType greatly speeds up the FieldText operators EQUAL, GREATER, LESS, NRANGE, NOTEQUAL, and BIAS, and the geospatial FieldText operators DISTCARTESIAN, DISTSPHERICAL, BIASDISTCARTESIAN, and BIASDISTSPHERICAL. Sorting on a numeric field is also optimized.
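To see why a value-to-document lookup speeds up comparison and range operators, consider the following conceptual sketch. It is not IDOL Server's internal format: the class, its method names, and the inclusive range bounds are assumptions chosen for illustration. It keeps (value, document ID) pairs sorted so that a range query becomes two binary searches rather than a scan of every document.

```python
import bisect


class NumericIndexSketch:
    """Toy sorted index from numeric value to document ID."""

    def __init__(self, pairs):
        # pairs: iterable of (value, doc_id); sorting orders entries by value.
        self._entries = sorted(pairs)
        self._values = [value for value, _ in self._entries]

    def range_lookup(self, low, high):
        """Return doc IDs with low <= value <= high (inclusive bounds assumed)."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return [doc_id for _, doc_id in self._entries[start:end]]

    def greater(self, threshold):
        """Return doc IDs whose value is strictly greater than threshold."""
        start = bisect.bisect_right(self._values, threshold)
        return [doc_id for _, doc_id in self._entries[start:]]


index = NumericIndexSketch([(40, 4), (10, 1), (25, 3), (25, 2)])
print(index.range_lookup(20, 30))
print(index.greater(25))
```

Because the entries are already ordered by value, the same structure also makes sorting on the field inexpensive.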

NumericDateType fields build an optimized index, which converts a date into an internal numeric autndate format and stores this in the numeric structure. Its memory usage is identical to that of a numeric field. This optimizes the GTNOW, LTNOW, and RANGE FieldText specifiers, and sorting on that field.

Match

The Match structure is a wrapper to the numeric structure, used for MatchType fields. Each unique value is mapped to an integer and then indexed into a numeric structure.

The memory requirement for Match is equal to Numeric, with the addition of a value mapping. The value mapping size is proportional to the size of all unique match values that have been indexed. You can limit the memory usage for a MatchType field by also using the NumericNormalMaxMem property.
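The value mapping idea can be illustrated with a minimal sketch. This is a conceptual model under stated assumptions, not IDOL Server's implementation: each unique string is assigned an integer on first sight, and documents are then indexed against that integer, so the mapping grows with the set of unique values while the per-document cost stays that of a numeric entry.

```python
class MatchIndexSketch:
    """Toy model: map unique values to integers, then index the integers."""

    def __init__(self):
        self._value_to_id = {}  # value mapping: unique string -> integer
        self._id_to_docs = {}   # stand-in for the numeric structure

    def add(self, doc_id, value):
        # Assign the next integer to a value the first time it is seen.
        value_id = self._value_to_id.setdefault(value, len(self._value_to_id))
        self._id_to_docs.setdefault(value_id, []).append(doc_id)

    def match(self, value):
        """Return the doc IDs whose field equals value exactly."""
        value_id = self._value_to_id.get(value)
        if value_id is None:
            return []
        return list(self._id_to_docs[value_id])

    def unique_values(self):
        return len(self._value_to_id)


index = MatchIndexSketch()
index.add(1, "invoice")
index.add(2, "contract")
index.add(3, "invoice")
print(index.match("invoice"))
print(index.unique_values())
```

A field with few distinct values therefore adds little beyond the numeric cost, whereas a field where almost every document has a different value makes the mapping itself large.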

Parametric

The parametric structure is used to store an optimized lookup of document ID to values in that document. It includes a value mapping and an explicit document file.

The value mapping contains details of the values that the parametric fields in your index contain. The value mapping is wholly memory mapped and its size is proportional to the size of all unique parametric values that have been indexed into the server.

The explicit document file contains information about the parametric fields and values that occur in each document. You can use the ParametricMemoryMaxSize parameter to memory limit the explicit document file. Otherwise, it is mapped into memory for performance. Each parametric value in a document uses up to 8 bytes of storage in the explicit document file.
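Because each parametric value in a document uses up to 8 bytes, an upper bound on the explicit document file follows directly from the number of parametric values per document. The sketch below assumes that you can count or sample those values; the function name and the sample counts are hypothetical, and the value mapping is not included.

```python
PARAMETRIC_MAX_BYTES_PER_VALUE = 8  # upper bound per value stated above


def explicit_document_file_max_bytes(values_per_document):
    """Upper bound in bytes, given the parametric value count of each document."""
    return sum(values_per_document) * PARAMETRIC_MAX_BYTES_PER_VALUE


# Hypothetical sample: three documents with 3, 0, and 5 parametric values.
print(explicit_document_file_max_bytes([3, 0, 5]))
```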

Sort

The sort structure stores an index for each configured SortType field to optimize sorting on those values. A SortType field uses slightly more than SortFieldStorageLength bytes per document, irrespective of whether the document has a value or if the value is smaller than the configured storage length. Micro Focus recommends that you use SortType fields if:

Most of your documents have a value in the SortType field.
Most of the values have a common prefix that you can remove by using SortFieldPrefixCSVs.
The values (after removing any common prefix) all fit in the configured SortFieldStorageLength.

For other cases it is typically better to use a NumericType, NumericDateType, or MatchType value if you commonly use the field for sorting.
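The three recommendations above can be checked against a sample of field values before you commit to a SortType field. The following helper is a sketch, not an IDOL tool: its name and return keys are hypothetical, it assumes values are measured as UTF-8 bytes, and it treats the storage length and prefix as the values you intend to set for SortFieldStorageLength and SortFieldPrefixCSVs.

```python
def sort_field_check(values, total_documents, storage_length, prefix=""):
    """Summarize how well a sample of field values suits a SortType field.

    values: the field values that are present (one per document with a value).
    """
    def strip(value):
        # Remove the common prefix only where it is actually present.
        return value[len(prefix):] if prefix and value.startswith(prefix) else value

    stripped = [strip(value) for value in values]
    with_prefix = sum(1 for value in values if prefix and value.startswith(prefix))
    return {
        # Share of documents that have a value at all.
        "coverage": len(values) / total_documents if total_documents else 0.0,
        # Share of values that begin with the removable prefix.
        "prefix_share": with_prefix / len(values) if values else 0.0,
        # Whether every stripped value fits in the configured storage length.
        "all_fit": all(len(s.encode("utf-8")) <= storage_length for s in stripped),
        # Lower bound: every document pays the storage length, value or not.
        "min_index_bytes": total_documents * storage_length,
    }


report = sort_field_check(
    ["INV-0001", "INV-0002", "INV-0150"],
    total_documents=4,
    storage_length=4,
    prefix="INV-",
)
print(report)
```

Low coverage or values that do not fit suggest that a NumericType, NumericDateType, or MatchType field is the better choice, because the SortType cost is paid per document regardless of content.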
