Although IDOL Server does not have a database schema, it is possible to optimize the storage of certain types of field. Depending on the types of fields that you configure, you will observe different files and folders in your IDOL directory.
The IDOL Server index divides into several subindexes, which store content from different types of field. The following sections describe the different subindexes, and the memory requirements and impact of the different indexes.
The IndexCache (and IndexTemp directory) stores an intermediate form of information used to build up the IDOL dynterm index. The IndexCache structure is non-persistent (but the data is eventually converted into its persistent form when IDOL Server flushes the IndexCache to generate the dynterm structure). The IndexTemp directory only contains data for the duration of an index flush.
The IndexCache memory usage is configured by the IndexCacheMaxSize configuration parameter. Increasing the size of the index cache means that IDOL Server can process more documents before it must flush to disk.
The index cache stores the unique terms that IDOL has processed since the last flush, plus the new document IDs, positions, and other occurrence information for those terms.
The dynterm is the persistent version of the IndexCache information, storing all the indexed terms and their document occurrence information. It consists primarily of a dictionary (the unique terms and metadata information about each term) and postings (the document IDs that the terms occur in, plus information on the occurrences of those terms).
Each unique term uses a fixed amount of disk space in the dictionary file (the record_size in bytes as reported by the GetStatus action, and controlled by the TermSize configuration parameter). The disk space usage for the postings information is (depending on configuration) between 4 and 16 bytes per occurrence of a term.
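The figures above can be turned into a rough disk estimate for the dynterm. The following Python sketch is illustrative only: the function name and the example inputs (including the record_size value of 40) are assumptions, and the real values should come from the GetStatus action and your own index statistics.

```python
def dynterm_disk_bytes(unique_terms, record_size, term_occurrences):
    """Return a (low, high) estimate of dynterm disk usage in bytes.

    unique_terms     -- number of unique terms in the dictionary
    record_size      -- bytes per dictionary record (as reported by GetStatus)
    term_occurrences -- total number of term occurrences in the postings
    """
    dictionary = unique_terms * record_size
    postings_low = term_occurrences * 4    # 4 bytes per occurrence
    postings_high = term_occurrences * 16  # 16 bytes per occurrence
    return dictionary + postings_low, dictionary + postings_high


# Hypothetical index: 2 million unique terms, 500 million occurrences.
low, high = dynterm_disk_bytes(
    unique_terms=2_000_000, record_size=40, term_occurrences=500_000_000
)
print(f"{low / 1e9:.2f} GB to {high / 1e9:.2f} GB")
```

In this example the postings dominate the estimate, which is why reducing unwanted terms in the index has a large effect on disk usage.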
The dynterm structure itself does not consume any significant amount of memory at rest.
There are a number of ways to configure and control the terms that get indexed. Minimizing the amount of noise and unwanted terms in the index improves performance. You can perform some of these improvements in components such as CFS, which process the data earlier in the indexing pipeline. In the Content component, the StopList, IndexNumbers (and related settings), and ProperNames configuration parameters are the more common options available to adjust the terms that you index. You can use the TermGetAll action to help analyze what terms are in the IDOL index. As an alternative to using the action, you can also view information about terms in the indexed data on the Terms tab on the Performance page in the Monitor section of IDOL Admin.
The dynterm directory also contains the unstemmed structure, which stores the full unstemmed version of terms in a memory mapped structure. Its memory and disk usage is therefore of the same order as that of the unique unstemmed terms, though the structure optimizes prefix matching, and so it stores common prefixes more efficiently. IDOL Server uses the unstemmed structure for wildcard searches, spelling correction, and fuzzy expansion.
As with the dynterm, minimizing the amount of data in the unstemmed index can improve performance. The UnstemmedMinDocOccs parameter allows you to filter out very rare terms from the unstemmed index (although for some applications, such as legal search, this might not be appropriate). UnstemmedIndexNumbers and related settings can also prevent the unstemmed index from filling up with purely numeric or alphanumeric terms.
Index fields generate data for the dynterm and unstemmed structures. You should configure fields as Index fields if they contain data that requires conceptual searching, such as the body of an e-mail. It is typically better to store a highly structured field (like a document date) as one of the optimized types if it is required for search, and to make use of a FieldText or metadata search parameter. For example:
In the case of a document date, you can use the DateType or NumericDateType field properties.
For more information about the dynterm storage mode, see Repository Storage Mode.
TermCache memory is transient, representing the memory in use by the server threads as they load the dynterm information during query processing. This transient size depends on the terms that you query for.
The nodetable structure and directory store both metadata information about each document section, and the physical representation of each document. IDOL Server stores the metadata information (for example, date, database, or fieldcheck) in a memory mapped structure, with each section consuming 64 bytes. The physical representation is approximately equal in size to the original IDX or XML format.
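As a rough guide, the nodetable footprint follows directly from these two figures. The Python sketch below is an illustration under stated assumptions: the function name and the example section count and source size are hypothetical, and the content figure ignores any effect of compression or of choosing not to store some fields.

```python
SECTION_METADATA_BYTES = 64  # memory mapped metadata per document section


def nodetable_estimate(sections, source_bytes):
    """Estimate nodetable usage.

    sections     -- number of document sections in the index
    source_bytes -- total size of the original IDX or XML data
    """
    return {
        # Memory mapped metadata: 64 bytes for each section.
        "metadata_bytes": sections * SECTION_METADATA_BYTES,
        # Physical representation: approximately the size of the source.
        "content_bytes": source_bytes,
    }


# Hypothetical index: 10 million sections built from 50 GB of IDX data.
estimate = nodetable_estimate(sections=10_000_000, source_bytes=50_000_000_000)
print(estimate["metadata_bytes"] / 1e6, "MB of mapped metadata")
```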
The nodetable metadata is primarily used for fast filtering checks, such as DatabaseMatch, MinDate, MaxDate, and FieldCheck.
IDOL Server primarily uses the physical representation to print results. It can also be used to perform unoptimized FieldText matching.
IDOL Server uses the nodetable data to perform a MATCH FieldText search on a field that is not specified as any other type.
Micro Focus generally recommends that you do not use this process, because it usually means that IDOL must perform a large number of loads from disk to find the documents that you want. Equally, do not configure every field that contains a number as NumericType if you only use those fields when printing the document content.
The NodeTableStoreContent parameter and the StoredType property allow you to choose which fields to store in the index. You must store the content for some functionality, such as AQG. You can still use highlighting and summarization functionality even if the content is not stored locally in the index, by sending the data to highlight or summarize back to the server. You can use the Regenerate settings only for fields that are StoredType.
You can use the NodeTableCompression configuration parameter to compress the documents in the nodetable on disk. In this case, IDOL Server compresses data in the nodetable directory before storing it, reducing the IDOL Server disk footprint.
The numeric structure gives a fast lookup of numeric value to document ID. Each numeric value stored uses approximately 16 bytes. A numeric field is normally wholly memory mapped, but you can limit the memory by using the NumericNormalMaxMem property for a field that is also NumericType. Making a field NumericType greatly speeds up the FieldText operators EQUAL, GREATER, LESS, NRANGE, NOTEQUAL, and BIAS, and the geospatial FieldText operators DISTCARTESIAN, DISTSPHERICAL, BIASDISTCARTESIAN, and BIASDISTSPHERICAL. Sorting on a numeric field is also optimized.
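Because each stored value uses approximately 16 bytes, you can approximate the mapped size of the numeric structure by counting the values in each NumericType field. The Python sketch below is illustrative: the field names and value counts are hypothetical, and the result is an approximation that does not account for any NumericNormalMaxMem limit.

```python
NUMERIC_VALUE_BYTES = 16  # approximate bytes per stored numeric value


def numeric_memory_bytes(values_per_field):
    """Return the approximate numeric structure size for each field.

    values_per_field -- mapping of field name to number of stored values
    """
    return {
        field: count * NUMERIC_VALUE_BYTES
        for field, count in values_per_field.items()
    }


# Hypothetical NumericType fields and value counts.
per_field = numeric_memory_bytes({"PRICE": 5_000_000, "RATING": 1_000_000})
total = sum(per_field.values())
print(per_field, total)
```

An estimate like this can help you decide which fields justify the memory cost of NumericType and which are only printed and so do not need it.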
NumericDateType fields build an optimized index, which converts a date into an internal numeric autndate format and stores this in the numeric structure. Its memory usage is identical to that of a numeric field. This optimizes the GTNOW, LTNOW, and RANGE FieldText specifiers, and sorting on that field.
The Match structure is a wrapper to the numeric structure, used for MatchType fields. Each unique value is mapped to an integer and then indexed into a numeric structure.
The memory requirement for Match is equal to that of Numeric, with the addition of a value mapping. The value mapping size is proportional to the size of all unique match values that have been indexed. You can limit the memory usage for a MatchType field by also using the NumericNormalMaxMem property.
The parametric structure is used to store an optimized lookup of document ID to values in that document. It includes a value mapping and an explicit document file.
The value mapping contains details of the values that the parametric fields in your index contain. The value mapping is wholly memory mapped and its size is proportional to the size of all unique parametric values that have been indexed into the server.
The explicit document file contains information about the parametric fields and values that occur in each document. You can use the ParametricMemoryMaxSize parameter to limit the memory that the explicit document file uses. Otherwise, it is mapped into memory for performance. Each parametric value in a document uses up to 8 bytes of storage in the explicit document file.
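The 8 bytes per value figure gives an upper bound on the size of the explicit document file. The following Python sketch shows the arithmetic; the function name, document count, and average number of parametric values per document are assumptions chosen for illustration, and the value mapping is not included.

```python
PARAMETRIC_VALUE_MAX_BYTES = 8  # upper bound per parametric value in a document


def explicit_document_file_max_bytes(documents, avg_values_per_document):
    """Upper bound on explicit document file size in bytes.

    documents               -- number of documents in the index
    avg_values_per_document -- average parametric values in each document
    """
    total_values = documents * avg_values_per_document
    return total_values * PARAMETRIC_VALUE_MAX_BYTES


# Hypothetical index: 20 million documents, 12 parametric values each.
upper_bound = explicit_document_file_max_bytes(20_000_000, 12)
print(f"at most {upper_bound / 1e9:.2f} GB")
```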
The sort structure stores an index for each configured SortType field to optimize sorting on those values. A SortType field uses slightly more than SortFieldStorageLength bytes per document, irrespective of whether the document has a value or if the value is smaller than the configured storage length. Micro Focus recommends that you use SortType fields if:
Most of your documents have a value in the SortType field.
Most of the values have a common prefix that you can remove by using SortFieldPrefixCSVs.
The values (after removing any common prefix) all fit in the configured SortFieldStorageLength.
In other cases, it is typically better to use a NumericType, NumericDateType, or MatchType field if you commonly use the field for sorting.
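One way to check a candidate field against the recommendations above is to sample its values and measure them. The Python sketch below is a simplified illustration, not a description of how IDOL Server processes SortType fields: the helper names and sample data are hypothetical, prefix removal is modeled as stripping the first matching prefix, and value length is assumed to be measured in UTF-8 bytes.

```python
def strip_prefix(value, prefixes):
    """Remove the first matching prefix from value, if any."""
    for prefix in prefixes:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def sort_field_report(values, prefixes, storage_length):
    """Summarize how well a field suits SortType.

    values         -- one entry per document, None where the field is missing
    prefixes       -- common prefixes to remove (assumed behavior)
    storage_length -- the configured storage length in bytes
    """
    present = [v for v in values if v is not None]
    stripped = [strip_prefix(v, prefixes) for v in present]
    return {
        "fraction_with_value": len(present) / len(values) if values else 0.0,
        "all_fit": all(len(s.encode("utf-8")) <= storage_length for s in stripped),
        # Every document uses the storage length, with or without a value.
        "min_bytes": len(values) * storage_length,
    }


# Hypothetical sample of four documents, one without a value.
sample = ["REF-0001", "REF-0002", None, "REF-0103"]
print(sort_field_report(sample, prefixes=["REF-"], storage_length=4))
```

The min_bytes figure is a lower bound, because the field uses slightly more than the storage length for each document.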