Document tracking is a feature of the IDOL components that are involved in indexing. It reports on the progress of documents as they pass through an index chain. Every time a document reaches a certain stage in the indexing process, the component commits event data to a back end, which stores the events.
The back end can be a log file, or an SQL database (this option requires an additional library, which is available in the IDOL Server installers). You retrieve the events by using an appropriate interface for your chosen back end, for example an SQL client for a database.
You can use document tracking to detect problems with the indexing process.
For example, document tracking might flag when a batch of documents is rejected because of misconfigured fields, or when an existing document is deleted.
The stored event data includes the references of documents that you index, and other metadata, so that you can quickly find the exact source of problems. Similarly, the data can point to issues with individual components in the indexing system.
For example, if one Content component commits event data at a lower rate than expected, it might indicate a bottleneck in your indexing process in or near that component.
Document tracking provides an overview of what happens to a document as it moves through the index process. It is useful for troubleshooting.
For example, you might expect a document to pass through a certain indexing stage, but the tracking data shows that it did not reach this stage. This might indicate that there is an issue with the missing stage, or that the document does not qualify for it. In either case, you can use this information to diagnose and solve the problem.
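The check described above can be sketched in Python. This is only an illustration: the record layout (dictionaries with reference and event keys) and the sample data are hypothetical, and the real shape of the data depends on the back end you choose. Only the event names come from the IDOL components.

```python
from collections import defaultdict


def missing_stage(events, expected_stage):
    """Return the references that have events but never reached expected_stage."""
    stages_by_reference = defaultdict(set)
    for event in events:
        stages_by_reference[event["reference"]].add(event["event"])
    return sorted(
        reference
        for reference, stages in stages_by_reference.items()
        if expected_stage not in stages
    )


# Hypothetical event records, as they might look after retrieval from the back end.
events = [
    {"reference": "doc-1", "event": "Added"},
    {"reference": "doc-1", "event": "Indexed"},
    {"reference": "doc-1", "event": "Committed"},
    {"reference": "doc-2", "event": "Added"},
    {"reference": "doc-2", "event": "Indexed"},
]

print(missing_stage(events, "Committed"))  # ['doc-2']
```

Here doc-2 was indexed but never committed, so it is the document to investigate.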
For full details of how to configure document tracking, refer to the IDOL Server Administration Guide and IDOL Server Reference.
The document tracking components commit event data to the back end in batches for efficiency. Each component writes data to an intermediate log file, which it stores in the directory specified by the DocTrackDir configuration parameter. The component periodically processes this file, and the frequency of processing depends on the values of the TimeoutSeconds and MaxEventsPerFile configuration parameters. When either the specified period elapses or the file contains the specified maximum number of events, the component commits that data to the back end.
Generally, set TimeoutSeconds to a small value, so that the back end is an up-to-date model of the document status in your system, and set MaxEventsPerFile to a large value, because sending larger batches to the back end is more efficient when the system is under heavy indexing load.
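The interaction between the two parameters can be modeled with a short Python sketch. It is not the IDOL implementation: the EventBatcher class, its commit callback, and the injectable clock are hypothetical names used only to show the "whichever limit is reached first" rule.

```python
import time


class EventBatcher:
    """Toy model: commit pending events on a timeout or when the batch is full."""

    def __init__(self, timeout_seconds, max_events_per_file, commit, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds          # models TimeoutSeconds
        self.max_events_per_file = max_events_per_file  # models MaxEventsPerFile
        self.commit = commit                            # hypothetical back-end writer
        self.clock = clock
        self.pending = []
        self.last_commit = clock()

    def add(self, event):
        self.pending.append(event)
        self.flush_if_due()

    def flush_if_due(self):
        timed_out = self.clock() - self.last_commit >= self.timeout_seconds
        full = len(self.pending) >= self.max_events_per_file
        if self.pending and (timed_out or full):
            self.commit(list(self.pending))
            self.pending.clear()
            self.last_commit = self.clock()


# Drive the model with a fake clock so that the behavior is deterministic.
now = [0.0]
batches = []
batcher = EventBatcher(10, 3, batches.append, clock=lambda: now[0])

for name in ("a", "b", "c"):   # the third event fills the batch
    batcher.add(name)
batcher.add("d")               # stays pending: batch not full, no timeout yet
now[0] = 11.0                  # more than 10 seconds since the last commit
batcher.flush_if_due()

print(batches)  # [['a', 'b', 'c'], ['d']]
```

With a small timeout and a large event limit, the timeout governs under light load (events arrive promptly at the back end) and the event limit governs under heavy load (batches stay large).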
Do not modify or delete files in the DocTrackDir during operation. You can clear the directory during downtime if you want to discard events.
Each component must have its own DocTrackDir, which must not be shared.
In a unified IDOL Server configuration under the IDOL Proxy component, multiple components might share the same DocTrackDir setting. In this case, set DocTrackDir to a relative path that creates a subdirectory, so that each component creates its own directory. Do not use a relative path that traverses back up the file tree.
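As a rough illustration of the path rule, the following Python sketch checks a candidate DocTrackDir value before you put it in a shared configuration. The function name and the sample paths are hypothetical, and the check only inspects the string; it does not reproduce any validation that IDOL itself performs.

```python
from pathlib import PureWindowsPath


def is_safe_relative_doctrackdir(value):
    """True if value is relative, names a subdirectory, and never climbs with '..'."""
    # PureWindowsPath accepts both / and \ separators, so one check covers both styles.
    path = PureWindowsPath(value)
    if path.drive or path.root:   # absolute or drive-anchored path
        return False
    if not path.parts:            # empty or "." does not name a subdirectory
        return False
    return ".." not in path.parts


for candidate in ("./doctrack", "../doctrack", "logs/../../doctrack", "/opt/idol/doctrack"):
    print(candidate, is_safe_relative_doctrackdir(candidate))
```

Only ./doctrack passes: the others are either absolute or traverse back up the file tree.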
If the document tracking component cannot contact the back-end server, the component keeps the events and reattempts committing the file.
You can optionally configure the maximum size of the DocTrackDir by using the DocTrackDirMaxSizeKB configuration parameter. In most cases, the default value is appropriate. Micro Focus recommends using a large value, because the component discards events when the directory reaches the maximum size.
Each indexing component generates different events.
A Content component creates an Indexed event when the document is indexed, and a Committed event when the document is available for querying.
Most connectors generate events that describe the creation of tasks. The HTTP and File System Connectors create an Added event when they find a new file or Web page for indexing and create a document to represent it. They create an Updated event when a file or Web page has changed (which the Content component eventually processes as a DREREPLACE index action).
CFS creates events representing the continuations of events that other connectors generate. It creates an Import:Queue event when a document reaches CFS and is in the index queue. It creates an Update Received event when a connector sends an update operation, which CFS converts to a DREREPLACE index action to send to the Content component.
Each event also has a source string, which has the format COMPONENTNAME_IPADDRESS_SERVICEPORTNUMBER.
For example, a Content component might have the source string content_10.2.106.5_5502.
This source string allows you to easily see the workflow for each document as it passes through the indexing system. Each component has a unique source string, so you can quickly identify any problems with a component, even in a complex indexing setup.
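If you post-process events yourself, a source string in this format can be split into its parts. The following Python sketch is one way to do it; parse_source is a hypothetical helper name, and splitting from the right is a design choice so that a component name containing underscores would still parse.

```python
def parse_source(source):
    """Split COMPONENTNAME_IPADDRESS_SERVICEPORTNUMBER into its three parts."""
    # Split from the right: the IP address and port never contain underscores.
    component, ip_address, port = source.rsplit("_", 2)
    return component, ip_address, int(port)


print(parse_source("content_10.2.106.5_5502"))  # ('content', '10.2.106.5', 5502)
```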
You can use the IndexUID parameter in DREADD and DREADDDATA index actions to track the progress of a batch of documents. This parameter adds the value you specify as an identification tag to each document in the batch. You can query the back end for this tag to track the progress of each document in the batch.
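As a sketch of how the tag might be attached, the following Python helper builds a DREADD index action URL that carries an IndexUID. The host, index port, file path, and tag value are all hypothetical, and the URL layout (index file first, then parameters separated by &) is an assumption about the usual index action form; confirm the exact syntax in the IDOL Server Reference before you rely on it.

```python
from urllib.parse import quote


def build_dreadd_url(host, index_port, idx_file, index_uid):
    """Build a DREADD index action URL that tags the batch with index_uid."""
    # Assumed layout: http://host:port/DREADD?<file>&IndexUID=<tag>
    file_part = quote(idx_file, safe="/:")
    tag_part = quote(index_uid, safe="")
    return f"http://{host}:{index_port}/DREADD?{file_part}&IndexUID={tag_part}"


# Hypothetical values for illustration only.
print(build_dreadd_url("localhost", 9001, "/data/batch42.idx", "batch-42"))
# http://localhost:9001/DREADD?/data/batch42.idx&IndexUID=batch-42
```

You can then search the back end for the same tag value (batch-42 in this sketch) to follow every document in that batch.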
When you build and configure your document tracking solution, consider the following factors:
Do you need to install additional software or database drivers? Are any other dependencies required for your setup to work?
If you do not have permissions to install the dependencies on your host machine, you might want to use a simpler setup, or one that uses existing installations.
In a distributed setup, what is an acceptable rate of data insertion? How far apart are your indexing components and back end in the network? You might want to test with different values of the TimeoutSeconds and MaxEventsPerFile configuration parameters to find the optimal values for your setup.