Content Indexes
For document content indexing (DCI), Content Manager uses either Elasticsearch, an open-source, enterprise-grade search engine, as described in CM23.4_ElasticsearchInstall_Config.pdf, or OpenText Intelligent Data Operating Layer (IDOL), as described in CM23.4_IDOL_DCI_Install_Config.pdf.
NOTE: In installations that use ControlPoint, ControlPoint continues to use IDOL, so it may be necessary to run the Reindex tool for an IDOL database even though Content Manager uses Elasticsearch.
For this functionality to be available, the feature Document Content Indexing must be selected in the System Options Features page.
IMPORTANT: For the IDOL Index and/or Elasticsearch Index tabs to be accessible in the Administration ribbon, the user must have a Security profile of <Highest>.
This function rebuilds the document content indexes and the Location indexes with the user security settings, and performs Auto-Classification training.
It is not for re-indexing the words in the text fields. For this function, see Recreate word indexes.
NOTE: Highly recommended:
- Use a temporary index for re-indexing so that users can continue to work. To do this, create the new index in a new location. When the reindex is complete, copy it to the location of the original index.
- Run the re-indexing process on the Workgroup Server for efficiency and to pick up the defaults.
You can use this function to select existing Content Manager records with electronic documents to index the document content for the first time or to reindex if they have not been indexed correctly, for example because of a network failure.
You should also use it when an entire Content Manager document content index has been lost or damaged, when upgrading a Content Manager 7.2 IDOL index, and when the indexing parameters have changed, for example which Additional Fields to index.
To improve the very first document content indexing process, see Content Indexes below.
- Start the Content Manager client with the dataset to reindex.
- On the Administration ribbon, on either the IDOL Index or the Elasticsearch Index tab, click Records.
The Reindex Records dialogue box appears. Complete the required fields. The tabs for the IDOL Index are described first:
Select tab - to select the records to reindex:
- Enter search string to select items to be reindexed - type or use the KwikSelect button to form a search string to select the records to index
- Maximum size of any file that can be content indexed (MB) - default: 200. Content Manager does not index the content of documents whose file size is greater than this number in megabytes. This data should be the same as in the Content Manager Enterprise Studio General - Miscellaneous - Properties - Miscellaneous dialogue box. Content Manager fills in the data automatically when the client runs on the same computer as the Workgroup Server.
- Maximum size of any archive file (zip, gz, tar) that can be content indexed (MB) - default: 50. Content Manager does not index the content of archive files whose file size is greater than this number in megabytes. Content Manager populates this field automatically, when possible. See previous option for details.
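The two size limits above amount to a simple eligibility check. The following Python sketch illustrates the rule with the defaults shown in this dialogue box (200 MB and 50 MB). The function name, the extension-based archive detection, the definition of a megabyte, and the treatment of a file exactly at the limit are assumptions made for illustration, not Content Manager's implementation.

```python
from pathlib import PurePath

MB = 1024 * 1024  # assumption: 1 MB is counted as 1,048,576 bytes

# Archive types named in the dialogue box; assumed to be recognised by extension.
ARCHIVE_SUFFIXES = {".zip", ".gz", ".tar"}


def is_content_indexed(file_name: str, size_bytes: int,
                       max_file_mb: int = 200, max_archive_mb: int = 50) -> bool:
    """Return True if a file of this size would have its content indexed."""
    suffix = PurePath(file_name).suffix.lower()
    limit_mb = max_archive_mb if suffix in ARCHIVE_SUFFIXES else max_file_mb
    # Files larger than the limit are skipped; a file exactly at the limit passes.
    return size_bytes <= limit_mb * MB


if __name__ == "__main__":
    print(is_content_indexed("report.pdf", 150 * MB))   # under the 200 MB limit
    print(is_content_indexed("backup.zip", 60 * MB))    # over the 50 MB archive limit
```

A check like this can help estimate, before a reindex run, how many documents in a store will be skipped because of their size.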
Protocol tab - to specify the indexing computer:
- IDOL Server Name - the computer that runs the Content Manager IDOL Service, the main IDOL Server
- IDOL Server Port - IDOL Server port number, used for searching. See next option for more.
- IDOL Server Port (Index) - Used for sending documents to the server.
The port numbers should remain at their defaults unless you specifically changed the IDOL configuration file in the folder C:\Program Files\Micro Focus\Content Manager\IDOL\TRIM IDOL Service or your equivalent.
- Instance name for Content Manager data - name of the instance on the IDOL server that this dataset communicates with. Defaults to CM_[database ID].
When you create a content index for a dataset, Content Manager creates an instance on the IDOL server, which is what Content Manager uses for all IDOL searches.
This value should not be changed unless you reindex the dataset afterwards, because Content Manager Enterprise Studio creates a new instance on the IDOL server when the instance name changes.
CAUTION: While you can change the value in this field and reindex to a different instance, it is not recommended. When you reindex into a different instance, Content Manager will not be able to find anything in that index, and the reindex creates unusable data in the other index.
The only scenario in which an experienced Content Manager and IDOL administrator may want to reindex to a different instance is to create the IDOL index offline, for example for a test dataset, or to create a new index while still using an old content index.
- IDOL Server is the OEM service shipped with Content Manager - select when you are using Content Manager's OEM IDOL version for indexing
- Ignore read-only warnings for IDOL instance - Select this option to ignore read-only warnings for content engines that have been set to read-only. Only suppress the warning if some of the content engines have been specifically marked as read-only.
- Reindex as Dataset - To reindex content to a different dataset, specify in this field the target dataset value (e.g. CM_45). This makes it possible to create a valid content engine for a production database while running the reindex on a copied dataset.
- Maximum queued transactions - default: 20, which is the number of records in the queue. The smaller this number, the shorter the queue, which makes processing more reliable. There should be no need to change this number other than for troubleshooting. In that case, you could even reduce it to 1 to process one record at a time.
Options tab - to specify the indexing parameters:
- Number of processing threads - default: 5. You can use a number from 1 to 20. The greater the number of threads, the faster the indexing process. How many threads you can run depends on your hardware capacity.
- Log output to - the log file location
- Produce detailed log messages - select to see more detail in the log files
- Perform metadata updates only - only relevant for IDOL metadata that is available to third-party applications using IDOL indexes, for example OpenText ControlPoint. Select to update only the IDOL record metadata. Useful when you have changed the metadata to index, for example by adding an Additional Field, but the electronic documents have not changed.
- Only reindex missing records - select to reindex only records that have not previously been indexed. Useful to index the records that were not indexed, for example because the indexing run did not complete for some reason.
- Only report missing records - available when the option Only reindex missing records is selected - select to merely find the records that have not been indexed before and then report them in the log files, but not reindex them.
- Only Generate Settings file - select this option to save the configuration settings as an XML file. By default, it's saved to the defined Log file path.
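As a conceptual illustration of how Number of processing threads and Maximum queued transactions interact, the following Python sketch feeds a fixed pool of worker threads from a bounded queue. It is a generic producer/consumer model and not Content Manager's implementation. The function reindex and the index_one callback are hypothetical names, and error handling is omitted.

```python
import queue
import threading


def reindex(record_ids, index_one, threads=5, max_queued=20):
    """Process records with worker threads fed by a bounded queue."""
    if not 1 <= threads <= 20:
        raise ValueError("threads must be between 1 and 20")
    if max_queued < 1:
        raise ValueError("max_queued must be at least 1")

    pending = queue.Queue(maxsize=max_queued)  # put() blocks while the queue is full
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            record_id = pending.get()
            if record_id is None:  # sentinel: no more work for this worker
                return
            result = index_one(record_id)
            with lock:
                done.append(result)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for record_id in record_ids:
        pending.put(record_id)  # waits here whenever max_queued records are pending
    for _ in workers:
        pending.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return done


if __name__ == "__main__":
    results = reindex(range(100), lambda record_id: record_id * 2)
    print(len(results))
```

In this model, a smaller queue means fewer records are in flight at any moment, and setting both values to 1 processes strictly one record at a time, which is why that setting is suggested for troubleshooting.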
For the Elasticsearch Index, the Reindex Records dialogue box has the following tabs:
General tab - to select the records to reindex:
- Search String - type or use the KwikSelect button to form a search string to select the records to index
- Number of indexing threads - default: 4. You can use a number from 1 to 20. The greater the number of threads, the faster the indexing process. How many threads you can run depends on your hardware capacity.
- Working Directory - browse to, and select, the path that will be the Working Directory for the Elasticsearch processes.
- Enable Logging - select this option to generate a log for Elasticsearch processes.
- Produce detailed log messages - select to see more detail in the log files.
- Only Generate Settings file - select this option to save the configuration settings as an XML file; if this option is selected a reindex is not performed. By default, it's saved to the defined Working Directory path.
Protocol tab - to specify the indexing computer:
- Use content index configured for this dataset - select this option to use the Elasticsearch server configured in Content Manager Enterprise Studio.
NOTE: This option will only be available if an Elasticsearch content index is configured in Content Manager Enterprise Studio.
- Use another content index - select this option to use an Elasticsearch server that is different to the one configured in Content Manager Enterprise Studio.
- Elasticsearch Server URL - enter the Elasticsearch Server URL.
- Index Name - name of the instance on the Content Index Server that this dataset communicates with. Defaults to cm_[database ID].
When you create a content index for a dataset, Content Manager creates an instance on the Content Index server, which is what Content Manager uses for all index searches.
This value should not be changed unless you reindex the dataset afterwards, because Content Manager Enterprise Studio creates a new instance on the Content Index server when the instance name changes.
CAUTION: While you can change the value in this field and reindex to a different instance, it is not recommended. When you reindex into a different instance, Content Manager will not be able to find anything in that index, and the reindex creates unusable data in the other index.
The only scenario in which an experienced Content Manager and Content Index administrator may want to reindex to a different instance is to create the Content Index offline, for example for a test dataset, or to create a new index while still using an old content index.
Authentication
On the Protocol tab, click Authentication to set the Custom Authentication settings:
Enable Elasticsearch X-Pack authentication - select this option to enable X-Pack authentication. You can either specify a user name and password or a certificate.
- User Name - enter the user name to connect to the Elasticsearch server.
- Set Password - click Set Password to enter the password for the user being used for authentication.
- Client Certificate - if using a certificate for the X-Pack authentication, the certificate must be installed to the Personal store of the Local Computer account. Once installed, the certificate will appear in the Client Certificate drop-down list. Select the required certificate from the drop-down list.
- View Certificate - click to view the selected Client Certificate.
TIP: If you get certificate validation errors, ensure that the local computer trusts the certificate authority of the certificate that the Elasticsearch server is presenting.
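Before reindexing against a different server or index, it can be useful to confirm that the server URL, index name, and user name and password are accepted. The following Python sketch uses the Elasticsearch index-exists request (HEAD on the index URL) with HTTP basic authentication. The host name, port, and credentials in the usage example are placeholders, the helper names are hypothetical, and the certificate-based option is not covered.

```python
import base64
import urllib.error
import urllib.request


def index_url(server_url: str, index_name: str) -> str:
    """Join the server URL and the index name into the URL of the index."""
    return f"{server_url.rstrip('/')}/{index_name}"


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def index_exists(server_url: str, index_name: str, user: str, password: str) -> bool:
    """Return True if the index exists (HTTP 200) and False if it does not (HTTP 404)."""
    request = urllib.request.Request(index_url(server_url, index_name), method="HEAD")
    request.add_header("Authorization", basic_auth_header(user, password))
    try:
        # The server certificate is validated against the local trust store.
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # for example 401 when the credentials are rejected


if __name__ == "__main__":
    # Placeholder values: replace with your own server, index, and account.
    print(index_url("https://es.example.com:9200/", "cm_45"))
```

A certificate validation failure surfaces here as a urllib.error.URLError, which corresponds to the trust problem described in the TIP above.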
Enable Amazon Web Service (AWS) authentication - select this option to use the Amazon Web Service (AWS) version of the Elasticsearch service.
- Access Key - enter the AWS Access Key. This is the username for the AWS IAM user that has access to the Elasticsearch services inside AWS.
- Secret Key - enter the AWS Secret Key. This is the password for the Access Key user.
- AWS Region - enter the AWS Region name.
- Use Amazon Security Token Service (STS) - select this option to create and provide trusted users with temporary security credentials that control access to your AWS resources.
- URL - enter the URL for the STS endpoint (example: https://sts.ap-southeast-1.amazonaws.com)
- Region - enter the AWS STS Region name (example: ap-southeast-1)
- RoleARN - enter the Role that delegates access to the Amazon AWS resource for the AWS IAM user.
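Because the STS URL and the STS Region are entered separately, a typing error can leave them inconsistent. The following Python sketch extracts the region from a regional STS endpoint URL of the form shown in the example and compares it with the region value. The function names are hypothetical, and the pattern assumes the standard amazonaws.com domain, so the global endpoint and other AWS partitions are rejected.

```python
import re
from urllib.parse import urlparse

# Regional STS endpoints have the form sts.<region>.amazonaws.com
_STS_HOST = re.compile(r"^sts\.([a-z0-9-]+)\.amazonaws\.com$")


def sts_region_from_url(url: str) -> str:
    """Extract the region from a regional STS endpoint URL."""
    parsed = urlparse(url.strip())
    match = _STS_HOST.match(parsed.hostname or "")
    if parsed.scheme != "https" or match is None:
        raise ValueError(f"not a regional STS endpoint URL: {url!r}")
    return match.group(1)


def check_sts_settings(url: str, region: str) -> None:
    """Raise ValueError if the URL and the region do not agree."""
    url_region = sts_region_from_url(url)
    if url_region != region:
        raise ValueError(f"URL is for {url_region!r} but Region is {region!r}")


if __name__ == "__main__":
    check_sts_settings("https://sts.ap-southeast-1.amazonaws.com", "ap-southeast-1")
    print("STS URL and Region are consistent")
```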
Options tab - to specify the indexing parameters:
- Maximum size of any file that can be content indexed (MB) - default: 2048. Content Manager does not index the content of documents whose file size is greater than this number in megabytes. This data should be the same as in the Content Manager Enterprise Studio General - Miscellaneous - Properties - Miscellaneous dialogue box. Content Manager fills in the data automatically when the client runs on the same computer as the Workgroup Server.
- Maximum size of any archive file (zip, gz, tar) that can be content indexed (MB) - default: 50. Content Manager does not index the content of archive files whose file size is greater than this number in megabytes. Content Manager populates this field automatically, when possible. See previous option for details.
- Update record metadata only - only relevant for Content Index metadata that is available to third-party applications using the content indexes, for example OpenText ControlPoint. Select to update only the record metadata. Useful when you have changed the metadata to index, for example by adding an Additional Field, but the electronic documents have not changed.
- Only index missing records - select to reindex only records that have not previously been indexed. Useful to index the records that were not indexed, for example because the indexing run did not complete for some reason.
- Only report missing records - available when the option Only index missing records is selected - select to merely find the records that have not been indexed before and then report them in the log files, but not reindex them.
- Reindex records without content - select to check the Elasticsearch index for records that have electronic documents associated with them but do not have their content in the index. When this option is set only records with missing content will be reindexed.
- Report records without content - as per the above option, but generates a report of the records that are missing content and writes it to the log file.
NOTE: The Only report missing records and Report records without content options can only be run with a single thread. If the thread count is greater than 1, a warning is displayed.
- Access files directly from the document store - the default indexing behaviour is to transfer the document from the document store to the client’s working directory using the Content Manager Workgroup server and then extract the text there. When this option is selected the document content is extracted directly from the document store, as long as the file can be read. This will improve indexing performance if the indexing client is located close to the document store, that is, on the same network space, and also reduce the amount of disk space required. This option is not recommended over slow network connections.
- Remove Existing Documents - select this option to remove existing child items that are created as a part of Elasticsearch indexing before running the reindex.
When a record’s electronic document is indexed into Elasticsearch, and there is text content that can be indexed, at least 2 entries are created in the Elasticsearch index. The first entry is the parent item (or ‘document’ in Elasticsearch terminology) that contains all the metadata for the record, and the second is a child item that contains the content of the electronic document. Depending on the content index setting in Content Manager Enterprise Studio for “Elastic Document Content Field size limit”, and the amount of text to be indexed, there may be many more child items (documents) created. Additionally, if the electronic document is a compound file (e.g. a zip file), then at least one child item (document) is created for each file found in the compound file.
If the electronic document is updated, Content Manager cannot know in advance whether the number of child items will change, so it must search for all child items and remove them before updating the entry for the record. This is what happens for a content index event update for the Record. However, searching for and then removing these child items carries an overhead. When performing a reindex operation on a large dataset, this has the potential to slow down the operation significantly. For a new index that has no data, or when you know that the electronic documents have not changed to a large degree, disabling this option allows the reindex to complete sooner.
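The relationship between text size, the field size limit, and the number of index entries can be sketched as follows. This Python example assumes that the text of each file is split into consecutive child items of at most the configured limit, and that the limit and the text lengths use the same unit. Both are assumptions made for illustration, as are the function and parameter names; the actual splitting behaviour is not described here.

```python
import math


def estimated_entries(text_lengths, field_size_limit):
    """Estimate the index entries for one record: one parent plus its child items.

    text_lengths holds the amount of indexable text for each file that has text
    content: one value for a simple document, one per member of a compound file.
    """
    if field_size_limit <= 0:
        raise ValueError("field_size_limit must be positive")
    children = 0
    for length in text_lengths:
        if length <= 0:
            raise ValueError("each file must have some indexable text")
        # Assumed splitting rule: one child item per started chunk of text.
        children += math.ceil(length / field_size_limit)
    return 1 + children  # the parent item holds the record metadata


if __name__ == "__main__":
    print(estimated_entries([500], 1000))             # 2: parent plus one child
    print(estimated_entries([2500], 1000))            # 4: parent plus three children
    print(estimated_entries([100, 100, 3000], 1000))  # 6: compound file, three members
```

Under these assumptions, a document whose text shrinks from 2500 to 500 units goes from three child items to one, which is why stale child items have to be found and removed when existing documents are reindexed.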
- Click OK.
Content Manager starts re-indexing and the Reindexing Progress dialogue appears with the status of the items to index:
- View Log - click to see the log file
- Pause - click to pause the indexing process for all threads. Useful for example to add disk space for the index. When paused, click Resume to continue indexing.
- When indexing has completed, the Pause button becomes unavailable. Click Close to close the dialogue box.
NOTE: If using Elasticsearch as your Document Content Index, see Elasticsearch Metadata Index for details on how to index metadata.
NOTE: See also Content Manager Enterprise Studio Help for information about Content Manager indexing.