Auto-Classification

Setting up Content Manager for auto-classification

In your Classification hierarchy, configure at least one Classification to be a holding bay for the auto-classification process:
In the Classification's properties dialogue box, select the tab Auto-Classification and select the option Use this Classification to hold records waiting to be auto-classified. See Classification Auto-Classification page.
In your Classification hierarchy, configure each Classification that Content Manager should use for auto-classification:
In the Classification's properties dialogue box, select the tab Auto-Classification, select the option Make this Classification available for selection by the auto-classification process, and then select the options that suit your organisation's needs.
See Classification Auto-Classification page.
Under each Classification that you configured for auto-classification of records, it is recommended to have at least 50 records with electronic documents that you think are very good examples of the records that should be under this Classification. Content Manager can then use the 50 oldest records under each Classification to train itself in the criteria for the records under each Classification.
Run Auto-Classification from the Content Manager Administration ribbon, under either IDOL Index or Elasticsearch Index.
The Auto-Classification Training dialogue appears. Complete the required fields:

IDOL Auto-Classification

Select tab - to select the records to be trained:

Enter search string to select classifications to be created and trained - type or use the KwikSelect button to form a search string to select the records to be created and trained

IDOL Server tab - to specify the indexing computer:

IDOL Server Name - the computer that runs the Content Manager IDOL Service, the main IDOL Server
IDOL Server Port - IDOL Server port number, used for searching. See next option for more.
IDOL Server Port (Index) - Used for sending documents to the server.
The port numbers should remain at their defaults unless you specifically changed the IDOL configuration file in the folder C:\Program Files\Micro Focus\Content Manager\IDOL\TRIM IDOL Service or your equivalent.
Instance name for Content Manager data - name of the instance on the IDOL server that this dataset communicates with. Defaults to CM_[database ID].
When you create a content index for a dataset, Content Manager creates an instance on the IDOL server, which is what Content Manager uses for all IDOL searches.
Do not change this value unless you reindex the dataset afterwards, because Content Manager Enterprise Studio creates a new instance on the IDOL server when the instance name changes.
CAUTION: While you can change the value in this field and reindex to a different instance, it is not recommended. When you reindex into a different instance, Content Manager will not be able to find anything in that index, and unusable data is created in the other index.
The only scenario in which an experienced Content Manager and IDOL administrator may want to reindex to a different instance is to create an IDOL index offline, for example, for a test dataset, or to create a new index while still using an old content index.

IDOL Server is the OEM service shipped with Content Manager - select when you are using Content Manager's OEM IDOL version for indexing
Ignore read-only warnings for IDOL instance - Select this option to ignore read-only warnings for content engines that have been set to read-only. Only suppress the warning if some of the content engines have been specifically marked as read-only.
Reindex as Dataset - To reindex content to a different dataset, specify in this field the target dataset value (e.g. CM_45). This makes it possible to create a valid content engine for a production database while running the reindex on a copied dataset.
Maximum queued transactions - default: 20, which is the number of records in the queue. The smaller this number, the shorter the queue, which makes processing more reliable. There should be no need to change this number other than for troubleshooting. In that case, you could even reduce it to 1 to process one record at a time.

Options tab - to specify the indexing parameters:

Number of processing threads - more than four threads should not be necessary
Log output to - to specify the location for the log file
Produce detailed log messages - select for the log to contain more detail
Use classification notes for training - select to use the notes on the Classifications to train IDOL. IDOL gives more weight to the terms in the Classification notes, but typically finds more representative terms in the electronic documents under the Classification.
Use records within Classification structure for training - select for IDOL to train itself by using the records that are already in the Classification system. When this option is selected, IDOL automatically identifies the 50 oldest records based on Date Registered under the Classifications and uses them for training. The oldest records are the most likely to have been correctly filed under a Classification, as they had the longest time to have mistakes corrected.
NOTE: If this option is not selected, IDOL will use all records returned by the search for training.

Only Generate Settings file - select this option to save the configuration settings as an XML file. By default, it's saved to the defined Log output path.
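
As an illustration of the option Use records within Classification structure for training, the following minimal Python sketch picks the 50 oldest records per Classification by Date Registered. The Record type, its field names and the select_training_records function are hypothetical and are not part of any Content Manager or IDOL API; the sketch only mirrors the selection rule described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

TRAINING_SAMPLE_SIZE = 50  # number of oldest records used per Classification


@dataclass(frozen=True)
class Record:
    # Hypothetical stand-in for a record; not a Content Manager type.
    number: str
    classification: str
    date_registered: date


def select_training_records(records, use_classification_structure=True):
    """Group records by Classification and pick the ones used for training."""
    by_classification = defaultdict(list)
    for record in records:
        by_classification[record.classification].append(record)

    selected = {}
    for classification, members in by_classification.items():
        # Oldest first, based on Date Registered.
        ordered = sorted(members, key=lambda r: r.date_registered)
        if use_classification_structure:
            # Option selected: keep only the oldest records.
            ordered = ordered[:TRAINING_SAMPLE_SIZE]
        # Option not selected: every record returned by the search is used.
        selected[classification] = ordered
    return selected
```

A Classification with fewer than 50 records simply contributes all of its records in this sketch, which is one reason the setup steps recommend at least 50 good examples per Classification.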

Elasticsearch Auto-Classification

General tab - to select the records to be trained:

Search String - type or use the KwikSelect button to form a search string to select the records to be created and trained.
Number of indexing threads - default: 4. You can use a number from 1 to 20. The greater the number of threads, the faster the indexing process. It depends on your hardware capacity how many threads you can run.
Working Directory - browse to the directory that the Elasticsearch Auto-Classification training logs will be written to.

Enable Logging - select this option to generate a log for Elasticsearch processes.
- Produce detailed log messages - select to see more detail in the log files.
Only Generate Settings file - select this option to save the configuration settings as an XML file. By default, it's saved to the defined Working Directory path.

Protocol tab - to specify the indexing computer:

Use content index configured for this dataset - select this option to use the Elasticsearch server configured in Content Manager Enterprise Studio.
NOTE: This option will only be available if an Elasticsearch content index is configured in Content Manager Enterprise Studio.
Use another content index - select this option to use an Elasticsearch server that is different to the one configured in Content Manager Enterprise Studio.
- Elasticsearch Server URL - enter the Elasticsearch Server URL.
- Index Name - name of the instance on the Content Index Server that this dataset communicates with. Defaults to cm_[database ID].
  When you create a content index for a dataset, Content Manager creates an instance on the Content Index server, which is what Content Manager uses for all index searches.
  Do not change this value unless you reindex the dataset afterwards, because Content Manager Enterprise Studio creates a new instance on the Content Index server when the instance name changes.
  CAUTION: While you can change the value in this field and reindex to a different instance, it is not recommended. When you reindex into a different instance, Content Manager will not be able to find anything in that index, and unusable data is created in the other index.
  The only scenario in which an experienced Content Manager and Content Index administrator may want to reindex to a different instance is to create a Content Index offline, for example, for a test dataset, or to create a new index while still using an old content index.

Authentication

On the Protocol tab, click Authentication to set the Custom Authentication settings:

Options tab - to choose the options for processing the Classifications and how the Content Index engine should train:

Preserve training data from the previous run - select this option to add any classifications that are trained to the existing training data.
NOTE: When this option is selected, only a single thread is used to perform the training, even if you have specified more. If you are on the General tab when you start the training run, a warning message appears and the thread count is set to 1. If the option is not selected, any previous training data is deleted and replaced with the new results.
Use classification notes for training - select to use the notes on the Classifications to train the Content Index engine.
Use records in the classification structure for training - select for the Content Index engine to train itself by using the records that are already in the Classification system. When this option is selected, the Content Index engine automatically identifies the oldest records based on Date Registered under the Classifications and uses them for training. The oldest records are the most likely to have been correctly filed under a Classification, as they have had the most time to have any mistakes corrected.
- Number of records per classification to use - determine the number of records from each classification to use for training. A value of 0 will use all the records in the classification.
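
The interaction between the option Preserve training data from the previous run and the thread count can be sketched as follows. This is a minimal Python illustration of the rules described above, not Content Manager code; the function names and the dictionary representation of training data are assumptions made for the example.

```python
MIN_THREADS = 1
MAX_THREADS = 20  # range documented for Number of indexing threads


def effective_thread_count(requested, preserve_previous):
    """Return the number of threads a training run would use."""
    if not MIN_THREADS <= requested <= MAX_THREADS:
        raise ValueError(f"thread count must be {MIN_THREADS}-{MAX_THREADS}")
    # Preserving earlier training data forces single-threaded training.
    return 1 if preserve_previous else requested


def apply_training_run(previous, new, preserve_previous):
    """Combine training data keyed by Classification (hypothetical layout)."""
    if preserve_previous:
        # Newly trained Classifications are added to the existing data.
        # Assumption: a retrained Classification overwrites its earlier entry.
        return {**previous, **new}
    # Otherwise the previous training data is discarded and replaced.
    return dict(new)
```

For example, effective_thread_count(8, preserve_previous=True) returns 1, which corresponds to the warning shown when the run starts with more than one thread configured.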

Click OK to start training.
Content Manager works through the selected Classification terms, creates them in the Content Index engine as categories and generates the training terms that it finds best represent the records under the Classifications.
When the process is complete, Content Manager displays a message.
To view or modify the training terms the Content Index engine generated for a Classification, use the Classification's Auto-Classification tab button Category Training. See Classification Auto-Classification page.
In Content Manager Enterprise Studio, your organisation's Content Manager administrator must configure event processing for the dataset by setting the event processor type Auto Classify Records to Enabled and choosing a Workgroup Server to process the events. See TRIMEnterpriseStudio.chm in your Content Manager installation folder.

Running auto-classification

Auto-classification starts processing after your system has been set up as described above. Content Manager periodically checks the Classification terms that were designated as holding bays for records. For each record it finds, it asks the Content Index engine to suggest a Classification. If the Content Index engine's confidence in the suggested Classification is equal to or higher than the configured minimum confidence level, Content Manager moves the record to that Classification. Security, retention details and all other inherited properties are automatically updated when the record is moved.

The Content Index engine must already have content indexed the record. If the content index event has not yet been processed, or if the Content Index engine has not yet updated its index, the record is skipped until it is available in the content index.
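
The processing described in this section can be summarised in a short Python sketch. It is a simplified model for illustration only: the HeldRecord and Suggestion types, the "Holding Bay" name and the suggest callback are hypothetical stand-ins, not Content Manager or Content Index engine interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    classification: str
    confidence: int          # 0-100, as reported by the engine
    minimum_confidence: int  # configured on the suggested Classification


@dataclass
class HeldRecord:
    number: str
    content_indexed: bool
    classification: str = "Holding Bay"


def process_holding_bay(records, suggest: Callable[[HeldRecord], Optional[Suggestion]]):
    """Return the numbers of the records moved out of the holding bay."""
    moved = []
    for record in records:
        if not record.content_indexed:
            # Skipped until the record is available in the content index.
            continue
        suggestion = suggest(record)
        if suggestion is None:
            continue
        if suggestion.confidence >= suggestion.minimum_confidence:
            # Confidence meets the minimum: move the record.
            record.classification = suggestion.classification
            moved.append(record.number)
    return moved
```

Records that are skipped or that do not reach the minimum confidence level stay in the holding bay in this model, so a later check can pick them up again.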

Reviewing auto-classification

The auto-classification process requires monitoring and fine tuning by using the Classification Auto-Classification tab function Category Training, particularly in the early stages. See Category Training.

Over time, the Content Index engine will get better and better at finding the correct Classification for each record and there will be much less maintenance required.

After a record has gone through the auto-classification process, Content Manager stores the Content Index engine's confidence level in the record field Auto-Classification Confidence Level. You can review this field to help adjust the minimum confidence levels of the Classifications.
You can use the Administration ribbon System Options - Classification page option When auto-classifying a Record, store details on a log file to log the details of the auto-classification process to a generated log file. By default, the log file is written to C:\Micro Focus Content Manager\ServerLocalData\TRIM\<DBID>\Log\StoredAutoClassifyDetails.log. The log includes the Content Index engine confidence level and the terms it matched the record against, which should explain quite clearly why the Content Index engine suggested moving the record to a particular Classification.
You will also find those details in the log file when the auto-classification process failed because the Content Index engine was not confident enough to suggest a Classification for a record.
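
One way to use the stored confidence levels when adjusting a minimum confidence level is to tabulate how many reviewed records each candidate threshold would have filed, and how many of those suggestions were correct. The Python sketch below assumes that you have collected the Auto-Classification Confidence Level of some records and judged each suggestion by hand; the data layout, the sample values and the function name are hypothetical.

```python
def threshold_report(reviewed, thresholds):
    """reviewed: list of (confidence 0-100, suggestion_was_correct) pairs."""
    report = []
    for threshold in thresholds:
        # Suggestions that would have been filed at this minimum level.
        filed = [correct for confidence, correct in reviewed
                 if confidence >= threshold]
        accuracy = sum(filed) / len(filed) if filed else None
        report.append((threshold, len(filed), accuracy))
    return report


# Made-up sample values, for illustration only.
reviewed = [(92, True), (81, True), (74, False), (66, True), (40, False)]
for threshold, filed, accuracy in threshold_report(reviewed, [50, 70, 90]):
    print(threshold, filed, accuracy)
```

Each row shows a candidate threshold, the number of records that would have been filed, and the share of those that were judged correct, which makes the trade-off between filing more records and filing them accurately easier to see.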

Updating Auto-Classification

The Update Auto-Classification option allows you to update the details of a Classification with regard to its use with the Auto-Classification module.

From the Manage tab, click Classification. The Classifications window will open.
Tag the Classification(s) to be updated, right-click and click Update Auto-Classification. The Update Auto-Classification dialogue will appear.
Select either:

This classification is not suitable for the auto-classification process - this will remove the selected Classification(s) from the Auto-Classification process.
Make this classification available for selection by the auto-classification process - this will add the selected Classification(s) to the Auto-Classification process.
- Minimum confidence level required - a number between 0 and 100 that indicates how certain the Content Index engine should be at a minimum when filing records under this Classification. When the Content Index engine's confidence level for a record does not meet the minimum specified here, it will not file the record under this Classification.
  With a higher value, the Content Index engine only files records under this Classification when it is quite sure. With a lower value, it files more records, but probably with less accuracy. At a value of 0, the confidence level setting is completely turned off. You can probably turn off the setting after the initial learning period, when you are confident that the Content Index engine files records under this Classification correctly.

Click OK to save changes.