IDOL Server stores document text as a series of tokens. Generally, a token is a word, but it can also include other strings of characters (such as a phone number or e-mail address).
During the index process, IDOL Server converts the text into tokens for matching, and stores the tokens in Index fields. The following sections describe how IDOL Server converts text into tokens, and how you can manipulate the results.
IDOL Server processes characters according to their common use, and uses them to define tokens in text.
IDOL Server contains a default set of definitions, which is appropriate in most cases. You might want to change the default definitions if your data contains special strings that you want to search for, such as e-mail addresses. See Example: E-mail Addresses.
In IDOL Server, the following three types of character define tokens and the breaks between tokens:
|Text character||Letters and numbers, including logograph characters from Asian writing systems.|
|Separator character||Characters that separate two words, such as spaces, tabs, and line breaks.|
|Non-separator character||Other characters, such as punctuation.|
When you index data into IDOL Server:
|Text characters||Do not change.|
|Separator characters||Become spaces, which mark the break between one token and the next.|
|Non-separator characters||Are deleted. If text is separated only by non-separator characters, it becomes a single token.|
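The rules in the two tables above can be condensed into a short Python sketch. This is a toy model, not IDOL Server's implementation: the tokenize function and its separators argument are hypothetical names, the sample text is invented, and tokens are uppercased only to match the way this page writes them.

```python
def tokenize(text, separators=" \t\r\n"):
    """Toy model of the three character classes; not IDOL Server's own code."""
    kept = []
    for ch in text:
        if ch.isalnum():
            kept.append(ch.upper())   # text character: kept
        elif ch in separators:
            kept.append(" ")          # separator: becomes a token break
        # anything else is a non-separator character and is dropped
    return "".join(kept).split()


sample = "re-index the A.B.C. feed"   # hypothetical sample text
print(tokenize(sample))
# ['REINDEX', 'THE', 'ABC', 'FEED']
print(tokenize(sample, separators=" \t\r\n-."))
# ['RE', 'INDEX', 'THE', 'A', 'B', 'C', 'FEED']
```

In the first call the hyphen and the periods are non-separators, so the text around them fuses into single tokens. In the second call they are treated as separators, so each piece becomes its own token.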
Consider the following e-mail address:
In the default IDOL Server configuration:
The at symbol (@) is a separator character.
The period (.) is a non-separator character.
When IDOL Server processes the e-mail address, it produces two tokens:
EXAMPLECOM, which you can search for to return this document.
However, if you search for EXAMPLE, this document is not returned. If you need to be able to search for these terms, you must alter the default configured separator characters. For this example, you define the period (.) as a separator character.
To add characters to the list of text characters, use the TangibleCharacters configuration parameter. For example, for indexing social media documents, you might want to be able to include a hashtag (#) or @ symbol. For more information, refer to the IDOL Server Reference.
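As a rough illustration of what promoting a character to a text character does, the Python sketch below extends a toy tokenizer with a tangible argument. The function, its arguments, and the sample post are all hypothetical. The sketch models the effect only; it is not IDOL Server configuration syntax or its real tokenizer.

```python
def tokenize(text, tangible="", separators=" \t\r\n"):
    """Toy model: characters listed in `tangible` count as text characters."""
    kept = []
    for ch in text:
        if ch.isalnum() or ch in tangible:
            kept.append(ch.upper())   # text (or promoted) character: kept
        elif ch in separators:
            kept.append(" ")          # separator: becomes a token break
        # remaining characters are non-separators and are dropped
    return "".join(kept).split()


post = "Thanks @ana for the #search tips"   # hypothetical social media post
print(tokenize(post))
# ['THANKS', 'ANA', 'FOR', 'THE', 'SEARCH', 'TIPS']
print(tokenize(post, tangible="#@"))
# ['THANKS', '@ANA', 'FOR', 'THE', '#SEARCH', 'TIPS']
```

With the two symbols promoted, a search for #SEARCH can be distinguished from a search for the plain word SEARCH.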
You can use the additional setting HyphenChars to index both the subparts and the whole of a hyphenated word.
For example, if you set HyphenChars=. then the text:
is tokenized into four terms,
JOESMITH. Searches for Joe, Joe Smith, Joe.Smith, and joesmith all match the document.
However, the subparts of the hyphenated terms are indexed without a position. This means that you cannot match them with a phrase or proximity query. For example, for the text above, a query for the phrase email joe does not match.
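The following Python sketch models why subparts without a position cannot take part in a phrase query. It is a simplified illustration under stated assumptions, not IDOL Server's index format: index_terms and phrase_matches are hypothetical helpers, the sample text is invented, and a missing position is represented as None.

```python
def index_terms(text, hyphen_chars="."):
    """Toy model of hyphenation: returns (term, position) pairs."""
    terms = []
    for position, word in enumerate(text.upper().split()):
        parts = [word]
        for ch in hyphen_chars:
            parts = [piece for part in parts for piece in part.split(ch)]
        parts = [p for p in parts if p]
        if len(parts) > 1:
            terms.append(("".join(parts), position))  # whole word keeps its position
            terms.extend((p, None) for p in parts)    # subparts get no position
        elif parts:
            terms.append((parts[0], position))
    return terms


def phrase_matches(terms, phrase):
    """True if the phrase words occur at consecutive known positions."""
    words = phrase.upper().split()
    if not words:
        return False
    positions = {}
    for term, pos in terms:
        if pos is not None:
            positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(w, ()) for i, w in enumerate(words))
        for start in positions.get(words[0], ())
    )


terms = index_terms("contact ana.lee today")   # hypothetical sample text
print(terms)
# [('CONTACT', 0), ('ANALEE', 1), ('ANA', None), ('LEE', None), ('TODAY', 2)]
print("ANA" in {t for t, _ in terms})            # True: a plain term search matches
print(phrase_matches(terms, "contact ana"))      # False: ANA has no position
print(phrase_matches(terms, "contact analee"))   # True: the whole word has one
```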
For this reason, Micro Focus recommends that you turn hyphenation off in almost all cases, by setting HyphenChars to NONE, and instead use AugmentSeparators. For example:
This configuration matches all the same queries, except for joesmith, which is a rarely used query. The benefits of AugmentSeparators generally greatly outweigh the value of this type of query.
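The trade-off can be sketched in Python by treating the extra character as a separator, so that every piece becomes an ordinary token with its own position. This is a toy model with hypothetical helper names and an invented sample text. It assumes that query text is tokenized with the same rules as document text.

```python
def tokens(text, extra_separators="."):
    """Toy model: extra separators become spaces, so each piece is a full token."""
    for ch in extra_separators:
        text = text.replace(ch, " ")
    return text.upper().split()


def phrase_matches(document, phrase, extra_separators="."):
    """True if the tokenized phrase appears as a consecutive run in the document."""
    doc = tokens(document, extra_separators)
    words = tokens(phrase, extra_separators)
    n = len(words)
    return n > 0 and any(doc[i:i + n] == words for i in range(len(doc) - n + 1))


document = "contact ana.lee today"   # hypothetical sample text
print(tokens(document))                          # ['CONTACT', 'ANA', 'LEE', 'TODAY']
print(phrase_matches(document, "contact ana"))   # True: ANA now has a position
print(phrase_matches(document, "ana.lee"))       # True: query is split the same way
print(phrase_matches(document, "analee"))        # False: the fused form is not indexed
```

The only query lost is the fused form, which mirrors the exception described above.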
The NumberPunctuation option specifies characters that must be treated differently in numeric and alphabetic terms. The setting defines characters to treat as TangibleCharacters when they appear with a numeric character on either side. For example, if the period (.) is a separator:
123.com is tokenized as
3.14 is tokenized as
You can set the NumberPunctuation character individually for different languages.
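The behavior of number punctuation can be approximated with the Python sketch below. It is a toy model, not IDOL Server's tokenizer. It assumes that "on either side" means a digit both immediately before and immediately after the character. The function name, its arguments, and the sample text are hypothetical.

```python
def tokenize(text, number_punctuation=".", separators=" \t\r\n."):
    """Toy model: a number-punctuation character between two digits is kept."""
    kept = []
    for i, ch in enumerate(text):
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if ch.isalnum():
            kept.append(ch.upper())   # text character: kept
        elif ch in number_punctuation and before.isdigit() and after.isdigit():
            kept.append(ch)           # assumption: digits on both sides keep it
        elif ch in separators:
            kept.append(" ")          # otherwise it acts as a separator
        # remaining characters are non-separators and are dropped
    return "".join(kept).split()


sample = "release 2.5 of report.pdf"   # hypothetical sample text
print(tokenize(sample))
# ['RELEASE', '2.5', 'OF', 'REPORT', 'PDF']
print(tokenize(sample, number_punctuation=""))
# ['RELEASE', '2', '5', 'OF', 'REPORT', 'PDF']
```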