Tokenization

IDOL Server stores document text as a series of tokens. Generally, a token is a word, but it can also include other strings of characters (such as a phone number or e-mail address).

During the index process, IDOL Server converts the text into tokens for matching, and stores the tokens in Index fields. The following sections describe how IDOL Server converts text into tokens, and how you can manipulate the results.

Types of Characters

IDOL Server classifies each character according to how it is commonly used, and uses that classification to define the tokens in text.

IDOL Server contains a default set of definitions, which is appropriate in most cases. You might want to change the default definitions if your data contains special strings that you want to search for, such as e-mail addresses. See Example: E-mail Addresses.

In IDOL Server, the following three character types define tokens and the breaks between them:

Character type	Description
Text character	Letters and numbers, including logographic characters from Asian writing systems.
Separator character	Characters that separate two words, such as spaces, tabs, and line breaks.
Non-separator character	Other characters, such as punctuation.

When you index data into IDOL Server:

Text characters	Do not change.
Separator characters	Become spaces, which mark the break between one token and the next.
Non-separator characters	Are deleted. If text is separated only by non-separator characters, it becomes a single token.
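The rules above can be illustrated with a minimal Python sketch. This is a simplified model, not IDOL Server's implementation: the tokenize function and the SEPARATORS set are hypothetical names, the separator set is an assumed subset, and tokens are upper-cased only to mirror the way this topic writes them.

```python
# Simplified model of the three character types; not IDOL Server code.
SEPARATORS = set(" \t\r\n")  # assumed separator set for this sketch


def tokenize(text, separators=SEPARATORS):
    """Split text into tokens using the three character types."""
    out = []
    for ch in text:
        if ch.isalnum():
            out.append(ch.upper())  # text character: kept (upper-cased for display)
        elif ch in separators:
            out.append(" ")         # separator character: becomes a space
        # any other character is a non-separator character and is deleted
    return "".join(out).split()


print(tokenize("state-of-the-art search, fast"))
# ['STATEOFTHEART', 'SEARCH', 'FAST']
```

Because the hyphen and comma are non-separator characters in this sketch, state-of-the-art collapses into a single token.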

Example: E-mail Addresses

Consider the following e-mail address:

joe.smith@example.com

In the default IDOL Server configuration:

The at symbol (@) is a separator character.
The period (.) is a non-separator character.

When IDOL Server processes the e-mail address, it produces two tokens: JOESMITH and EXAMPLECOM, which you can search for to return this document.

However, if you search for JOE or EXAMPLE, this document is not returned. If you need to search for these terms, you must change the default separator characters. In this example, you would define the period (.) as a separator character, so that the address produces the tokens JOE, SMITH, EXAMPLE, and COM.
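The following Python sketch models the e-mail example under both settings. It is an illustration only: tokenize and DEFAULT_SEPARATORS are hypothetical names, and the separator set is an assumed subset of the defaults that includes the at symbol (@), as described above.

```python
def tokenize(text, separators):
    """Keep letters and digits, turn separators into spaces, drop the rest."""
    out = []
    for ch in text:
        if ch.isalnum():
            out.append(ch.upper())
        elif ch in separators:
            out.append(" ")
    return "".join(out).split()


# Assumed subset of the default separators, for illustration only.
DEFAULT_SEPARATORS = {" ", "\t", "\n", "@"}

address = "joe.smith@example.com"
default_tokens = tokenize(address, DEFAULT_SEPARATORS)
period_tokens = tokenize(address, DEFAULT_SEPARATORS | {"."})

print(default_tokens)  # ['JOESMITH', 'EXAMPLECOM']
print(period_tokens)   # ['JOE', 'SMITH', 'EXAMPLE', 'COM']
```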

To add characters to the list of text characters, use the TangibleCharacters configuration parameter. For example, when you index social media documents, you might want the number sign (#) or the at symbol (@) to be part of a token, so that you can search for hashtags and user names. For more information, refer to the IDOL Server Reference.
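As a rough sketch of the effect of adding text characters, the following Python model accepts an extra set of characters to keep. The tokenize function and its tangible argument are hypothetical, and the sketch assumes that a character listed as tangible is kept even if it would otherwise be a separator; check the IDOL Server Reference for the actual behavior.

```python
def tokenize(text, separators=" \t\n@", tangible=""):
    """Simplified model: characters in `tangible` count as text characters."""
    out = []
    for ch in text:
        if ch.isalnum() or ch in tangible:  # assumption: tangible wins over separator
            out.append(ch.upper())
        elif ch in separators:
            out.append(" ")
    return "".join(out).split()


post = "Loving #idol search, thanks @joe!"
print(tokenize(post))                 # ['LOVING', 'IDOL', 'SEARCH', 'THANKS', 'JOE']
print(tokenize(post, tangible="#@"))  # ['LOVING', '#IDOL', 'SEARCH', 'THANKS', '@JOE']
```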

Hyphen Characters

You can use the additional setting HyphenChars to index both the subparts and the whole of a hyphenated word.

For example, if you set HyphenChars=. then the text:

email joe.smith

is tokenized into four terms: EMAIL, JOE, SMITH, and JOESMITH. Searches for Joe, Joe Smith, Joe.Smith, and joesmith all match the document.

However, the subparts of the hyphenated terms are indexed without a position. This means that you cannot match them with a phrase or proximity query. For example, for the text above, a query for the phrase email joe does not match.
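The effect of indexing subparts without a position can be modeled with a small Python sketch. This is an assumed, simplified picture of an index, not IDOL Server's data structure: index_with_hyphen_chars and phrase_matches are hypothetical helpers, and a position of None stands for "no position stored".

```python
def index_with_hyphen_chars(text, hyphen_chars="."):
    """Return (term, position) pairs; subparts of a hyphenated word get None."""
    postings = []
    position = 0
    for word in text.split():
        parts = [word]
        for h in hyphen_chars:
            parts = [p for chunk in parts for p in chunk.split(h)]
        parts = [p.upper() for p in parts if p]
        if not parts:
            continue
        postings.append(("".join(parts), position))     # whole word keeps its position
        if len(parts) > 1:
            postings.extend((p, None) for p in parts)   # subparts have no position
        position += 1
    return postings


def phrase_matches(postings, phrase):
    """True if all phrase terms occur at consecutive stored positions."""
    terms = phrase.upper().split()
    if not terms:
        return False
    positions = {}
    for term, pos in postings:
        if pos is not None:
            positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(t, set()) for i, t in enumerate(terms))
        for start in positions.get(terms[0], set())
    )


postings = index_with_hyphen_chars("email joe.smith")
print(postings)
# [('EMAIL', 0), ('JOESMITH', 1), ('JOE', None), ('SMITH', None)]
print(phrase_matches(postings, "email joesmith"))  # True
print(phrase_matches(postings, "email joe"))       # False
```

A plain term lookup for JOE still succeeds in this model, because the term is present; only position-dependent matching fails.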

For this reason, Micro Focus recommends that you turn hyphenation off in almost all cases, by setting HyphenChars to NONE, and instead use AugmentSeparators. For example:

AugmentSeparators=.

This configuration matches all the same queries except joesmith, which is a rarely used type of query. Because every term keeps its position, phrase and proximity queries also work, and this benefit generally outweighs the value of matching the joined form.
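To see which queries match when the period is a separator, consider this Python sketch. It is a simplified model with hypothetical helper names (tokenize, contains_sequence); it assumes that the query is tokenized in the same way as the document, and that a multi-term query is matched as a phrase.

```python
def tokenize(text, separators=" \t\n."):
    """Simplified tokenizer in which the period is a separator."""
    out = []
    for ch in text:
        if ch.isalnum():
            out.append(ch.upper())
        elif ch in separators:
            out.append(" ")
    return "".join(out).split()


def contains_sequence(doc_tokens, query_tokens):
    """True if query_tokens appear consecutively in doc_tokens."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = tokenize("email joe.smith")  # ['EMAIL', 'JOE', 'SMITH']
results = {
    query: contains_sequence(doc, tokenize(query))
    for query in ["Joe", "Joe Smith", "Joe.Smith", "email joe", "joesmith"]
}
print(results)
# {'Joe': True, 'Joe Smith': True, 'Joe.Smith': True, 'email joe': True, 'joesmith': False}
```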

Number Punctuation

The NumberPunctuation option specifies characters that are treated differently in numeric and alphabetic terms. IDOL Server treats these characters as text characters (as if they were set in TangibleCharacters) when they have a numeric character on both sides. For example, if the period (.) is a NumberPunctuation character and is otherwise a separator:

123.com is tokenized as 123 and COM.
3.14 is tokenized as 3.14.
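The two cases above can be reproduced with a short Python sketch. It is an assumed model of the rule, not IDOL Server's implementation: the tokenize function and its number_punctuation argument are hypothetical, and the sketch keeps a listed character only when the characters immediately before and after it are digits.

```python
def tokenize(text, separators=" \t\n.", number_punctuation="."):
    """Simplified model of number punctuation between two digits."""
    out = []
    for i, ch in enumerate(text):
        prev_is_digit = i > 0 and text[i - 1].isdigit()
        next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
        if ch.isalnum():
            out.append(ch.upper())
        elif ch in number_punctuation and prev_is_digit and next_is_digit:
            out.append(ch)   # kept as part of the numeric token
        elif ch in separators:
            out.append(" ")  # otherwise it behaves as a separator
    return "".join(out).split()


print(tokenize("123.com"))  # ['123', 'COM']
print(tokenize("3.14"))     # ['3.14']
```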

TIP:

You can set the NumberPunctuation character individually for different languages.
