Word indexing rules and behaviour
IMPORTANT: These rules only apply when you are not using SQL Text Indexing.
Content Manager indexes the words in metadata fields such as Title, Notes and additional text fields as quickly and as comprehensively as possible. It indexes all possible word and character combinations, including email addresses and URLs.
When parsing text into words, Content Manager does not try to determine whether a block of text is one word or two, for example because of a slash (/) in the middle.
Instead, it simply indexes all possibilities.
The parsing algorithm also recognises email addresses and internet URLs.
Algorithm description
- First, the text is broken up into blocks separated by white space characters.
The characters considered to be white space are space, tab, carriage return and line feed.
- Each block is processed by first removing trailing sentence-ending punctuation marks, e.g. full stops, commas, colons, question marks and exclamation marks.
This leaves a block that may contain leading non-alphanumeric characters, trailing non-alphanumeric characters and non-alphanumeric characters within the string, e.g. !backup.dat#.
- The block is examined to see if it begins with a URL specifier followed by ://.
If it does and all the characters to the left of the :// are alphanumeric, then this specifier is stripped from the block before further processing and the block is marked as being a URL.
If the block is marked as a URL, it is scanned for a forward slash (/) or colon (:). If one is found, the host name part to the left of the forward slash or colon is indexed as a word, and the sub-directory part to the right of the first forward slash is indexed as a word.
- At this point, the entire block is added to the word list for indexing as a single word.
- The block is examined to see if it contains the at (@) symbol. If it does, and both the first and last characters of the block are alphanumeric, it is marked as being an email address.
If the block is marked as an email address, the name to the left of the at (@) is indexed as a word and the domain to the right of the at (@) is indexed as a word.
- Now the block is split into words using any non-alphanumeric characters as word breaks.
Each resulting word is added to the word list for indexing.
- Repeat the six steps above, first stripping one trailing non-alphanumeric character at a time from the block until the last character is alphanumeric.
Then repeat the six steps above, stripping one non-alphanumeric character at a time from the front of the block. For example, for #$data$#, Content Manager parses #$data$#, #$data$, #$data, $data$#, data$#, data$ and data.
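The steps above can be sketched in a few lines of Python. This is only an illustration of the description, not Content Manager's actual implementation. The function names (index_words, trim_variants, index_block) are hypothetical, the set of sentence-ending punctuation marks is an assumed one, and terms are stored in upper case only to match the examples in this topic. The order of trimming is one reading of the last step that reproduces the #$data$# sequence given above, and the sub-directory part keeps its leading slash, as in the URL example in this topic.

```python
import re

# Assumed set of sentence-ending punctuation; the description is not exhaustive.
SENTENCE_PUNCTUATION = ".,:;?!"


def trim_variants(block):
    """Return the block plus the variants produced by stripping
    non-alphanumeric characters one at a time, first from the end,
    then from the front, then from the end of the front-trimmed block."""
    variants = [block]
    current = block
    while current and not current[-1].isalnum():
        current = current[:-1]
        variants.append(current)
    current = block
    while current and not current[0].isalnum():
        current = current[1:]
        variants.append(current)
    while current and not current[-1].isalnum():
        current = current[:-1]
        variants.append(current)
    # Drop empty strings and duplicates while keeping the order.
    return list(dict.fromkeys(v for v in variants if v))


def index_block(block, words):
    """Apply the URL, whole-block, email and word-split steps to one block."""
    # URL step: strip an alphanumeric specifier followed by "://".
    scheme, separator, rest = block.partition("://")
    if separator and scheme.isalnum():
        block = rest
        match = re.search(r"[/:]", block)
        if match:
            words.add(block[:match.start()].upper())   # host name part
            slash = block.find("/")
            if slash != -1:
                words.add(block[slash:].upper())       # sub-directory part
    if not block:
        return
    # Whole-block step.
    words.add(block.upper())
    # Email step: "@" present and both end characters alphanumeric.
    if "@" in block and block[0].isalnum() and block[-1].isalnum():
        name, _, domain = block.partition("@")
        words.add(name.upper())
        words.add(domain.upper())
    # Split step: every non-alphanumeric character is a word break.
    for word in re.split(r"[\W_]+", block):
        if word:
            words.add(word.upper())


def index_words(text):
    """Return the set of index terms for a piece of metadata text."""
    words = set()
    # White space is space, tab, carriage return and line feed.
    for raw in re.split(r"[ \t\r\n]+", text):
        block = raw.rstrip(SENTENCE_PUNCTUATION)
        for variant in trim_variants(block):
            index_block(variant, words)
    return words


if __name__ == "__main__":
    for term in sorted(index_words("Mail elmer.fudd@mycompany.com, please.")):
        print(term)
```

Because every trimmed variant is passed through all of the steps, the sketch trades a larger word list for simpler parsing, which is the same trade-off the description makes. It may produce terms that the product does not index.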
Examples
The following examples show the results that word indexing produces:
You can find elmer.fudd@mycompany.com using any of the following terms:
- ELMER.FUDD@MYCOMPANY.COM
- ELMER
- FUDD
- MYCOMPANY
- COM
You can find http://www.mycompany.com/ebiz/wibble using any of the following terms:
- WWW.MYCOMPANY.COM/EBIZ/WIBBLE
- WWW.MYCOMPANY.COM
- /EBIZ/WIBBLE
- WWW
- MYCOMPANY
- COM
You can find (ABC/123) using any of the following terms:
- ABC/123
- ABC
- 123
You can find (ABC)(DEF) using any of the following terms:
- ABC)(DEF
- ABC
- DEF