Term Analysis

Term analysis of a document or set of documents returns statistical information about the terms in the documents. You can use this information to perform conceptual operations, and it is integral to additional functionality such as categorization, clustering, and profiling.

The TermGetAll action allows you to list the terms that are stored in IDOL Server. You can also view information about terms in the indexed data on the Terms tab on the Performance page in the Monitor section of IDOL Admin.

You can also return additional information about the terms in your documents:

Display Details About the Terms in IDOL Server

You can return details about the types of terms that occur in IDOL Server by setting the TermAnalysis parameter to True in the TermGetAll action.

TermAnalysis returns the following information, in <autn:termanalysis> tags:

Detail	Description
`Terms`	The total number of terms in IDOL Server.
`Numeric`	The number of purely numeric terms. For example, 13570.
`Alphanumeric`	The number of alphanumeric terms. This count excludes purely numeric terms. For example, A739B or 12a.
`Multibyte`	The number of terms that include at least one multibyte character. For example, café or ψάρι.
`Dococcs logn`	The number of terms that occur in between 2ⁿ⁻¹+1 and 2ⁿ documents. For each item, the number in quotes (`""`) is the value of `n`. For example: `<autn:dococcs logn="0">` is the number of terms that occur in 2⁰ documents (1 document). `<autn:dococcs logn="1">` is the number of terms that occur in between 2⁰+1 and 2¹ documents (2 documents). `<autn:dococcs logn="2">` is the number of terms that occur in between 2¹+1 and 2² documents (3 to 4 documents). `<autn:dococcs logn="3">` is the number of terms that occur in between 2²+1 and 2³ documents (5 to 8 documents).
`Length len`	The number of terms of each length. NOTE: This length counts a multibyte character as length 1. For example, the word café is counted as length 4, even though the UTF-8 string is 5 bytes.
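The bucket arithmetic for `Dococcs logn` and the character-based length rule can be checked with a short sketch. The function names `dococcs_bucket` and `term_length` are hypothetical helpers written for illustration; they are not part of IDOL Server, and they only mirror the rules described in the table above.

```python
def dococcs_bucket(doc_count: int) -> int:
    """Return n such that 2**(n-1) + 1 <= doc_count <= 2**n.

    A term that occurs in exactly one document falls in bucket n = 0.
    """
    if doc_count < 1:
        raise ValueError("a term must occur in at least one document")
    # For positive integers, (doc_count - 1).bit_length() equals
    # the ceiling of log2(doc_count), without floating-point rounding.
    return (doc_count - 1).bit_length()


def term_length(term: str) -> int:
    """Count characters, so a multibyte character has length 1."""
    return len(term)


# Bucket assigned to each document count from 1 to 8.
buckets = {count: dococcs_bucket(count) for count in range(1, 9)}

# "café" written with an explicit escape for the precomposed é.
word = "caf\u00e9"
char_length = term_length(word)          # characters
byte_length = len(word.encode("utf-8"))  # UTF-8 bytes
```

Here `buckets` maps 1 to bucket 0, 2 to bucket 1, 3 and 4 to bucket 2, and 5 through 8 to bucket 3, matching the examples in the table, and the character length of café is 4 while its UTF-8 encoding is 5 bytes.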

To return only the analysis information, send TermGetAll with TermAnalysis set to True, and MaxTerms set to 0. For an example response, see Example TermGetAll TermAnalysis Response.
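As an illustration, the following sketch assembles the request described above as a URL string. The host name, the port, and the `http://host:port/action=...` layout are assumptions for this example; substitute the values and request format that your own IDOL Server deployment uses. The helper `build_action_url` is hypothetical and only builds the string, it does not send anything.

```python
from urllib.parse import urlencode


def build_action_url(host: str, port: int, action: str, **params) -> str:
    """Build an action URL of the assumed form http://host:port/action=Name&Param=Value."""
    # The action name goes first, followed by the parameters in the order given.
    query = urlencode({"action": action, **params})
    return f"http://{host}:{port}/{query}"


# Placeholder host and port; TermAnalysis and MaxTerms come from the text above.
url = build_action_url(
    "localhost", 9000, "TermGetAll", TermAnalysis="True", MaxTerms=0
)
```

Setting `MaxTerms` to 0 in this way asks for the analysis information only, without the list of terms.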

Find the Most Common Terms

You can use the TermGetAll action to find the most common terms in IDOL Server.

Set Type to TrueOccs to return terms in order of the number of times they occur in IDOL Server. The most common terms are listed first.

<autnresponse>
  <action>TERMGETALL</action>
  <response>SUCCESS</response>
  <responsedata>
    <autn:number_of_terms>3675990</autn:number_of_terms>
    <autn:term document_occurrences="6117" total_occurrences="11798">ON</autn:term>
    <autn:term document_occurrences="4973" total_occurrences="11690">NEW</autn:term>
    <autn:term document_occurrences="5387" total_occurrences="9964">FIRST</autn:term>
    <autn:term document_occurrences="4923" total_occurrences="9637">YEAR</autn:term>
    <autn:term document_occurrences="4060" total_occurrences="9237">STATE</autn:term>
    <autn:term document_occurrences="4740" total_occurrences="8335">TIME</autn:term>
    <autn:term document_occurrences="3862" total_occurrences="7939">US</autn:term>
    <autn:term document_occurrences="4801" total_occurrences="7606">TWO</autn:term>
    <autn:term document_occurrences="3936" total_occurrences="7587">NAME</autn:term>
    <autn:term document_occurrences="3133" total_occurrences="7485">UNIT</autn:term>
    ...
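A response of this shape can be read with a standard XML parser. The sketch below is a minimal example, not an official client: it uses a shortened, well-formed copy of the response above, and it assumes that the `autn` prefix is bound to the namespace URI shown in `AUTN_NS`. Check the namespace declaration in your own responses before relying on it. The function name `parse_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the autn prefix; verify against a real response.
AUTN_NS = "http://schemas.autonomy.com/aci/"

# Shortened copy of the example response, closed so that it parses.
SAMPLE = """<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>TERMGETALL</action>
  <response>SUCCESS</response>
  <responsedata>
    <autn:number_of_terms>3675990</autn:number_of_terms>
    <autn:term document_occurrences="6117" total_occurrences="11798">ON</autn:term>
    <autn:term document_occurrences="4973" total_occurrences="11690">NEW</autn:term>
  </responsedata>
</autnresponse>"""


def parse_terms(xml_text: str) -> list[tuple[str, int, int]]:
    """Return (term, document_occurrences, total_occurrences) in response order."""
    root = ET.fromstring(xml_text)
    terms = []
    for element in root.iter(f"{{{AUTN_NS}}}term"):
        terms.append(
            (
                element.text,
                int(element.get("document_occurrences")),
                int(element.get("total_occurrences")),
            )
        )
    return terms


terms = parse_terms(SAMPLE)
```

Because the terms are kept in response order, the first tuple in `terms` is the most common term when the action was sent with Type set to TrueOccs.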

TIP:

By default, TermGetAll returns the stemmed terms. To return the unstemmed forms, you can set the Stemming parameter to False in the TermGetAll action. See Stemming With Term Actions.

Find the Documents with the Most or Least Terms

You can use the FieldText parameter of an action along with document metadata fields to restrict the documents returned. For example:

This action returns the documents that contribute to the count in the <autn:distincttermsperdoc logn="1"> tag.

This action returns the documents that contribute to the count in the <autn:termsperdoc logn="10"> tag.

You can use this method to investigate potentially corrupted documents. For example, documents in the <autn:distincttermsperdoc lognx="20"> segment have roughly 2²⁰ (over one million) distinct terms. The Oxford English Dictionary second edition has approximately 600,000 definitions, so an English document in this segment has many more distinct terms than there are English words. This document might contain a high number of garbage terms, or numeric and alphanumeric terms.

At the other end, documents in the <autn:distincttermsperdoc lognx="0"> segment have only one distinct term. This segment might also indicate a problem, or missing data.
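The check described above can be sketched as a small filter over the segment counts. The counts in `segment_counts` are invented for illustration, the function name `suspicious_segments` is hypothetical, and the default threshold of 20 simply follows the example in the text; choose a threshold that suits your own data.

```python
def suspicious_segments(segment_counts: dict[int, int], high_logn: int = 20) -> list[int]:
    """Return the segment values worth investigating.

    A segment is flagged when it contains at least one document and is either
    segment 0 (a single distinct term) or at or above the high threshold.
    """
    flagged = []
    for logn, count in sorted(segment_counts.items()):
        if count > 0 and (logn == 0 or logn >= high_logn):
            flagged.append(logn)
    return flagged


# Made-up example: segment value -> number of documents in that segment.
segment_counts = {0: 3, 5: 1200, 9: 48000, 20: 1}
flagged = suspicious_segments(segment_counts)
```

With these invented counts, segments 0 and 20 are flagged, and you would then retrieve the documents in those segments to inspect them.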

In a TermGetAll response, each <autn:term> element has the following attributes:

Attribute	Description
`document_occurrences`	The number of documents that the term occurs in.
`total_occurrences`	The total number of times that the term occurs in all documents in IDOL Server.