Term Analysis

Term analysis of a document or set of documents returns statistical information about the terms in those documents. You can use this information to perform conceptual operations, and it is integral to additional functionality such as categorization, clustering, and profiling.

The TermGetAll action allows you to list the terms that are stored in IDOL Server. You can also view information about terms in the indexed data on the Terms tab on the Performance page in the Monitor section of IDOL Admin.

You can also return additional information about the terms in your documents:

Display Details About the Terms in IDOL Server

You can return details about the types of terms that occur in IDOL Server by setting the TermAnalysis parameter to True in the TermGetAll action.

TermAnalysis returns the following information, in <autn:termanalysis> tags:

Detail          Description
Terms           The total number of terms in IDOL Server.
Numeric         The number of purely numeric terms. For example, 13570.
Alphanumeric    The number of alphanumeric terms. This count excludes purely numeric terms. For example, A739B or 12a.
Multibyte       The number of terms that include at least one multibyte character. For example, café or ψάρι.
Dococcs logn

The number of terms that occur in between 2^(n-1)+1 and 2^n documents. For each item, the number in quotes ("") is the value of n. For example:

  • <autn:dococcs logn="0"> is the number of terms that occur in 2^0 documents (1 document).

  • <autn:dococcs logn="1"> is the number of terms that occur in between 2^0+1 and 2^1 (2) documents.

  • <autn:dococcs logn="2"> is the number of terms that occur in between 2^1+1 and 2^2 (3-4) documents.

  • <autn:dococcs logn="3"> is the number of terms that occur in between 2^2+1 and 2^3 (5-8) documents.

Length len

The number of terms of each length.

NOTE:

This length counts a multibyte character as length 1. For example, the word café is counted as length 4, even though the UTF-8 string is 5 bytes.
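
The difference between character length and byte length in the note above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that "length" means the number of Unicode code points; it does not call IDOL Server.

```python
def char_length(term: str) -> int:
    """Length in characters (Unicode code points); assumed to match the NOTE."""
    return len(term)


def utf8_byte_length(term: str) -> int:
    """Length of the same term when encoded as UTF-8 bytes."""
    return len(term.encode("utf-8"))


# "café" written with a precomposed é (U+00E9), which takes 2 bytes in UTF-8.
term = "caf\u00e9"
print(char_length(term), utf8_byte_length(term))
```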

To return only the analysis information, send TermGetAll with TermAnalysis set to True, and MaxTerms set to 0. For an example response, see Example TermGetAll TermAnalysis Response.
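
The Dococcs logn ranges described above can be sketched in Python as follows. This is an illustration of the arithmetic only: the function names are hypothetical, and treating logn="0" as exactly one document follows the example given for that value rather than the general formula.

```python
def dococcs_range(n: int) -> tuple[int, int]:
    """Return the (lowest, highest) document count covered by logn = n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        # Special case taken from the logn="0" example: exactly 1 document.
        return (1, 1)
    return (2 ** (n - 1) + 1, 2 ** n)


def dococcs_logn(doc_count: int) -> int:
    """Return the logn value whose range contains doc_count."""
    if doc_count < 1:
        raise ValueError("doc_count must be at least 1")
    # Smallest n such that 2**n >= doc_count.
    return (doc_count - 1).bit_length()


for n in range(4):
    low, high = dococcs_range(n)
    print(f'logn="{n}": {low} to {high} documents')
```

For example, under these assumptions a term that occurs in 6 documents is counted in the logn="3" item, because 6 falls in the range 5-8.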

Find the Most Common Terms

You can use the TermGetAll action to find the most common terms in IDOL Server.

Set Type to TrueOccs to return terms in order of the number of times they occur in IDOL Server. The most common terms are listed first.

For example:

action=TermGetAll&Type=TrueOccs&MaxTerms=100

This action returns the 100 most common terms in IDOL Server. For example:

<autnresponse>
   <action>TERMGETALL</action>
   <response>SUCCESS</response>
   <responsedata>
      <autn:number_of_terms>3675990</autn:number_of_terms>
      <autn:term document_occurrences="6117" total_occurrences="11798">ON</autn:term>
      <autn:term document_occurrences="4973" total_occurrences="11690">NEW</autn:term>
      <autn:term document_occurrences="5387" total_occurrences="9964">FIRST</autn:term>
      <autn:term document_occurrences="4923" total_occurrences="9637">YEAR</autn:term>
      <autn:term document_occurrences="4060" total_occurrences="9237">STATE</autn:term>
      <autn:term document_occurrences="4740" total_occurrences="8335">TIME</autn:term>
      <autn:term document_occurrences="3862" total_occurrences="7939">US</autn:term>
      <autn:term document_occurrences="4801" total_occurrences="7606">TWO</autn:term>
      <autn:term document_occurrences="3936" total_occurrences="7587">NAME</autn:term>
      <autn:term document_occurrences="3133" total_occurrences="7485">UNIT</autn:term>
...

where,

document_occurrences

is the number of documents that the term occurs in.

total_occurrences

is the total number of times that the term occurs in all documents in IDOL Server.
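
As an illustration of how the two attributes might be read programmatically, the following Python sketch parses a shortened copy of the response above. The namespace URI bound to the autn prefix, the closing tags added to make the sample well-formed, and the parse_terms helper are all assumptions made for this example; check them against a real response from your server.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the autn prefix; verify against a real response.
AUTN_NS = "http://schemas.autonomy.com/aci/"

# Shortened sample based on the response above, closed so that it parses.
SAMPLE = f"""<autnresponse xmlns:autn="{AUTN_NS}">
   <action>TERMGETALL</action>
   <response>SUCCESS</response>
   <responsedata>
      <autn:number_of_terms>3675990</autn:number_of_terms>
      <autn:term document_occurrences="6117" total_occurrences="11798">ON</autn:term>
      <autn:term document_occurrences="4973" total_occurrences="11690">NEW</autn:term>
   </responsedata>
</autnresponse>"""


def parse_terms(xml_text: str) -> list[tuple[str, int, int]]:
    """Return (term, document_occurrences, total_occurrences) for each term."""
    root = ET.fromstring(xml_text)
    terms = []
    for element in root.iter(f"{{{AUTN_NS}}}term"):
        terms.append((
            element.text,
            int(element.get("document_occurrences")),
            int(element.get("total_occurrences")),
        ))
    return terms


for term, docs, total in parse_terms(SAMPLE):
    # Average number of occurrences in each document that contains the term.
    print(f"{term}: {total / docs:.2f} occurrences per containing document")
```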

TIP:

By default, TermGetAll returns the stemmed terms. To return the unstemmed forms, you can set the Stemming parameter to False in the TermGetAll action. See Stemming With Term Actions.

Find the Documents with the Most or Least Terms

You can use the FieldText parameter of an action along with document metadata fields to restrict the documents returned. For example:

action=Query&FieldText=EQUAL{1}:autn_distincttermsperdoc

This action returns the documents that contribute to the count in the <autn:distincttermsperdoc logn="1"> tag.

action=Query&FieldText=EQUAL{10}:autn_termsperdoc

This action returns the documents that contribute to the count in the <autn:termsperdoc logn="10"> tag.

You can use this method to investigate potentially corrupted documents. For example, documents in the <autn:distincttermsperdoc logn="20"> segment have roughly 2^20 (over one million) distinct terms. The Oxford English Dictionary second edition has approximately 600,000 definitions, so an English document in this segment has many more distinct terms than there are English words. Such a document might contain a high number of garbage terms, or numeric and alphanumeric terms.

At the other end, documents in the <autn:distincttermsperdoc logn="0"> segment have only one distinct term. This segment might also indicate a problem, or missing data.
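
The FieldText restrictions shown in this section can be generated for any segment with a small helper. The following Python sketch only builds the action string and sends nothing to IDOL Server; the function names are hypothetical, and percent-encoding the braces and colon is an assumption about what your HTTP client might need rather than a documented requirement.

```python
from urllib.parse import quote


def logn_fieldtext(field: str, n: int) -> str:
    """Build an EQUAL restriction for one logn segment of a metadata field."""
    return f"EQUAL{{{n}}}:{field}"


def query_action(field: str, n: int, encode: bool = False) -> str:
    """Build a Query action string, optionally percent-encoding the FieldText."""
    fieldtext = logn_fieldtext(field, n)
    if encode:
        # Encode every reserved character, including braces and the colon.
        fieldtext = quote(fieldtext, safe="")
    return f"action=Query&FieldText={fieldtext}"


# Documents with the most distinct terms, and those with only one.
print(query_action("autn_distincttermsperdoc", 20))
print(query_action("autn_distincttermsperdoc", 0))
```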

 

