Term analysis of a document or set of documents returns statistical information on the terms in the documents. This information can in turn be used to perform conceptual operations, and it is integral to additional functionality such as categorization, clustering, and profiling.
The TermGetAll action allows you to list the terms that are stored in IDOL Server. You can also view information about terms in the indexed data on the Terms tab on the Performance page in the Monitor section of IDOL Admin.
You can also return additional information about the terms in your documents:
You can return details about the types of terms that occur in IDOL Server by setting the TermAnalysis parameter to True in the TermGetAll action.
TermAnalysis returns the following information, in <autn:termanalysis> tags:
Detail | Description |
---|---|
Terms | The total number of terms in IDOL Server. |
Numeric | The number of purely numeric terms. For example 13570. |
Alphanumeric | The number of alphanumeric terms. This count excludes purely numeric terms. For example A739B, 12a. |
Multibyte | The number of terms that include at least one multibyte character. For example, café or ψάρι. |
Dococcs logn | The number of terms that occur in between |
Length len | The number of terms of each length. NOTE: This length counts a multibyte character as length 1. For example, the word café is counted as length 4, even though the UTF-8 string is 5 bytes. |
To return only the analysis information, send TermGetAll with TermAnalysis set to True and MaxTerms set to 0. For an example response, see Example TermGetAll TermAnalysis Response.
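As a minimal sketch, the analysis-only request described above could be assembled in Python as follows. The helper name build_action_url and the base address http://localhost:9000 are assumptions for illustration only; the action and parameter names are the ones given in the text.

```python
from urllib.parse import urlencode


def build_action_url(base_url, action, **params):
    """Return a request URL of the form <base_url>/action=<Action>&<Param>=<Value>...

    build_action_url is a hypothetical helper, not part of any IDOL client library.
    """
    query = urlencode({"action": action, **params})
    return f"{base_url}/{query}"


# Hypothetical host and port; replace with the address of your own server.
analysis_url = build_action_url(
    "http://localhost:9000",
    "TermGetAll",
    TermAnalysis="True",  # return the term analysis information
    MaxTerms=0,           # return no individual terms, only the analysis
)
print(analysis_url)
```

The same helper can build the other requests on this page by changing the keyword arguments, for example Type="TrueOccs" and MaxTerms=100.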
You can use the TermGetAll action to find the most common terms in IDOL Server.
Set Type to TrueOccs to return terms in order of the number of times they occur in IDOL Server. The most common terms are listed first.
For example:
action=TermGetAll&Type=TrueOccs&MaxTerms=100
This action returns the 100 most common terms in IDOL Server. For example:
<autnresponse>
  <action>TERMGETALL</action>
  <response>SUCCESS</response>
  <responsedata>
    <autn:number_of_terms>3675990</autn:number_of_terms>
    <autn:term document_occurrences="6117" total_occurrences="11798">ON</autn:term>
    <autn:term document_occurrences="4973" total_occurrences="11690">NEW</autn:term>
    <autn:term document_occurrences="5387" total_occurrences="9964">FIRST</autn:term>
    <autn:term document_occurrences="4923" total_occurrences="9637">YEAR</autn:term>
    <autn:term document_occurrences="4060" total_occurrences="9237">STATE</autn:term>
    <autn:term document_occurrences="4740" total_occurrences="8335">TIME</autn:term>
    <autn:term document_occurrences="3862" total_occurrences="7939">US</autn:term>
    <autn:term document_occurrences="4801" total_occurrences="7606">TWO</autn:term>
    <autn:term document_occurrences="3936" total_occurrences="7587">NAME</autn:term>
    <autn:term document_occurrences="3133" total_occurrences="7485">UNIT</autn:term>
    ...
where:
document_occurrences | is the number of documents that the term occurs in. |
total_occurrences | is the total number of times that the term occurs in all documents in IDOL Server. |
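The two attributes can be read out of a response with a standard XML parser. The following Python sketch works on a trimmed copy of the response, closed here so that it parses. The namespace URI bound to the autn prefix is an assumption (the sample response does not show it), so check the root element of a real response; parse_terms is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the autn prefix; verify against a real response.
AUTN_NS = "http://schemas.autonomy.com/aci/"

# Trimmed copy of the example response, with closing tags added for illustration.
SAMPLE = f"""<autnresponse xmlns:autn="{AUTN_NS}">
  <action>TERMGETALL</action>
  <response>SUCCESS</response>
  <responsedata>
    <autn:number_of_terms>3675990</autn:number_of_terms>
    <autn:term document_occurrences="6117" total_occurrences="11798">ON</autn:term>
    <autn:term document_occurrences="4973" total_occurrences="11690">NEW</autn:term>
  </responsedata>
</autnresponse>"""


def parse_terms(xml_text):
    """Return a list of (term, document_occurrences, total_occurrences) tuples."""
    root = ET.fromstring(xml_text)
    terms = []
    for node in root.iter(f"{{{AUTN_NS}}}term"):
        terms.append((
            node.text,
            int(node.get("document_occurrences")),
            int(node.get("total_occurrences")),
        ))
    return terms


for term, doc_occs, total_occs in parse_terms(SAMPLE):
    print(term, doc_occs, total_occs)
```

Because Type=TrueOccs already orders the terms by total occurrences, the list keeps the order in which the server returned them.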
By default, TermGetAll returns the stemmed terms. To return the unstemmed forms, you can set the Stemming parameter to False in the TermGetAll action. See Stemming With Term Actions.
You can use the FieldText parameter of an action along with document metadata fields to restrict the documents returned. For example:
action=Query&FieldText=EQUAL{1}:autn_distincttermsperdoc
This action returns the documents that contribute to the count in the <autn:distincttermsperdoc logn="1"> tag.
action=Query&FieldText=EQUAL{10}:autn_termsperdoc
This action returns the documents that contribute to the count in the <autn:termsperdoc logn="10"> tag.
You can use this method to investigate potentially corrupted documents. For example, documents in the <autn:distincttermsperdoc lognx="20"> segment have roughly 2^20 (over one million) distinct terms. The Oxford English Dictionary second edition has approximately 600,000 definitions, so an English document in this segment has many more distinct terms than there are English words. This document might contain a high number of garbage terms, or numeric and alphanumeric terms.
At the other end, documents in the <autn:distincttermsperdoc lognx="0"> segment have only one distinct term. This segment might also indicate a problem, or missing data.
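The arithmetic behind these two cases, and the matching FieldText queries, can be sketched in Python as follows. The helper names are hypothetical, and the exact boundaries of each segment are not stated here, so 2 to the power of the segment number is treated only as the rough size that the text describes.

```python
OED_DEFINITIONS = 600_000  # approximate figure quoted in the text


def approx_distinct_terms(segment):
    """Rough number of distinct terms for a segment, taken as 2**segment."""
    return 2 ** segment


def segment_query(field, segment):
    """Build a Query action that selects the documents in one segment.

    Hypothetical helper; it follows the EQUAL{n}:field pattern shown above.
    """
    return f"action=Query&FieldText=EQUAL{{{segment}}}:{field}"


high = approx_distinct_terms(20)  # 1048576, more than OED_DEFINITIONS
low = approx_distinct_terms(0)    # 1, a single distinct term

print(high > OED_DEFINITIONS)
print(segment_query("autn_distincttermsperdoc", 20))
print(segment_query("autn_distincttermsperdoc", 0))
```

Running the two generated queries returns the documents at each extreme, which can then be inspected for garbage terms or missing data.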