Document Sections

In many situations when indexing documents into IDOL Server, it is advantageous to break long documents up into sections. Typically, section breaking is done during the Import process by the Connectors so that the IDX subsequently indexed into IDOL Server might consist of multiple sections for each document.

The aim of section breaking is to create sections where the textual content is a prescribed number of characters in length (typically around 5 kB). In practice, the Connectors attempt to break the documents at the end of the paragraph or page closest to the desired number of characters.
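The breaking rule described above can be illustrated with a minimal Python sketch. This is not the Connectors' actual algorithm: the function name split_into_sections, the use of blank lines as paragraph boundaries, and the greedy "closest to the target" rule are all assumptions made for illustration, and page breaks are ignored.

```python
def split_into_sections(text, target_chars=5000):
    """Greedily group paragraphs into sections of roughly target_chars characters.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sections = []
    current = []      # paragraphs collected for the section being built
    current_len = 0   # length of that section, including separators

    for para in paragraphs:
        # Characters this paragraph would add (plus the blank-line separator).
        added = len(para) + (2 if current else 0)
        stop_here = abs(current_len - target_chars)
        keep_going = abs(current_len + added - target_chars)
        if current and stop_here <= keep_going:
            # Ending the section before this paragraph is closer to the target.
            sections.append("\n\n".join(current))
            current, current_len = [para], len(para)
        else:
            current.append(para)
            current_len += added

    if current:
        sections.append("\n\n".join(current))
    return sections
```

With five 40-character paragraphs and a target of 100 characters, this sketch produces three sections, and each break falls on a paragraph boundary rather than in the middle of a paragraph.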

When IDOL Server indexes a sectioned document, each section is effectively an entirely separate document, with its own docid (or autn:id in the XML response of a query). To indicate that the sections all derive from the same original document, each section has the same baseid (or autn:baseid), which corresponds to the docid of the first section.

NOTE:

All sections of a document have the same values of all fields, except for the section-broken DRECONTENT field itself.

The maximum number of sections that a single document can be broken into is 65,535.

IDX Format

In IDX format, the SectionBreakType field indicates that a document is one of multiple sections. This is almost always the DRESECTION field.

#DREREFERENCE Reference123
#DRETITLE Title of the document
#DRESECTION 0
#DRECONTENT
The first part of the content of the document...
#DREENDDOC
#DREREFERENCE Reference123
#DRETITLE Title of the document
#DRESECTION 1
#DRECONTENT
The second part of the content of the document...
#DREENDDOC
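As a rough illustration of how a list of section texts maps onto the IDX layout shown above, the following Python sketch writes one IDX record per section. The function name sections_to_idx is hypothetical, and the sketch emits only the fields that appear in the example. In practice, the Connectors produce the IDX.

```python
def sections_to_idx(reference, title, sections):
    """Return IDX text with one record per section (hypothetical helper).

    Every record repeats the reference and title. Only the DRESECTION
    number and the DRECONTENT text differ between records.
    """
    records = []
    for number, content in enumerate(sections):
        records.append(
            f"#DREREFERENCE {reference}\n"
            f"#DRETITLE {title}\n"
            f"#DRESECTION {number}\n"
            "#DRECONTENT\n"
            f"{content}\n"
            "#DREENDDOC\n"
        )
    return "".join(records)


idx_text = sections_to_idx(
    "Reference123",
    "Title of the document",
    ["The first part of the content of the document...",
     "The second part of the content of the document..."],
)
```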

The Effect of Sections on Queries

Section breaking can change the results from different queries. The following examples show the different results for two identical IDOL Servers that contain identical documents, but where Server 1 does not have section breaking, and Server 2 does have section breaking.

Example 1

A long document contains the word match throughout its content. The following query:

action=Query&Text=Match

returns one result from Server 1, but multiple results from Server 2 (each corresponding to one section).

TIP:

To return only the highest-placed section from each document in Server 2, set the Combine parameter to Simple.
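For example, a client might build the query with the Combine parameter as in the following Python sketch. The helper name build_query_url is hypothetical, and the host and port are placeholders that you would replace with the values for your own IDOL Server.

```python
from urllib.parse import urlencode


def build_query_url(text, combine=None, host="localhost", port=9000):
    """Build a Query action URL (hypothetical helper; host and port are placeholders)."""
    params = {"action": "Query", "Text": text}
    if combine is not None:
        # Combine=Simple returns only the highest-placed section per document.
        params["Combine"] = combine
    return f"http://{host}:{port}/{urlencode(params)}"


url = build_query_url("Match", combine="Simple")
```

The sketch only constructs the request string. It does not send the request or parse the response.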

Example 2

A long document contains the word alpha at the start, and zulu at the end. The following query:

action=query&Text=alpha+AND+zulu

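matches the document only in Server 1. In Server 2, no single section contains both alpha and zulu, so no section satisfies the AND condition.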
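The effect in this example can be reproduced with a toy Python model, in which a unit (a whole document or a single section) matches only if it contains every query term. This is a simplification for illustration, not how IDOL Server evaluates queries. The function names and sample text are hypothetical.

```python
def matches_all(text, terms):
    """Return True if every term appears as a whole word in the text."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


def matching_units(units, terms):
    """Return the indexes of the units that contain all of the terms."""
    return [i for i, unit in enumerate(units) if matches_all(unit, terms)]


# Server 1: the document is indexed as a single unit.
server1 = ["alpha starts the document and zulu ends the document"]
# Server 2: the same text is indexed as separate sections.
server2 = ["alpha starts the document", "and", "zulu ends the document"]

hits1 = matching_units(server1, ["alpha", "zulu"])
hits2 = matching_units(server2, ["alpha", "zulu"])
```

In this model the unsectioned document matches, while none of the individual sections do, because alpha and zulu never occur in the same section.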

Example 3

A long document talks about alligators on the first page, and zebras on the last. The following Suggest action, which finds documents similar to a separate document about zebras:

action=Suggest&Reference=document_about_zebras

matches the final section of the document in Server 2. In Server 1, it either does not match the document, or matches it only with a very low weight. This is because sectioning documents enables IDOL Server to obtain a more tightly defined view of what each part of a document is about.

Recommendations

Except in strict keyword-matching systems such as e-discovery, you can usually obtain better query results by turning section breaking on, and Micro Focus recommends this as the default option.

The disadvantage of sectioning is that the total number of documents in a server might be significantly larger if sectioning is on, which in turn can affect query performance and increase memory usage.

Section Breaking of XML

You can also use section breaking for long fields in XML content. However, unlike for IDX, the section breaking occurs during indexing into IDOL Server. You can control section breaking by using the following configuration setting:

[SectionBreaking]
MaxSectionLength=5000

Unlike for IDX, a query on metadata fields returns only the section that contains those metadata fields.

