Document Sections
In many situations when indexing documents into IDOL Server, it is advantageous to break long documents up into sections. Typically, section breaking is done during the Import process by the Connectors so that the IDX subsequently indexed into IDOL Server might consist of multiple sections for each document.
The aim of section breaking is to create sections where the textual content is a prescribed number of characters in length (typically around 5 kB). In practice, the Connectors attempt to break the documents at the end of the paragraph or page closest to the desired number of characters.
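The break-point rule described above can be sketched in a few lines of Python. This is only an illustration, not the Connectors' actual implementation: the function name break_into_sections, the use of blank lines as paragraph boundaries, and the default target length are all assumptions made for the example.

```python
def break_into_sections(text: str, target_len: int = 5000) -> list[str]:
    """Split text into sections of roughly target_len characters.

    Breaks happen only at paragraph boundaries (assumed here to be blank
    lines), choosing the boundary closest to the target length.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sections: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) < target_len:
            # Still short of the target: keep accumulating paragraphs.
            current = candidate
            continue
        # The candidate reaches or passes the target. Compare breaking
        # before this paragraph with breaking after it.
        overshoot = len(candidate) - target_len
        undershoot = target_len - len(current)
        if current and undershoot <= overshoot:
            sections.append(current)   # break before this paragraph
            current = para
        else:
            sections.append(candidate)  # break after this paragraph
            current = ""
    if current:
        sections.append(current)
    return sections
```

Because the sketch never splits inside a paragraph, a single very long paragraph still produces a section longer than the target.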
When IDOL Server indexes a sectioned document, each section is effectively an entirely separate document, with its own docid (or autn:id in the XML response of a query). To indicate that the sections all derive from the same original document, each section has the same baseid (or autn:baseid), which corresponds to the docid of the first section.
NOTE: All sections of a document have the same values of all fields, except for the section-broken DRECONTENT field itself.
The maximum number of sections that a single document can be broken into is 65,535.
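As a rough model of the relationship between docid and baseid, the following Python sketch builds one record per section. The Section class and the index_sections function are hypothetical, and the assumption that sections receive consecutive docid values is made only for illustration. IDOL Server assigns the real identifiers.

```python
from dataclasses import dataclass

MAX_SECTIONS = 65_535  # limit stated above


@dataclass
class Section:
    docid: int    # unique per section
    baseid: int   # docid of the first section of the same document
    fields: dict  # identical for every section of the document
    content: str  # the section-broken DRECONTENT value


def index_sections(first_docid: int, fields: dict, contents: list[str]) -> list[Section]:
    """Model how the sections of one document share fields and a baseid."""
    if len(contents) > MAX_SECTIONS:
        raise ValueError(f"a document cannot have more than {MAX_SECTIONS} sections")
    return [
        # Assumption: docids are consecutive, starting at first_docid.
        Section(docid=first_docid + i, baseid=first_docid,
                fields=dict(fields), content=content)
        for i, content in enumerate(contents)
    ]
```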
IDX Format
In IDX format, the SectionBreakType field indicates that a document is one of multiple sections. In IDX, this is almost always the DRESECTION field.
#DREREFERENCE Reference123
#DRETITLE Title of the document
#DRESECTION 0
#DRECONTENT
The first part of the content of the document...
#DREENDDOC

#DREREFERENCE Reference123
#DRETITLE Title of the document
#DRESECTION 1
#DRECONTENT
The second part of the content of the document...
#DREENDDOC
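If you generate sectioned IDX yourself, a helper along the following lines could write one record per section. This is a minimal sketch: the function name to_idx is hypothetical, it uses only the fields shown in the example above, and the exact layout (content on the line after #DRECONTENT, a blank line between records) is an assumption.

```python
def to_idx(reference: str, title: str, sections: list[str]) -> str:
    """Render each section as its own IDX record.

    Every record repeats the reference and title. Only the DRESECTION
    number and the content differ between records.
    """
    records = []
    for number, content in enumerate(sections):
        records.append("\n".join([
            f"#DREREFERENCE {reference}",
            f"#DRETITLE {title}",
            f"#DRESECTION {number}",
            "#DRECONTENT",  # assumed layout: content follows on the next line
            content,
            "#DREENDDOC",
        ]))
    return "\n\n".join(records) + "\n"
```

For example, to_idx("Reference123", "Title of the document", [first_part, second_part]) yields two records numbered 0 and 1 that share the same reference and title.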
The Effect of Sections on Queries
Section breaking can change the results from different queries. The following examples show the different results for two identical IDOL Servers that contain identical documents, but where Server 1 does not have section breaking, and Server 2 does have section breaking.
Example 1
A long document contains the word match throughout its content. The following query:
action=Query&Text=Match
returns one result from Server 1, but multiple results from Server 2 (each corresponding to one section).
TIP: To return only the highest-placed section from each document in Server 2, set the Combine parameter to Simple.
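The query string for such a request can be assembled programmatically. The sketch below uses only Python's standard library. The function name build_query and its combine_sections flag are hypothetical, and how you send the resulting string to your server (host, port, client library) is left out.

```python
from urllib.parse import urlencode


def build_query(text: str, combine_sections: bool = False) -> str:
    """Build the parameter string for a Query action.

    When combine_sections is True, Combine=Simple is added so that only
    the highest-placed section of each document is returned.
    """
    params = {"action": "Query", "Text": text}
    if combine_sections:
        params["Combine"] = "Simple"
    # urlencode escapes spaces as "+", as in Text=alpha+AND+zulu.
    return urlencode(params)


print(build_query("Match", combine_sections=True))
```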
Example 2
A long document contains the word alpha at the start, and zulu at the end. The following query:
action=query&Text=alpha+AND+zulu
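matches the document only in Server 1. In Server 2, the two words fall in different sections, so no single section satisfies the AND condition.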
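The effect in this example can be reproduced with a toy model. The code below is not how IDOL Server evaluates queries. It treats each indexed item as a bag of words and assumes the document is split into two halves, purely to show why an AND across distant terms can fail once a document is sectioned.

```python
def matches_all(text: str, terms: list[str]) -> bool:
    """Return True if every term occurs as a word in text (toy AND match)."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


# A "long" document with alpha at the start and zulu at the end.
document = "alpha " + "filler " * 10 + "zulu"

# Server 1 indexes the document whole. Server 2 indexes two sections.
words = document.split()
middle = len(words) // 2
server1 = [document]
server2 = [" ".join(words[:middle]), " ".join(words[middle:])]

terms = ["alpha", "zulu"]
hits_server1 = [item for item in server1 if matches_all(item, terms)]
hits_server2 = [item for item in server2 if matches_all(item, terms)]

print(len(hits_server1), len(hits_server2))
```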
Example 3
A long document talks about alligators on the first page, and zebras on the last. The following Suggest action, which finds documents similar to a separate document about zebras:
action=Suggest&Reference=document_about_zebras
matches only the final section of the document in Server 2. In Server 1, the document either does not match or matches only with a very low weight. This is because sectioning documents enables IDOL Server to obtain a more tightly defined view of what each part of a document is about.
Recommendations
Except in strict keyword-matching systems such as e-discovery, you can usually obtain better query performance by turning section breaking on, and Micro Focus recommends this as the default option.
The disadvantage of sectioning is that the total number of documents in a server might be significantly larger if sectioning is on, which in turn can affect query performance and increase memory usage.
Section Breaking of XML
You can also use section breaking for long fields in XML content. However, unlike for IDX, the section breaking occurs during indexing into IDOL Server. You can control section breaking by using the following configuration setting:
[SectionBreaking]
MaxSectionLength=5000
Unlike for IDX, a query on metadata fields returns only the section that contains those metadata fields.