Find Repeated Matches
A repeated match occurs when Eduction finds another match with exactly the same text at a different location (a different offset) in the input. For example, if you are searching some input text for telephone numbers, Eduction could find a match "564123". If the same text occurs later in the document, that occurrence is a repeated match. Repeated matches might belong to the same entity, but this is not always the case because multiple entities can match the same text.
TIP: Eduction does not return repeated matches if you configure the engine with EnableUniqueMatches set to TRUE.
Eduction normally returns matches in the order that they appear in the input, but you might prefer to process a match, followed by all of its repeated matches, and then return to the next unique match. In extreme cases, where the matched text is repeated many times, this provides a convenient way to stop processing and move on to the next unique match or maybe even the next document.
Each Eduction API provides a way to find the next repeated match. The SDK includes sample programs, in each language, that demonstrate this functionality.
NOTE: This feature is not supported for streaming input or in table mode.
C API
In the C API, instead of calling EdkGetNextMatch, you can call EdkGetNextRepeatedMatch.
If the input contains another match with the same text as the current match, you can then call EdkGetRepeatedMatchByteOffset and EdkGetRepeatedMatchCodepointOffset to establish the location of the repeated match.
You can call EdkGetNextRepeatedMatch repeatedly. If the matched text does not occur again, Eduction returns EdkNoMatch and you can proceed to the next unique match by calling EdkGetNextMatch.
Any repeated matches that you access using EdkGetNextRepeatedMatch are not returned by subsequent calls to EdkGetNextMatch. By using EdkGetNextRepeatedMatch you are changing the order in which the matches are returned.
The following code sample demonstrates how you might use these functions. For more information about these functions, refer to the API reference documentation.
while (EdkGetNextMatch(session) == EdkSuccess)
{
    // call match accessors and do something with the information
    while (EdkGetNextRepeatedMatch(session) == EdkSuccess)
    {
        size_t nRepCodepointOffset = 0;
        size_t nRepByteOffset = 0;
        EdkGetRepeatedMatchCodepointOffset(session, &nRepCodepointOffset);
        EdkGetRepeatedMatchByteOffset(session, &nRepByteOffset);
        // do something with the offsets...
    }
}
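To make the change in ordering concrete, the following is a minimal, self-contained simulation. It does not call Eduction at all. SimMatch and grouped_order are hypothetical names invented for this illustration, and the sketch assumes that repeats are identified purely by comparing matched text, as described above. It records the order in which offsets would be visited when each unique match is followed immediately by its repeats, and repeats that have already been visited are skipped later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a match; this is NOT an Eduction type. */
typedef struct {
    const char *text;   /* matched text */
    size_t offset;      /* position of the match in the input */
    int consumed;       /* set to 1 once visited as a repeated match */
} SimMatch;

/*
 * Simulates the "match, then all of its repeats" traversal.
 * `matches` must be sorted by offset (document order).
 * Writes the visited offsets into `order` (capacity >= n)
 * and returns how many offsets were written.
 */
size_t grouped_order(SimMatch *matches, size_t n, size_t *order)
{
    size_t count = 0;
    assert(matches != NULL && order != NULL);

    for (size_t i = 0; i < n; ++i) {
        if (matches[i].consumed) {
            continue;   /* already visited as a repeat of an earlier match */
        }
        order[count++] = matches[i].offset;   /* next unique match */

        /* Visit every later match with the same text straight away. */
        for (size_t j = i + 1; j < n; ++j) {
            if (!matches[j].consumed &&
                strcmp(matches[j].text, matches[i].text) == 0) {
                matches[j].consumed = 1;
                order[count++] = matches[j].offset;
            }
        }
    }
    return count;
}
```

For example, given matches "564123" at offset 10, "998877" at offset 40, and "564123" again at offset 75, the simulated traversal visits offsets 10, 75, 40 rather than the document order 10, 40, 75.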
Java API
The repeated matches functionality in the Java API is similar to the C API. You can iterate over repeated matches as shown in the following code sample. You can call the getByteOffset() and getCodepointOffset() methods to establish the position of a repeated match.
for (EDKMatch match : session)
{
    // call match accessors and do something with the information
    for (EDKRepeatedMatchOffset repeat : match)
    {
        long byteOffset = repeat.getByteOffset();
        long codepointOffset = repeat.getCodepointOffset();
        // do something with the offsets
    }
}
.NET API
The repeated matches functionality in the .NET API is similar to the C API. You can iterate over repeated matches as shown in the following code sample. The properties ByteOffset and CodepointOffset provide the position of a repeated match.
foreach (IExtractionMatch match in session)
{
    // call match accessors and do something with the information
    foreach (IExtractionRepeatedMatch repeat in match.RepeatedMatches)
    {
        long byteOffset = repeat.ByteOffset;
        long codepointOffset = repeat.CodepointOffset;
        // do something with the offsets
    }
}