Introduction
Pre-filtering allows you to reduce the amount of input text that Eduction processes for a particular set of entities. With pre-filtering, Eduction performs an initial quick matching step that finds sections of text that contain likely matches, and then runs the full match only on those sections, rather than on the whole input.
Pre-filtering can improve performance for entities where a broad, inexpensive test can locate potential matches without either selecting too much of the input text or excluding valid matches.
The quick matching step can match text by using either a regular expression (regex) that you configure or a dictionary of terms.
For example:
- To match addresses, you can use regex pre-filtering to find numbers in the text (which might correspond to house numbers or postal codes).
- To match names in CJKVT languages (where there is a regular set of surnames, and the Eduction grammars do not attempt to find values that are not already listed as names), you might use a dictionary pre-filter. In this case you perform a quick match to find the surnames, and then the full match finds the full name.
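The two-stage idea behind these examples can be sketched in Python. This is a conceptual illustration only, not Eduction's implementation or configuration syntax: the function names, the window size, and the stand-in "full match" patterns are all assumptions made for the example.

```python
import re

def prefilter_windows(text, quick_pattern, window=40):
    """Return merged (start, end) spans of text around each quick match."""
    spans = []
    for m in re.finditer(quick_pattern, text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous window: extend it instead.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans

def match_with_prefilter(text, quick_pattern, full_pattern, window=40):
    """Run the (expensive) full pattern only inside the pre-filtered windows."""
    results = []
    for start, end in prefilter_windows(text, quick_pattern, window):
        for m in re.finditer(full_pattern, text[start:end]):
            results.append(m.group(0))
    return results

def dictionary_pattern(terms):
    """Build a quick pattern from a list of terms (a toy dictionary pre-filter)."""
    # Longest terms first so that longer entries win over their prefixes.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))

# Hypothetical stand-ins: a real address or name grammar is far more complex.
QUICK_NUMBER = r"\d+"
FULL_ADDRESS = r"\d+ [A-Z][a-z]+ (?:Street|Road|Avenue)"
FULL_NAME = r"(?:欧阳|王|李)[\u4e00-\u9fff]{1,2}"

text = ("We met last spring. " * 50
        + "Send it to 221 Baker Street, London. "
        + "Nothing else to report. " * 50)

# Regex pre-filter: only the text near a number is passed to the full match.
print(match_with_prefilter(text, QUICK_NUMBER, FULL_ADDRESS))
# prints ['221 Baker Street']

# Dictionary pre-filter: find a listed surname first, then match the full name.
surnames = dictionary_pattern(["王", "李", "欧阳"])
print(match_with_prefilter("会议由王小明主持。", surnames, FULL_NAME, window=4))
```

In this sketch the full pattern never sees the long runs of text that contain no number or no listed surname, which is where the saving would come from. The window must be wide enough to contain a complete entity, otherwise valid matches are cut off.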
Pre-filtering is less useful for entities that can match a very wide range of words, where no simple regular expression or dictionary of terms covers all the possible matches. For example, for English names the Eduction grammars attempt to match plausible names as well as recognized ones, so there is no way to pre-filter without eliminating potential matches.
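One rough way to judge whether a candidate pre-filter suits an entity is to measure, on sample text, how much of the input it keeps and how many full matches survive. The following Python sketch does this for a toy "plausible name" pattern. The function name, the window size, and both patterns are assumptions made for the illustration, and do not represent Eduction grammars or settings.

```python
import re

def evaluate_prefilter(text, quick_pattern, full_pattern, window=40):
    """Return (fraction of text kept, fraction of full matches that survive)."""
    # Mark every character that falls inside a window around a quick match.
    kept = [False] * len(text)
    for m in re.finditer(quick_pattern, text):
        lo = max(0, m.start() - window)
        hi = min(len(text), m.end() + window)
        for i in range(lo, hi):
            kept[i] = True
    hits = list(re.finditer(full_pattern, text))
    # A hit survives only if all of its characters lie inside a kept region.
    surviving = [h for h in hits if all(kept[h.start():h.end()])]
    coverage = sum(kept) / len(text) if text else 0.0
    recall = len(surviving) / len(hits) if hits else 1.0
    return coverage, recall

# Toy stand-ins: "plausible name" = two capitalized words; dictionary = known first names.
PLAUSIBLE_NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"
KNOWN_FIRST_NAMES = r"Alice|Bob"

sample = "Please contact Alice Smith or Zorvath Quillon today."
coverage, recall = evaluate_prefilter(sample, KNOWN_FIRST_NAMES, PLAUSIBLE_NAME, window=6)
print(coverage, recall)
```

Here the dictionary pre-filter keeps "Alice Smith" but drops the unlisted "Zorvath Quillon", so only half of the full matches survive. A pre-filter is a good fit when it keeps a small fraction of the text while still retaining all the matches.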