Extract Entities from Tables

Eduction Table mode allows you to extract entities from a table, according to the values in the header of that table. This process allows you to target extraction on likely values in structured data, rather than extracting every possible entity value from a table. It can also improve the confidence that an ambiguous entity value corresponds to a particular type of data.

In standard extraction, Eduction searches text for a value that matches a particular entity. In many cases the entity values are distinctive, and so you can be reasonably confident that matches are relevant. For example, a string that matches an address entity is unlikely to be anything else.

Many other entity values are potentially ambiguous. For example, a number might match several entity types, and a date might be a date of birth or an event date. Without further information, it is difficult to determine whether these values are useful.

For unstructured text, you can use landmarks to find relevant information. Landmarks are values that identify a particular entity, without being a part of the entity value. For example, the phrase Date of Birth is a landmark. When a document contains the value Date of Birth: 06/07/80, it is highly likely that the date is a date of birth, and you can treat the data accordingly.
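The landmark idea can be illustrated outside Eduction with a short Python sketch. The regular expression and the function name find_dates_of_birth are hypothetical and exist only for this illustration. They are not an Eduction grammar and do not reflect how Eduction matches landmarks internally.

```python
import re

# Hypothetical landmark pattern: the phrase that identifies the value,
# followed by a simple numeric date. This is not an Eduction grammar.
LANDMARK_DATE = re.compile(
    r"Date of Birth\s*:?\s*(?P<date>\d{1,2}/\d{1,2}/\d{2,4})",
    re.IGNORECASE,
)


def find_dates_of_birth(text: str) -> list[str]:
    """Return only the dates that directly follow the landmark phrase."""
    return [m.group("date") for m in LANDMARK_DATE.finditer(text)]


sample = "Date of Birth: 06/07/80. Appointment booked for 12/01/24."
print(find_dates_of_birth(sample))  # ['06/07/80']
```

The second date in the sample is ignored because no landmark precedes it, which is the behavior that makes landmarks useful for ambiguous values.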

NOTE: The IDOL PII Package, IDOL PHI Package, and IDOL PCI Package provide landmark entities in most grammars. To extract entities from tables with the Eduction standard grammar files, you might need to create your own landmark entities.

For structured data, it is less likely that the landmark occurs next to the entity. You might have the value Date of Birth in a table heading, and the actual date values in the rows below. In this case, you can use table extraction to extract the values that correspond to the landmark.

Configure Table Extraction

In table extraction, you define an entity or entities that you want to detect in the header row, and entities that you want to detect in the cells under that header. When Eduction matches one of these entities in the header row of a table, it attempts to extract the corresponding cell entities from the cells in that column.

To configure these entities, use the HeaderEntityN and CellEntityN configuration parameters.

For example:

[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all

This example matches date of birth landmark values in the header, and for all subsequent rows in that column, it extracts any date values.
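To make the header-then-column logic concrete, the following Python sketch mimics it on a small CSV table. It is a conceptual illustration only: HEADER_PATTERN, CELL_PATTERN, and extract_from_table are hypothetical stand-ins for the header entity, the cell entity, and the extraction step, and they say nothing about how Eduction is implemented.

```python
import csv
import io
import re

# Hypothetical stand-ins for the header and cell entities; real Eduction
# grammars are far more thorough than these two patterns.
HEADER_PATTERN = re.compile(r"^\s*(date of birth|dob)\s*$", re.IGNORECASE)
CELL_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def extract_from_table(csv_text: str) -> list[tuple[int, str]]:
    """Return (row_number, value) pairs from columns whose header matches."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Keep only the columns whose header cell matches the landmark pattern.
    target_columns = [
        i for i, cell in enumerate(header) if HEADER_PATTERN.match(cell)
    ]
    matches = []
    for row_number, row in enumerate(body, start=1):
        for col in target_columns:
            if col < len(row):  # tolerate short rows
                for m in CELL_PATTERN.finditer(row[col]):
                    matches.append((row_number, m.group()))
    return matches


table = (
    "Name,Date of Birth,Joined\n"
    "A. Example,06/07/80,01/02/2015\n"
    "B. Example,23/11/1975,14/09/2020\n"
)
print(extract_from_table(table))  # [(1, '06/07/80'), (2, '23/11/1975')]
```

The dates in the Joined column also look like dates, but they are not returned, because that column's header does not match the header pattern.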

NOTE: You can specify multiple entities, either by providing a comma-separated list, or by using wildcard characters. In this case, if the table header matches any of the configured header entities, Eduction matches the cell content against any of the configured cell entities.

This option might be useful if you want to match a particular entity in multiple languages, or if you want to include a custom entity in addition to a standard one.

To use table extraction with Connector Framework Server (CFS) or IDOL NiFi Ingest, you can also add the EntityFieldN parameter. This parameter specifies the document field that CFS or NiFi writes the extracted entities to.

In this case, if you do not set EntityFieldN, Eduction uses the value of CellEntityN to create a default field name (the capitalized entity name, with /, *, and ? characters replaced with underscores).

For example, the following configuration sets the field name explicitly:

[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all
EntityField0=DATE_OF_BIRTH

NOTE: You cannot specify EntityFieldN for only some of your CellEntityN values; you must either use the default value for all, or set EntityFieldN for all.
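The default naming rule and the all-or-none restriction can be checked before you deploy a configuration. The following Python sketch is one way to do that. It assumes that "capitalized" means fully upper-cased, and the helper names default_field_name and check_entity_fields are hypothetical. Neither is part of Eduction, CFS, or NiFi, and the product might derive names differently in cases this sketch does not cover.

```python
import re


def default_field_name(cell_entity: str) -> str:
    """Upper-case the entity name and replace / * ? with underscores.

    Assumption: "capitalized" in the rule above means fully upper-cased.
    """
    return re.sub(r"[/*?]", "_", cell_entity).upper()


def check_entity_fields(settings: dict[str, str]) -> None:
    """Raise ValueError if EntityFieldN is set for only some CellEntityN."""
    cell_ids = {
        key[len("CellEntity"):] for key in settings if key.startswith("CellEntity")
    }
    field_ids = {
        key[len("EntityField"):] for key in settings if key.startswith("EntityField")
    }
    # Either no EntityFieldN at all, or exactly one per CellEntityN.
    if field_ids and field_ids != cell_ids:
        mismatched = sorted(cell_ids ^ field_ids)
        raise ValueError(
            f"EntityFieldN and CellEntityN indexes differ: {mismatched}"
        )


print(default_field_name("pii/date/nocontext/all"))  # PII_DATE_NOCONTEXT_ALL

check_entity_fields({
    "HeaderEntity0": "pii/date/dob/landmark/all",
    "CellEntity0": "pii/date/nocontext/all",
    "EntityField0": "DATE_OF_BIRTH",
})  # passes silently
```

A configuration that sets EntityField0 but leaves a second CellEntity1 without a matching EntityField1 would raise ValueError in this sketch, mirroring the restriction in the note above.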

These parameters are the same for extracting entities from CSV or TSV table files, and for structured table data in XML, such as the output from Media Server OCR. For structured XML tables, there is an additional parameter, TableCellPath, for CFS and IDOL NiFi Ingest. TableCellPath describes the structure of the XML to allow Eduction to find the cells. For more information, refer to the Connector Framework Server or NiFi Ingest documentation.

For the Eduction SDK, you do not need to configure TableCellPath, because you use functions to locate the cells.

NOTE: You cannot extract entities from structured XML data in Eduction Server or edktool. In these cases, you must use a CSV or TSV table file.

Run Table Extraction

After you configure table extraction, you can run Eduction as normal, with a CSV or TSV table file as input.