Extract

This option extracts entities from a document. It can write the output to a file or print it to the console. You can use this option to test your grammars.

You can use wildcard expressions in the -e and -g parameters; see Wildcard Expressions in edktool for more information.

Redact Extraction Results

You can enable redaction on extracted matches in edktool either by setting RedactedOutput to True in the edktool configuration file, or by specifying a redaction file using the -r parameter at the command line. Note that edktool only performs redaction on fields that you have configured as IDOL search fields.

If you have specified an IDX file to perform extraction on, existing fields are preserved in their unredacted form, and a redacted copy of each search field is added to the IDX file, with _REDACTED appended to the original field name. For example:

#DREREFERENCE 1
#DREFIELD DRECONTENT_REDACTED="The driver ########## was questioned."
#DRECONTENT
The driver Joe Bloggs was questioned.
#DREENDDOC

If you have specified a plaintext file to perform extraction on, the entities identified as matches by edktool are redacted from the input text to form the redacted output. For example:

Input:

The driver Joe Bloggs was questioned.

Output:

The driver ########## was questioned.
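The two redaction examples above can be mimicked with a short Python sketch. This is not how edktool is implemented. It assumes one # per matched character (inferred from the Joe Bloggs example, where ten characters become ten # symbols). The function names redact and add_redacted_fields are hypothetical.

```python
def redact(text, matches, mask="#"):
    """Replace every occurrence of each matched string with mask characters.

    Assumption: one mask character per matched character, as in the
    example above. This is an illustration, not edktool's implementation.
    """
    for match in matches:
        text = text.replace(match, mask * len(match))
    return text


def add_redacted_fields(fields, matches):
    """Keep each field unredacted and add a <name>_REDACTED copy of it,
    mirroring the IDX behavior described above (hypothetical helper)."""
    result = dict(fields)
    for name, value in fields.items():
        result[name + "_REDACTED"] = redact(value, matches)
    return result


if __name__ == "__main__":
    sentence = "The driver Joe Bloggs was questioned."
    print(redact(sentence, ["Joe Bloggs"]))
    print(add_redacted_fields({"DRECONTENT": sentence}, ["Joe Bloggs"]))
```

The first call prints the plain text form. The second keeps DRECONTENT unchanged and adds a DRECONTENT_REDACTED entry, as in the IDX example.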

Eduction sends redacted output to the file specified in the -r parameter. If you do not specify this parameter but you have enabled redaction in the configuration file, Eduction displays redacted output in the console after the list of matches, unless you have specified the -q parameter at the command line to enable Quiet mode. In Quiet mode, redacted output does not display in the console.
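The rules in the previous paragraph can be read as a small decision function. The sketch below only restates that paragraph. The function name and its return values ("file", "console", None) are hypothetical and are not part of edktool.

```python
def redacted_output_destination(redaction_file=None,
                                redacted_output_in_config=False,
                                quiet=False):
    """Return where redacted output goes, per the description above.

    redaction_file            -- value passed to -r, if any
    redacted_output_in_config -- True if RedactedOutput is True in the config
    quiet                     -- True if -q was given
    """
    if redaction_file:
        # -r was given: redacted output is sent to that file.
        return "file"
    if redacted_output_in_config and not quiet:
        # Redaction enabled in the configuration only: shown after the matches.
        return "console"
    # Either redaction is not enabled, or Quiet mode suppresses it.
    return None
```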

-l <licensefile>

The file containing a valid license key for Eduction.

If you do not specify a license key at the command line, edktool assumes that the location of the license file is licensekey.dat. If the license is kept in this location, you do not need to specify this parameter.

-i <inputfile>

The file to perform entity extraction on. The input file can be an IDOL IDX file, an IDOL XML file, or a plain text file. It must be UTF-8 encoded.

NOTE:

If the input file is an XML file, the configuration file (in either IDOL configuration file format or XML format) must contain entries for the DocumentDelimiterCSVs parameter. If this setting is not correct, Eduction might not find any documents in the XML file. For information on how to set this option, refer to the Eduction Parameters.

-c <configfile>

A configuration file controlling the extraction. The configuration file can be either an IDOL Server style .CFG configuration file or an XML configuration file. See Configuration Files for Eduction Settings.

You can specify one or more grammar files and one or more entities in place of a configuration file. Specifying a configuration file overrides the grammar or entity parameters.

-g <grammarfile>

A grammar file to use when -c is not used.

If you provide a grammar file but do not specify any entities with -e, Eduction extracts all entities in the grammar file.

-e <entity>

The entities to extract when -c is not used. Separate multiple entities with a comma.

-o <outputfile>

The file containing the results of the extraction. The content of the optional output file depends on the type of input file provided and whether the -m option is used.

If the input file type is an IDOL file and the -m option is not used, the output file is identical to the input file, except the matched entities are appended to each document as additional fields. This behavior is the same as Eduction running in IDOL.

If the input file is a plain text file or an IDOL file with the -m option, the output file is an XML file containing the matched entities.

If the input file is an IDOL file, the output file also contains document information.
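The way the output file depends on the input type and the -m option can be summarized in a few lines of Python. This sketch only paraphrases the three paragraphs above. The function name, the "idol" and "text" labels, and the returned descriptions are hypothetical.

```python
def output_file_kind(input_type, match_mode=False):
    """Describe the -o output for a given input type.

    input_type -- "idol" (IDX or IDOL XML) or "text" (plain text)
    match_mode -- True if -m was given
    """
    if input_type not in ("idol", "text"):
        raise ValueError("input_type must be 'idol' or 'text'")
    if input_type == "idol" and not match_mode:
        # Same behavior as Eduction running in IDOL.
        return "input documents with matches appended as fields"
    if input_type == "idol":
        # IDOL input with -m: XML matches plus document information.
        return "XML match list with document information"
    # Plain text input.
    return "XML match list"
```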

-m

Produce match results for IDOL input files.

-q

(Optional) Sets “Quiet Mode” so that descriptive messages and redacted output are removed, and the output consists of the XML matchlist only (that is, an XML document with all the matches and any configured metadata).

-r <redaction_file>

A copy of the input file, with all matches redacted. For example, if you specified an IDX input file, the content is sent to the redaction file as follows, with the redactions made in place:

#DREREFERENCE 1
#DRECONTENT
The driver ########## was questioned.
#DREENDDOC

-p

Set this parameter if you want to use a plaintext grammar file rather than an XML grammar file as the input text to extract from.

The extract option requires an input file (in IDOL IDX, IDOL XML, or plain text format) and either a configuration file or a grammar file. If you do not provide a configuration file, edktool searches the input file for any specified entities in the specified grammar (or all entities, if none are specified). For example, in the simplest command line:

C:\>edktool e -i myData.txt -g grammar1.ecr,grammar2.ecr

edktool is invoked with no configuration file. It uses the command-line arguments to process the data file myData.txt with the grammar files grammar1.ecr and grammar2.ecr. Eduction identifies all the entities in the two grammar files, and matches on these. The output is sent to the console in XML format, identifying matches in the data file and using the entity names to generate field names for the matches that contain the matched data. Assuming myData.txt is a plain text file, the entire body of the file is matched.
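If you script edktool, it can help to assemble the argument list in one place. The following Python sketch builds the arguments for the extract option using only the parameters described in this section. The helper name build_extract_command is hypothetical, and it assumes edktool is on the path. The helper does not run edktool.

```python
def build_extract_command(input_file, config=None, grammars=(), entities=(),
                          output=None, redaction_file=None, license_file=None,
                          match_mode=False, quiet=False):
    """Build an argument list for 'edktool e' (hypothetical helper)."""
    args = ["edktool", "e", "-i", input_file]
    if license_file:
        args += ["-l", license_file]
    if config:
        # A configuration file overrides the grammar and entity parameters,
        # so -g and -e are left out here.
        args += ["-c", config]
    else:
        if not grammars:
            raise ValueError("provide a configuration file or a grammar file")
        args += ["-g", ",".join(grammars)]
        if entities:
            # Multiple entities are separated with a comma.
            args += ["-e", ",".join(entities)]
    if output:
        args += ["-o", output]
    if redaction_file:
        args += ["-r", redaction_file]
    if match_mode:
        args.append("-m")
    if quiet:
        args.append("-q")
    return args


if __name__ == "__main__":
    # Reproduces the simplest command line shown above.
    print(" ".join(build_extract_command(
        "myData.txt", grammars=["grammar1.ecr", "grammar2.ecr"])))
```

The returned list can be passed to subprocess.run if you want to invoke the tool. Building a list rather than a single string avoids quoting problems with file names that contain spaces.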

You can also specify the -p parameter at the command line to extract matches from a plaintext grammar file.

The plaintext grammar file must be in the format described in Plaintext Grammar File Format.

