PCI Grammar Customization

In cases where you find that the PCI grammars miss particular matches in your input, you can customize them. This section describes the possible customizations.

The following grammars support customization:

  • name.ecr

  • name_cjkvt.ecr

NOTE: It is technically possible to extend any public entity in a PCI grammar, but doing so can involve a lot of work. If you want to extend an entity that is not in the following list, see Modify Other Grammars and Entities.

For each grammar that supports customization, you can customize the following entities:

  • name

    • pci/name/surname/nocontext/CC

    • pci/name/given_name/nocontext/CC

  • name_cjkvt

    • pci/name/surname/nocontext/latin/CC

    • pci/name/surname/nocontext/cjkvt/CC

    • pci/name/surname/nocontext/cjkvt_spaced/CC

    • pci/name/given_name/nocontext/latin/CC

    • pci/name/given_name/nocontext/cjkvt/CC

    • pci/name/given_name/nocontext/cjkvt_spaced/CC

In this list, CC means country code (for example: gb, us, nz). See Country and Language Support.
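As a rough sketch of how these paths map onto an extension file: in the examples later in this section, the leading pci/name portion of the path is the grammar name, and the remainder, including the country code, is the entity name. The following fragment assumes that name.ecr defines us entities in the same way as gb. The headword, score, and DTD path are placeholders for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DTD path is an assumption; point it at edk.dtd in your installation. -->
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="name.ecr"/>
   <!-- Grammar name: the first part of the path pci/name/surname/nocontext/CC. -->
   <grammar name="pci/name">
      <!-- Entity name: the rest of the path, with CC replaced by us. -->
      <entity name="surname/nocontext/us" extend="append" case="insensitive">
         <!-- Placeholder headword, for illustration only. -->
         <entry headword="Jobo" score="2.05"/>
      </entity>
   </grammar>
</grammars>
```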

You can use customizations to add entries that the existing entities do not match (such as unusual names). You might also use them if your data uses unusual separators or punctuation. The following sections provide examples of these changes.

TIP: When you customize an entity, you can either replace or extend the definition. For PCI grammars, OpenText recommends that you only extend the entity definitions.

If you replace an entity, you are likely to miss matches or reduce performance. In addition, existing definitions cover many match cases that you might not consider, so there is a lot of value in using these definitions as a base.

TIP: When you add names to the name list grammars, OpenText recommends that you use the following scores:

3.05 The most common names.
2.05 Less common, but still frequently used names.
1.05 Rare or uncommonly used names.
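
For example, the following sketch appends two given names to the gb entity, one scored as a less common name and one as a rare name. The headwords are invented placeholders, and the DTD path is assumed to match your installation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DTD path is an assumption; adjust it for your installation. -->
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="name.ecr"/>
   <grammar name="pci/name">
      <entity name="given_name/nocontext/gb" extend="append" case="insensitive">
         <!-- Less common, but still frequently used (placeholder name). -->
         <entry headword="Fobo" score="2.05"/>
         <!-- Rare or uncommonly used (placeholder name). -->
         <entry headword="Zibo" score="1.05"/>
      </entity>
   </grammar>
</grammars>
```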

Example 1: New Name and Custom Separator

The following example shows how to add a new given name and surname to an entity. It also shows how to declare patterns with custom separators. For example, if your input data contains unusual spacing or characters between entities, you can declare these in your entity extensions.

The following grammar definition extends name.ecr.

name_extended.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="name.ecr"/>
   <grammar name="pci/name">

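      <!-- Add a given name to the gb list of known given names. -->
      <entity name="given_name/nocontext/gb" extend="append" case="insensitive">
         <entry headword="Fobo" score="2"/>
      </entity>

      <!-- Add a surname to the gb list of known surnames. -->
      <entity name="surname/nocontext/gb" extend="append" case="insensitive">
         <entry headword="Jobo" score="2"/>
      </entity>

      <!-- Match a surname, then @@, then a given name (for example, Jobo@@Fobo). -->
      <entity name="gb" extend="append">
         <pattern>(?A=SURNAME:(?A:surname/nocontext/gb))@@(?A=FORENAME:(?A:given_name/nocontext/gb))</pattern>
      </entity>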

   </grammar>
</grammars>

This declaration makes two changes:

  • It adds new entries for given_name and surname. This change allows Fobo Jobo to match as a name for the gb entity.

  • It declares a new pattern for the gb entity, to match a name in reverse order, with the elements separated by a custom separator (two @ symbols). This change allows Jobo@@Fobo to match as a name.

TIP: The grammar already handles hyphenated known names. For example, after this definition change, Eduction matches Fobo-Fobo Jobo with a score of 1, with no further changes required. You do not need to add hyphenated entries to the given_name/nocontext or surname/nocontext entities.

Example 2: New Names for CJKVT Grammar

The following example adds new CJKVT and Latin names, and adds a tab as a custom separator.

Example extension XML:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars version="4.0">
   <include path="name_cjkvt.ecr"/>
   <grammar name="pci/name">
      <!-- Add a given name and a surname to the CJKVT name lists for Japan. -->
      <entity name="given_name/nocontext/cjkvt/jp" extend="append">
         <entry headword="亮美" score="1.05"/>
      </entity>

      <entity name="surname/nocontext/cjkvt/jp" extend="append">
         <entry headword="電話" score="1.05"/>
      </entity>

      <!-- Add a given name and a surname to the Latin name lists for Japan. -->
      <entity name="given_name/nocontext/latin/jp" extend="append">
         <entry headword="Fobo" score="2.05"/>
      </entity>

      <entity name="surname/nocontext/latin/jp" extend="append">
         <entry headword="Jobo" score="2.05"/>
      </entity>

      <!-- Match a CJKVT surname, then a tab, then a CJKVT given name. -->
      <entity name="jp" extend="append">
         <pattern>(?A=SURNAME:(?A:surname/nocontext/cjkvt/jp))\t(?A=FORENAME:(?A:given_name/nocontext/cjkvt/jp))</pattern>
      </entity>
   </grammar>
</grammars>

This declaration makes two changes: 

  • It extends the lists of known CJKVT and Latin names for Japan, allowing 電話亮美 to match as a CJKVT full name, and Fobo Jobo to match as a Latin name.

  • It adds a new full name format, allowing tab-separated surname+given name to match.
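
The pattern in this example references only the CJKVT name entities. If your data also contains tab-separated Latin names, a similar pattern might reference the Latin entities instead. The following is a sketch under that assumption. It reuses the entity names from the list at the start of this section and takes the grammar name pci/name from those entity paths. It has not been validated against the shipped grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DTD path is an assumption; adjust it for your installation. -->
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars version="4.0">
   <include path="name_cjkvt.ecr"/>
   <grammar name="pci/name">
      <entity name="jp" extend="append">
         <!-- A Latin surname, then a tab, then a Latin given name. -->
         <pattern>(?A=SURNAME:(?A:surname/nocontext/latin/jp))\t(?A=FORENAME:(?A:given_name/nocontext/latin/jp))</pattern>
      </entity>
   </grammar>
</grammars>
```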

Compile Custom Grammars

As with any Eduction grammar, OpenText recommends that you compile your grammar extensions before using them. You can use the edktool command-line tool to compile the XML file that contains your extension declarations into an ECR file.

For more information about compiling custom grammars, refer to the Eduction User and Programming Guide.

Modify Other Grammars and Entities

It is possible to extend any public entity in a PCI grammar. However, you cannot use the various private entities that the public ones use in their definitions.

For entities in the simpler grammars, such as driving or national ID, this might be less of a problem, as long as you know the format for the data portion of the entity. For example, you might want to add new landmarks to these entities.

However, be aware that existing definitions account for factors such as varying spaces, and additional words between the landmark and the data. In this case, you must emulate this behavior in your extensions, which might take a lot of work.

In practice, OpenText recommends that you make a support request to have these changes made in the official PCI grammars, unless you need to add support in a very short time frame. The existing definitions provide a lot of value because they cover so many match cases, and you are likely to miss some of these cases when you extend public entities without access to the private definitions that they use.