Regular Expressions

This section describes the regular expression syntax that Eduction supports.

The engine’s parser interprets regular expressions nearly identically to UNIX regular expression syntax. The engine’s syntax also includes some extensions for matching substrings.

Operators

The following table lists the base regular expression operators available in the Eduction engine and the pattern that each operator matches.

Operator    Matched Pattern

\           Quote the next metacharacter.
^           Match the beginning of a line.
$           Match the end of a line.
.           Match any character (except newline).
|           Alternation.
()          Used for grouping to force operator precedence.
[xy]        The character x or y.
[x-z]       The range of characters between x and z.
[^z]        Any character except z.

NOTE:

For performance reasons, HPE recommends that you explicitly list all the characters that you want to match, rather than using the [^z] operator.
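
As an illustration of how these operators combine, the following sketch uses grouping, alternation, a quoted metacharacter, and a character class. The grammar and entity names are hypothetical and are not part of any shipped grammar.

```xml
<grammars>
   <grammar name="example_operators">
      <!-- Hypothetical entity: matches "Mr.", "Mrs.", "Ms.", or "Dr." -->
      <!-- The backslash quotes the period so that it matches literally -->
      <entity name="title" type="public">
         <pattern>(Mr|Mrs|Ms|Dr)\.</pattern>
      </entity>
      <!-- Hypothetical entity: matches "gray" or "grey" -->
      <entity name="gray" type="public">
         <pattern>gr[ae]y</pattern>
      </entity>
   </grammar>
</grammars>
```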

Quantifiers

Operator    Matched Pattern

*           Match 0 or more times.
+           Match 1 or more times.
?           Match 0 or 1 times.
{n}         Match exactly n times.
{n,}        Match at least n times.
{n,m}       Match at least n times, but no more than m times.
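
A minimal sketch of quantifiers in use, assuming a hypothetical entity for five-digit postal codes with an optional four-digit extension. The grammar and entity names are illustrative only.

```xml
<grammars>
   <grammar name="example_quantifiers">
      <!-- Hypothetical entity: matches "94105" and "94105-1234" -->
      <!-- {5} and {4} require exact counts; ? makes the extension optional -->
      <entity name="zipcode" type="public">
         <pattern>[0-9]{5}(-[0-9]{4})?</pattern>
      </entity>
   </grammar>
</grammars>
```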

Metacharacters

Operator    Matched Pattern

\t          Match tab.
\n          Match newline.
\r          Match return.
\f          Match formfeed.
\a          Match alarm (bell, beep, and so on).
\e          Match escape.
\v          Match vertical tab.
\021        Match octal character (in this example, 21 octal).
\xF0        Match hex character (in this example, F0 hex).
\x{263a}    Match wide hex character (Unicode).
\w          Match word character: [A-Za-z0-9_].
\W          Match non-word character: [^A-Za-z0-9_].
\s          Match whitespace character. This metacharacter also includes \n and \r: [ \t\n\r].
\S          Match non-whitespace character: [^ \t\n\r].
\d          Match digit character: [0-9].
\D          Match non-digit character: [^0-9].
\b          Match word boundary.
\B          Match non-word boundary.
\A          Match start of string (never match at line breaks).
\Z          Match end of string. Never match at line breaks; only match at the end of the final buffer of text submitted for matching.
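
The following sketch shows metacharacters combined with quantifiers. It assumes a hypothetical entity for dates written as four digits, two digits, and two digits separated by hyphens; the names are illustrative, and the pattern does not check that the month or day is valid.

```xml
<grammars>
   <grammar name="example_metacharacters">
      <!-- Hypothetical entity: matches text such as "2016-03-27" -->
      <!-- \b anchors the match at word boundaries; \d matches a digit -->
      <entity name="isodate" type="public">
         <pattern>\b\d{4}-\d{2}-\d{2}\b</pattern>
      </entity>
   </grammar>
</grammars>
```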

Extensions

Operator

Matched Pattern

(?A:entity)

Match a previously defined entity, which is then copied into the new entity’s definition.

For example:

<include path="number_types_eng.ecr"/>
<entity name="fracpos" type="private">
   <pattern>(?A:number/fracalpha/eng)</pattern>
</entity>

Copying an entity improves pattern execution speed, but increases compilation time and memory usage. Copying is recommended unless the copied entity is large and is copied multiple times.

(?A^entity)

Match a previously defined entity, which is then referenced by the new entity.

Referencing an entity minimizes the size and memory usage of the grammar, but decreases performance. The performance impact can vary from unnoticeable to significant, depending on the size and structure of the grammar.
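
For comparison with the copy form, a reference might be written as in the following sketch. It assumes the same included file and entity path that the (?A:entity) example uses; the entity name fracpos_ref is hypothetical.

```xml
<include path="number_types_eng.ecr"/>
<!-- Hypothetical entity: references number/fracalpha/eng instead of copying it -->
<entity name="fracpos_ref" type="private">
   <pattern>(?A^number/fracalpha/eng)</pattern>
</entity>
```

Only the operator character changes (^ in place of :), so switching between copying and referencing to compare grammar size against match speed is a small edit.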

(?A!expr)

Match the expression expr but exclude it from the output. Use this operator for an expression that helps to identify an entity but is not part of the entity.

For example:

<grammars>
   <grammar name="person">
      <entity name="age" type="public">
         <pattern>(?A!Age:\s)[1-9][0-9]?</pattern>
      </entity>
   </grammar>
</grammars>

If this grammar is used to search the text

   Name: Simon. Age: 32. Address. 12 Fifth Street, Las Vegas.

the engine returns 32 but ignores 12, because 12 does not have the prefix “Age:”. The prefix must match, but it is excluded from the output.

(?A=component:expr)

Define a component within an entity’s definition. A component is a named part of an entity.

For example, the following grammar defines areacode and main as components:

<grammars>
   <grammar name="number">
      <entity name="phone" type="public">
         <pattern>(?A=areacode:[0-9]{3})-(?A=main:[0-9]{3}-[0-9]{4})</pattern>
      </entity>
   </grammar>
</grammars>

If the data is as follows

   The phone number is 408-555-1342.

and the following configuration options are set

   <OutputSimpleMatchInfo>false</OutputSimpleMatchInfo>
   <EnableComponents>true</EnableComponents>

then the output displays the areacode value 408 and the main value 555-1342 separately.

Token Properties

CAUTION:

Token properties will be deprecated in a future release. Use the equivalent explicit regular expressions instead of token properties.

Operator

Matched Pattern

(?A:{properties})

Matches a token that satisfies the list of properties provided. The properties are specified in a comma-separated list of one or more of the following:

  • num, alpha_num
  • all_caps, mixed_case, capword

Any of these properties can be prefixed with the negation operator '!' for exclusion.
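
As a sketch of the syntax described above, the following hypothetical entity combines one property with one negated property. The entity name is illustrative, and the comment describes the intended reading of the property names rather than confirmed engine behavior.

```xml
<grammars>
   <grammar name="example_token_properties">
      <!-- Hypothetical entity: a token with the capword property -->
      <!-- that does not have the num property -->
      <entity name="capitalized_token" type="public">
         <pattern>(?A:{capword,!num})</pattern>
      </entity>
   </grammar>
</grammars>
```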

