Wide Character-String Data (UTF-16 Support)

A wide character-string value is a sequence of double-byte characters or byte-pairs. The number of byte-pairs in a wide character string or sequence is called the length of the string or sequence. In Open PL/I, the maximum length of a wide string value is 16383 byte-pairs or double-byte characters. The byte-pairs are always stored in big-endian format.
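As an illustration outside PL/I, the big-endian byte-pair layout described above can be modeled in Python. This is only a sketch of the storage format, not Open PL/I code; the helper name `to_byte_pairs` and the constant `MAX_WIDE_LENGTH` are hypothetical names introduced for this example.

```python
MAX_WIDE_LENGTH = 16383  # maximum length in byte-pairs, taken from the text above


def to_byte_pairs(text: str) -> bytes:
    """Encode text as big-endian byte-pairs (hypothetical helper)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # Such characters do not fit in a single byte-pair; this sketch rejects them.
        raise ValueError("character needs more than one byte-pair")
    data = text.encode("utf-16-be")
    if len(data) // 2 > MAX_WIDE_LENGTH:
        raise ValueError("wide string too long")
    return data


pairs = to_byte_pairs("A\u0391")
print(pairs.hex())      # 00410391
print(len(pairs) // 2)  # 2 (the length is counted in byte-pairs, not bytes)
```

The high-order byte of each pair comes first, so the Latin letter A (0x0041) is stored as the bytes 00 41.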

A wide character-string of zero length is called a null wide string.

A wide character-string variable or wide character-string valued function is declared with one of the following attributes:

WIDECHAR(n)

or

WIDECHAR(n) VARYING

or

WIDECHAR(n) VARYINGZ

or

WIDECHAR(n) VARYING BIGENDIAN

where n is an integer-valued expression that specifies the maximum length of all wide string values that can be held by the variable or returned by the function. WCHAR can be used as a synonym for WIDECHAR.

The VARYING (VAR) attribute causes the string variable or function to hold or return values of varying lengths. Internally, the length of the varying string is recorded along with the value. Varying strings are not padded to their maximum length, n. The representation of a varying string variable in storage is such that any string of up to n wide characters may be held by the variable, and the length of the current string is retained as part of the value. The length field is always stored according to the native architecture of the machine for FIXED BINARY(15) data items.

The VARYINGZ (VARZ) attribute causes the string variable or function to hold or return values of varying lengths. Internally, varyingz strings are stored as a sequence of double-byte characters terminated by a '0000'x double-byte, and are not padded to their maximum length, n. The representation of a varyingz string variable in storage is such that any string of up to n double-byte characters may be held by the variable.

Without the VARYING attribute, a wide string variable or function always holds or returns values of length n. An assignment to a non-varying string always extends short values with blanks (0x0020) on the right to make them n wide characters long.

Assignments of a string of more than n wide characters to either a VARYING or a non-varying string variable cause only the leftmost n wide characters to be assigned and excess characters to be truncated.
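The padding and truncation rules above can be sketched in Python as a model of the semantics (not Open PL/I code). The function names are hypothetical, and each Python character stands for one wide character.

```python
BLANK = "\u0020"  # wide blank, byte-pair 0x0020


def assign_fixed(value: str, n: int) -> str:
    """Model assignment to a non-varying WIDECHAR(n) target."""
    # Keep the leftmost n characters, then pad on the right with blanks.
    return value[:n].ljust(n, BLANK)


def assign_varying(value: str, n: int) -> str:
    """Model assignment to a WIDECHAR(n) VARYING target."""
    # Keep the leftmost n characters; no padding is added.
    return value[:n]


print(repr(assign_fixed("AB", 4)))        # 'AB  '
print(repr(assign_fixed("ABCDEF", 4)))    # 'ABCD'
print(repr(assign_varying("AB", 4)))      # 'AB'
print(repr(assign_varying("ABCDEF", 4)))  # 'ABCD'
```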

Wide character-string values are compared from left to right using the binary values of the byte-pairs. Strings of unequal length are compared by effectively extending the shorter string with blanks (0x0020) on the right.
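A minimal Python model of this comparison rule follows. It is a sketch rather than Open PL/I code, the function name `compare_wide` is hypothetical, and it assumes every character fits in a single byte-pair so that `ord` yields the byte-pair value.

```python
def compare_wide(a: str, b: str) -> int:
    """Return -1, 0, or 1 after extending the shorter operand with blanks."""
    width = max(len(a), len(b))
    left = [ord(ch) for ch in a.ljust(width, "\u0020")]
    right = [ord(ch) for ch in b.ljust(width, "\u0020")]
    # List comparison proceeds left to right on the byte-pair values.
    return (left > right) - (left < right)


print(compare_wide("AB", "AB  "))      # 0
print(compare_wide("AB", "AC"))        # -1
print(compare_wide("AB\u0001", "AB"))  # -1
```

The last call shows a consequence of blank-extension: a longer string can compare low when its extra byte-pair (here 0x0001) is less than the blank (0x0020).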

Non-varying wide character-string variables always occupy exactly n*2 bytes of storage. As elements of arrays or members of a structure, they begin on the next available byte and are not aligned on word or other storage address boundaries. This permits an array of non-varying wide character strings to be stored and accessed as if it were a single wide string.

Varying wide character-string variables always occupy (n*2)+2 bytes of storage. The first 2 bytes contain an integer L (0 <= L <= n) that specifies the length of the wide string value currently stored in the variable. The string text occupies the first L*2 bytes of the storage following the 2-byte length field. The value of the last n-L byte-pairs of the variable is undefined. The value L can be accessed using the LENGTH built-in function. An array of varying wide character strings cannot be accessed as if it were a single wide character string.
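The varying layout can be pictured with a short Python sketch. It models the storage image only and is not Open PL/I code; the function name `varying_image` is hypothetical, and zero-filling the unused tail is a choice made for this example, since the text above leaves those bytes undefined.

```python
import struct


def varying_image(value: str, n: int) -> bytes:
    """Build a (n*2)+2 byte image of a WIDECHAR(n) VARYING variable."""
    text = value[:n]
    # Length field: a 2-byte signed integer in the machine's native byte order.
    length_field = struct.pack("=h", len(text))
    # String text: L byte-pairs, each stored big-endian.
    body = text.encode("utf-16-be")
    # Unused tail of n-L byte-pairs (undefined in PL/I, zero-filled here).
    filler = b"\x00\x00" * (n - len(text))
    return length_field + body + filler


image = varying_image("Hi", 4)
print(len(image))                         # 10
print(struct.unpack("=h", image[:2])[0])  # 2
print(image[2:6].hex())                   # 00480069
```

Note that the length field and the text can differ in byte order on a little-endian machine: the length is native, while the byte-pairs are always big-endian.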

Varyingz wide character-string variables always occupy n+1 double-bytes, that is (n+1)*2 bytes, of storage. The current length L of a varyingz string is determined by the position of the first '0000'x double-byte, where L = index(string, '0000'Wx) - 1. The string text occupies the first L double-bytes of the storage and is immediately followed by the '0000'x double-byte terminator. The value of the double-bytes after the terminator is undefined. The value L can be accessed using the LENGTH built-in function. An array of varyingz wide character-strings cannot be accessed as if it were a single wide character string.
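The terminator-based length rule can be modeled in Python as follows. This is a sketch of the layout only, not Open PL/I code, and the function name `varyingz_length` is hypothetical.

```python
def varyingz_length(storage: bytes) -> int:
    """Count the double-bytes that precede the first 0000 terminator."""
    # Step two bytes at a time so that only aligned double-bytes are tested.
    for i in range(0, len(storage) - 1, 2):
        if storage[i:i + 2] == b"\x00\x00":
            return i // 2
    raise ValueError("no terminator found")


# Image for n = 4: text, terminator, then two double-bytes of arbitrary content.
storage = "Hi".encode("utf-16-be") + b"\x00\x00" + b"\xff\xff" * 2
print(len(storage) // 2)         # 5
print(varyingz_length(storage))  # 2
```

The scan has to be aligned on double-bytes. In the byte sequence 00 48 01 00 00 00, a byte-wise search for two zero bytes would match at offset 3, in the middle of a character, whereas the aligned scan correctly reports a length of 2.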

See the Open PL/I User's Guide for specific alignment of varying character strings.

Note: Any string produced as an intermediate result requires storage allocated on the stack. System area storage is required for the ALLOCATE statement. Allocating a temporary storage block that exceeds the amount of available storage signals the ERROR condition.

Restrictions

  • All WIDECHAR byte-pairs are stored in big-endian format.
  • WIDECHAR characters in source files are not supported.
  • W string constants are not supported.
  • WIDECHAR expressions in stream I/O are not supported.
  • Implicit conversions to/from WIDECHAR in record I/O are not supported.
  • Implicit endian-ness flags in record I/O are not supported.
  • Only big-endian byte-pairs within WIDECHAR files are supported.
  • WIDECHAR(x) VARYINGZ is not supported.
  • Surrogate pairs are not supported.
  • UTF-8 is not supported.
  • The WIDECHAR data type is not supported when using the -ebcdic compiler option.

Wide Character String Constants

A wide character-string constant must be written as a Wide Character String Hexadecimal Literal. For example:

Dcl wstr widechar (2) init ('039103A9'Wx);
/* Greek upper case ALPHA OMEGA */
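To check which characters a given run of hexadecimal digits denotes, the digits can be decoded outside PL/I. The Python sketch below uses a hypothetical helper, `decode_wx`, and assumes four hexadecimal digits per wide character, interpreted as one big-endian byte-pair.

```python
import unicodedata


def decode_wx(hex_digits: str) -> str:
    """Decode hex digits as big-endian byte-pairs (hypothetical helper)."""
    if len(hex_digits) % 4 != 0:
        # Assumption of this sketch: four hex digits per wide character.
        raise ValueError("expected four hex digits per wide character")
    return bytes.fromhex(hex_digits).decode("utf-16-be")


for ch in decode_wx("039103A9"):
    print(f"{ord(ch):04X} {unicodedata.name(ch)}")
# 0391 GREEK CAPITAL LETTER ALPHA
# 03A9 GREEK CAPITAL LETTER OMEGA
```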

For wide character-string literals in which every byte-pair has a binary value below 0x0080 (the ASCII range, 0 through 127 decimal or 00 through 7F hexadecimal), ordinary character-string literals may be used, and the compiler performs the appropriate internal conversion to type WIDECHAR. For example:

Dcl wstr widechar (32) varying init ('Hello Widechar!!');
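The effect of this conversion on the stored bytes can be modeled in Python. The sketch below is not a description of the compiler's internals; the helper name `widen_ascii` is hypothetical, and it simply prefixes each ASCII byte with a zero high-order byte.

```python
def widen_ascii(literal: str) -> bytes:
    """Widen an ASCII literal to big-endian byte-pairs (hypothetical helper)."""
    if any(ord(ch) >= 0x80 for ch in literal):
        raise ValueError("literal contains a value of 0x0080 or above")
    # Each character becomes a 00 high-order byte followed by its ASCII code.
    return b"".join(b"\x00" + bytes([ord(ch)]) for ch in literal)


wide = widen_ascii("Hello Widechar!!")
print(len(wide) // 2)    # 16
print(wide[:4].hex())    # 00480065
```

The 16 resulting wide characters fit within the declared maximum of 32, so a varying target would record a current length of 16.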