Unicode Support

Unicode is the general name given to the collection of character encoding schemes that can be used to represent most national languages. The main encodings defined under Unicode are UTF-8, UTF-16 and UTF-32 (where UTF stands for Unicode Transformation Format).

The COBOL development system uses UTF-16 as its representation for national characters. This means that each national character takes up 2 bytes of memory, and there are 65536 possible character combinations available.
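The arithmetic behind these figures can be checked with a short sketch. It is written in Python rather than COBOL purely for illustration; the helper name utf16_byte_length and the sample characters are assumptions made for this example and are not part of the COBOL development system.

```python
# Illustrative sketch only: the names below are hypothetical helpers,
# not part of any COBOL product API.

CODE_UNIT_BITS = 16                      # one 2-byte national character
DISTINCT_VALUES = 2 ** CODE_UNIT_BITS    # number of values 16 bits can hold


def utf16_byte_length(text: str) -> int:
    """Return the number of bytes needed to store text as UTF-16.

    The "utf-16-le" codec is used so that no byte order mark is
    prepended to the result.
    """
    return len(text.encode("utf-16-le"))


print(DISTINCT_VALUES)  # 65536

# Sample characters: Latin capital A, e with acute accent, a CJK ideograph.
for ch in ("A", "\u00e9", "\u65e5"):
    print(f"U+{ord(ch):04X} occupies {utf16_byte_length(ch)} bytes in UTF-16")
```

Each of the sample characters occupies exactly 2 bytes, which matches the storage cost per national character described above.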

The UTF-8 encoding scheme is a variable-width Unicode encoding. Each valid Unicode code point is encoded using one to four 8-bit bytes. UTF-8 is a popular encoding scheme because it is backward-compatible with ASCII, it is independent of endianness, and it often provides a more compact representation of Unicode than UTF-16.
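The properties of UTF-8 listed above can be observed directly. The following is a minimal sketch in Python, chosen only for illustration; the helper names and the sample strings are assumptions made for this example and do not come from the COBOL development system.

```python
# Illustrative sketch only: helper names and sample data are hypothetical.


def utf8_byte_length(text: str) -> int:
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_byte_length(text: str) -> int:
    """Return the number of bytes text occupies as UTF-16 (no byte order mark)."""
    return len(text.encode("utf-16-le"))


# Variable width: these four code points need 1, 2, 3 and 4 bytes in UTF-8.
for ch in ("A", "\u00e9", "\u65e5", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {utf8_byte_length(ch)} byte(s) in UTF-8")

# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
ascii_text = "COBOL"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True

# Endianness: UTF-16 has two byte orders, UTF-8 has a single byte sequence.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))  # b'A\x00' b'\x00A'

# Compactness depends on the text: ASCII-heavy data is smaller in UTF-8,
# while text made up of CJK ideographs is smaller in UTF-16.
latin_text = "Hello, COBOL"
cjk_text = "\u65e5\u672c\u8a9e"
print(utf8_byte_length(latin_text), utf16_byte_length(latin_text))  # 12 24
print(utf8_byte_length(cjk_text), utf16_byte_length(cjk_text))      # 9 6
```

The last two lines show why UTF-8 is described as often, rather than always, more compact: the saving depends on how much of the data falls in the ASCII range.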

COBOL applications increasingly use Unicode to manage data from multiple national languages within a single application. The COBOL language support for Unicode complies with the ISO 2002 COBOL standard. National data can be stored in standard COBOL files or in an SQL database. In addition, the Unicode support can be used to pass national character strings directly between COBOL and Java. In DBCS environments, it can also be used side by side with the existing DBCS support, making it possible to move existing DBCS applications to Unicode incrementally as needed.