COBOL as XML

What does XML look like? Start with the assumption that it is a textual encoding of COBOL data (although this is not quite accurate, it is sufficient for now). Suppose you have the following COBOL definition in the Working-Storage Section:

01 contact.
   10 firstname pic x(10) value "John".
   10 lastname pic x(10) value "Doe".
   10 address.
      20 streetaddress pic x(20) value "1234 Elm Street".
      20 city pic x(20) value "Smallville".
      20 state pic x(2) value "TX".
      20 postalcode pic 9(5) value 78759.
   10 email pic x(20) value "jd@aol.com".

What does this information look like if you simply WRITE it out to a text file? It looks like this:

John      Doe       1234 Elm Street     Smallville          TX78759jd@aol.com

You can see that all the "data" is here, but the "information" is not. If you received this, or tried to read the file and make sense out of it, you would need to know more about the data. Specifically, you would have to know how it is structured and the sizes of the fields. It would be helpful to know how the author named the various fields as well, since that would probably give you a clue as to the content.
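To make the point concrete, the following is a minimal Python sketch (not part of any COBOL toolchain) of what a reader of the flat record has to do. The LAYOUT table and the helper names build_record and parse_record are hypothetical, introduced only for this illustration; the field names and widths are copied from the COBOL definition above. Without that table, the record cannot be split reliably.

```python
# Field names and widths copied by hand from the COBOL definition of "contact".
# This out-of-band knowledge is exactly what the flat record does not carry.
LAYOUT = [
    ("firstname", 10),
    ("lastname", 10),
    ("streetaddress", 20),
    ("city", 20),
    ("state", 2),
    ("postalcode", 5),
    ("email", 20),
]


def build_record(values):
    """Pad each value with trailing spaces to its field width."""
    return "".join(values[name].ljust(width) for name, width in LAYOUT)


def parse_record(record):
    """Slice a fixed-width record into named fields using LAYOUT."""
    fields = {}
    offset = 0
    for name, width in LAYOUT:
        # Trailing spaces are padding, not data, so strip them.
        fields[name] = record[offset:offset + width].rstrip()
        offset += width
    return fields


record = build_record({
    "firstname": "John",
    "lastname": "Doe",
    "streetaddress": "1234 Elm Street",
    "city": "Smallville",
    "state": "TX",
    "postalcode": "78759",
    "email": "jd@aol.com",
})
parsed = parse_record(record)
print(parsed)
```

If any width in LAYOUT is wrong, every field after it is sliced at the wrong offset, which is why the structure has to travel with the data.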

This is not a new problem; it is one that COBOL programmers (as well as other application programmers) have had to deal with on an ad hoc basis since the beginning of the computer age. But now, XML gives us a way to encode all of the information in a generally understandable way.

Here is how this information would be displayed in an XML document:

<contact>
   <firstname>John</firstname>
   <lastname>Doe</lastname>
   <address>
      <streetaddress>1234 Elm Street</streetaddress>
      <city>Smallville</city>
      <state>TX</state>
      <postalcode>78759</postalcode>
   </address>
   <email>jd@aol.com</email>
</contact>

In XML, each COBOL data item is coded as what is called an "element." Elements have names, and they can contain text, other elements, or both. In this case, the 01-level item "contact" becomes the <contact> element, coded as a start "tag" ("<contact>") and an end tag ("</contact>"), with everything in between representing its "content." Here, the <contact> element has as its content the elements <firstname>, <lastname>, <address>, and <email>. This corresponds precisely to the COBOL Data Division declaration for "contact." Similarly, the 10-level group item "address" becomes the element <address>, made up of the elements <streetaddress>, <city>, <state>, and <postalcode>. Each of the COBOL elementary items is coded as an element with text content alone. Notice that the XML form retains much of the semantic information that is missing from the raw COBOL output form of the data. As a bonus, the extraneous trailing spaces in the COBOL elementary items have been removed. In other words, the XML version of this record contains both the data itself and the structure of the data.
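The mapping just described can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not how any particular COBOL product generates XML: the nested dictionary stands in for the COBOL record (group items as dictionaries, elementary items as space-padded strings), and the helper to_element is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Stand-in for the COBOL record: group items are dicts, elementary
# items are strings padded with spaces to their PIC X width.
contact = {
    "firstname": "John      ",
    "lastname": "Doe       ",
    "address": {
        "streetaddress": "1234 Elm Street     ",
        "city": "Smallville          ",
        "state": "TX",
        "postalcode": "78759",
    },
    "email": "jd@aol.com          ",
}


def to_element(name, value):
    """Map a group item to an element with child elements, and an
    elementary item to an element with text content only."""
    element = ET.Element(name)
    if isinstance(value, dict):
        for child_name, child_value in value.items():
            element.append(to_element(child_name, child_value))
    else:
        # Drop the trailing padding spaces.
        element.text = value.rstrip()
    return element


root = to_element("contact", contact)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The output is the same document as the listing above, written on a single line without indentation.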

Now, what if the COBOL data had looked like the following:

01 contact.
   10 email pic x(20).
   10 firstname pic x(10).
   10 lastname pic x(10).
   10 address.
      20 city pic x(20).
      20 state pic x(2).
      20 postalcode pic 9(5).
      20 streetaddresslines pic 9.
      20 streetaddresses.
         30 streetaddress occurs 1 to 9 times
             depending on streetaddresslines pic x(20).

Apart from the reordering of some items, two things have changed in this example: the initial values have been removed, and there can now be up to nine "streetaddress" items. This is much closer to what you might expect in a real application. After the application code in the Procedure Division sets the values of the various items, the XML coding of the result might look like this:

<contact>
   <email>bs@aol.com</email>
   <firstname>Betty</firstname>
   <lastname>Smith</lastname>
   <address>
      <city>Galesburg</city>
      <state>IL</state>
      <postalcode>61401</postalcode>
      <streetaddresslines>3</streetaddresslines>
      <streetaddresses>
         <streetaddress>Knox College</streetaddress>
         <streetaddress>Campus Box 9999</streetaddress>
         <streetaddress>2 E. South St.</streetaddress>
      </streetaddresses>
   </address>
</contact>

Notice that the repeating item "streetaddress" has become three <streetaddress> elements, one for each occurrence in use. In this example, COBOL acts as an XML programming language, providing both the structure (schema) of the data and the data itself.
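The treatment of the repeating item can be sketched the same way. The Python below is a hedged illustration, not a description of any COBOL runtime: the function name streetaddresses_element is hypothetical, the list stands in for the nine-occurrence table, and the count plays the role of "streetaddresslines." Only the occurrences in use are emitted.

```python
import xml.etree.ElementTree as ET


def streetaddresses_element(table, line_count):
    """Emit one <streetaddress> per occurrence in use (1 through line_count)."""
    # Mirror the "occurs 1 to 9 times" bounds from the data definition.
    if not 1 <= line_count <= 9:
        raise ValueError("streetaddresslines must be between 1 and 9")
    parent = ET.Element("streetaddresses")
    for entry in table[:line_count]:
        # Each occurrence is PIC X(20); drop the trailing padding.
        ET.SubElement(parent, "streetaddress").text = entry.rstrip()
    return parent


# Stand-in for the table: room for nine entries, three of them in use.
lines_in_use = ["Knox College", "Campus Box 9999", "2 E. South St."]
table = [line.ljust(20) for line in lines_in_use] + [" " * 20] * 6

element = streetaddresses_element(table, 3)
print(ET.tostring(element, encoding="unicode"))
```

The six unused occurrences never reach the XML, so a reader can recover the count from the number of child elements alone.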

Even though these examples are very simple, they illustrate how powerful the compatibility between the COBOL data model and the XML information model can be. COBOL structures of arbitrary complexity have a straightforward XML representation. There are, it turns out, some things that you can specify in a COBOL data definition that cannot be coded as XML, but these can easily be avoided if you are programming your application for XML.