Defining Mainframe Transaction's Signature with an XML Schema; How To Convert Cobol Metadata

Converting Cobol metadata into an XML Schema using regular expression processing

In a nutshell, we can define an XML Schema using primitive data types and derived data types, where a derived type is built from primitive types or from other derived types. The primitive data types can be any of the standard formats (for our application we will use only string and integer).

Simple datatypes are declared with the <simpleType> element. They carry a name and a base type, and they can contain a valid constraining facet. Complex datatypes are declared with the <complexType> element and are defined by extension or restriction of other datatypes.

Instead of referring to a datatype defined in another part of the same schema, derived data types can also nest datatype definitions one inside the other, as in:

<element name="COURSES"><complexType><sequence>
<element name="COURSE-ID"><complexType><sequence>
<element name="COURSE-TYPE"><simpleType><restriction base="string">
<length value="04"/></restriction></simpleType></element>
<element name="SERV-LENGTH"><simpleType><restriction base="integer">
<totalDigits value="05"/></restriction></simpleType>

Although this kind of nested definition is harder to read than one that uses references, it is useful for automating the generation of the XML Schema from the copybook, as we will see shortly. For the full XML Schema obtained from the Cobol copybook, see Listing 3.

Regular Expressions 101
To convert from Cobol to XML Schema we need to recognize certain patterns. For example, we can build a rule saying that each group item in Cobol corresponds to a complexType in the schema, or that each elementary item containing a PIC clause corresponds to a simpleType. A useful tool for recognizing patterns in a text file is the regular expression.

Regular expressions, also called regex, are used in several UNIX utilities and languages (Perl, awk, etc.). A regex lets us locate a specific pattern or a particular sequence of characters in a string. The pattern is defined using a rather powerful syntax.

Regular expressions are built around special characters that are matched against the actual string. These special characters allow us to create a template against which each portion of the compared text is matched and processed in a particular way.

For example, the regular expression ^.PIC * matches a string that starts with exactly one character (^.), followed by the string "PIC", followed by zero or more blanks. It matches APIC and BPIC__, but it does not match CCPIC (two characters before PIC) or PIC (no character before PIC). As this example shows, special characters play an essential role in regex definitions. Table 1 introduces the most common special characters used in regex.

Although this is a very basic list of special characters, it will suffice for our project. For more extensive information about regular expressions, see the reference section.
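The behaviour of this expression can be checked with a few lines of Java. This is a minimal sketch that uses the standard java.util.regex package (not the ORO classes used later in the article); the class and method names are made up for the illustration, and the trailing underscores of BPIC__ are assumed to stand for blanks.

```java
import java.util.regex.Pattern;

final class PicRegexDemo {
    // Same expression as in the text: one arbitrary character at the start
    // of the string, then "PIC", then zero or more blanks.
    static final Pattern PIC = Pattern.compile("^.PIC *");

    // True when the pattern is found at the beginning of the input.
    static boolean startsWithPic(String input) {
        return PIC.matcher(input).find();
    }

    public static void main(String[] args) {
        // The four sample strings discussed in the text.
        for (String s : new String[] {"APIC", "BPIC  ", "CCPIC", "PIC"}) {
            System.out.println("[" + s + "] -> " + startsWithPic(s));
        }
    }
}
```

The first two strings are reported as matches and the last two are not, because the ^ anchor leaves room for only one character before PIC.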

The Project
To convert a copybook into an XML Schema I defined some conversion rules. To limit the scope of this project I will leave out some Cobol artefacts, such as arrays, and focus on the basic structure of the Cobol metadata. As homework, you can later try to extend the code to include these structures.

As said before, Cobol organizes the metadata in levels. To produce an XML Schema representation I will convert any level that does not include a PIC clause (that is, any level that doesn't define a basic field) into a complexType. Since a level usually has other levels nested inside it, I will nest the complexType definitions to mimic the Cobol definition, using the syntax seen in the XML Schema section.

The corollary of this rule is that any definition including a PIC clause will be considered a simpleType. We will use the length as a restriction in the definition of the field.
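The simpleType rule can be sketched in Java as follows. This is only an illustration under several assumptions: it uses java.util.regex, it only understands PIC clauses of the form X(n) or 9(n), it maps X to a string with a length facet and 9 to an integer with a totalDigits facet, and the level numbers in the sample lines are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ElementaryItemDemo {
    // Assumed line shape: "<level> <name> PIC X(n)." or "<level> <name> PIC 9(n)."
    private static final Pattern ELEMENTARY =
        Pattern.compile("^\\s*(\\d+)\\s+([A-Z0-9-]+)\\s+PIC\\s+([X9])\\((\\d+)\\)");

    // Returns the nested <element> definition for an elementary item, or null
    // for a group item (no PIC clause) or a line this sketch cannot parse.
    static String toSimpleType(String line) {
        Matcher m = ELEMENTARY.matcher(line);
        if (!m.find()) {
            return null;
        }
        String name = m.group(2);
        boolean alphanumeric = m.group(3).equals("X");
        String base = alphanumeric ? "string" : "integer";
        String facet = alphanumeric ? "length" : "totalDigits";
        return "<element name=\"" + name + "\"><simpleType>"
             + "<restriction base=\"" + base + "\">"
             + "<" + facet + " value=\"" + m.group(4) + "\"/>"
             + "</restriction></simpleType></element>";
    }

    public static void main(String[] args) {
        // Hypothetical copybook lines; the level numbers are made up.
        System.out.println(toSimpleType("    10 COURSE-TYPE   PIC X(04)."));
        System.out.println(toSimpleType("    10 SERV-LENGTH   PIC 9(05)."));
        System.out.println(toSimpleType("  05 COURSE-ID."));
    }
}
```

A line without a PIC clause returns null here, which is the signal that the group-item rule (complexType) applies instead.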

The Cobol example seen in the first paragraph can then be translated as a complexType called COURSES that is composed of one complexType, COURSE-ID, and one simpleType, COURSE-NAME. COURSE-ID is composed, in turn, of two simpleType fields: COURSE-TYPE and COURSE-NUMBER.

So with these two simple rules I can try to produce the schema. Now I will explain the tool we will use to achieve this objective.

The Program
To automate the conversion to XML Schema I coded a Java program that uses regular expressions to do the job. The program reads the file containing the copybook, matches it record by record against a pattern defined by a regex, and then writes a schema definition to another file. Since the definitions are usually nested, we need to keep track of the levels that are open in order to produce the closing tags (</complexType>, </element>, etc.).
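One way to keep track of the open levels is a stack of Cobol level numbers. The sketch below is not the program described in this article; it assumes the copybook has already been parsed into simple entries (the Item class is hypothetical), it treats every elementary field as a string, and the level numbers and lengths in the sample data are invented.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class LevelTrackerDemo {
    private static final String CLOSE_GROUP = "</sequence></complexType></element>";

    // Hypothetical pre-parsed copybook entry; length < 0 marks a group item.
    static final class Item {
        final int level;
        final String name;
        final int length;

        Item(int level, String name, int length) {
            this.level = level;
            this.name = name;
            this.length = length;
        }
    }

    static String toSchema(List<Item> items) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> open = new ArrayDeque<>(); // levels of groups still open
        for (Item item : items) {
            // A level number not greater than an open group's level ends that group.
            while (!open.isEmpty() && open.peek() >= item.level) {
                open.pop();
                out.append(CLOSE_GROUP).append('\n');
            }
            if (item.length < 0) {
                // Group item: open a complexType and remember its level.
                out.append("<element name=\"").append(item.name)
                   .append("\"><complexType><sequence>\n");
                open.push(item.level);
            } else {
                // Elementary item: emit a complete simpleType element.
                out.append("<element name=\"").append(item.name)
                   .append("\"><simpleType><restriction base=\"string\">")
                   .append("<length value=\"").append(item.length).append("\"/>")
                   .append("</restriction></simpleType></element>\n");
            }
        }
        // Close whatever is still open at the end of the copybook.
        while (!open.isEmpty()) {
            open.pop();
            out.append(CLOSE_GROUP).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Item> copybook = Arrays.asList(
            new Item(1, "COURSES", -1),
            new Item(5, "COURSE-ID", -1),
            new Item(10, "COURSE-TYPE", 4),
            new Item(10, "COURSE-NUMBER", 5),
            new Item(5, "COURSE-NAME", 30));
        System.out.print(toSchema(copybook));
    }
}
```

With this sample data, COURSE-ID is closed when COURSE-NAME arrives at the same level, and COURSES is closed at the end of the input.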

The program uses a set of classes included in Jakarta (mainly under org.apache.oro.text). These classes give us the basic functionality to search based on regular expressions:

import org.apache.oro.text.awk.*;
import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;

The regex functionality is provided by three classes: Pattern, AwkMatcher, and AwkCompiler. AwkCompiler compiles a regex, as in:

Pattern pattern = compiler.compile("(\\sPIC)|(\\sVALUE)|"
    + "(^ *$)|(\\sCOPY\\s)");

The compiled pattern can then be matched against a string (contained here in an irecord variable) using an AwkMatcher object.
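As a rough illustration of this matching step, the sketch below uses the standard java.util.regex package rather than ORO's AwkMatcher, so the class and method names are assumptions and not the author's code. Only the expression and the irecord name come from the text, and the sample records are invented.

```java
import java.util.regex.Pattern;

final class RecordFilterDemo {
    // Same alternatives as the expression compiled above: a PIC clause,
    // a VALUE clause, a blank line, or a COPY statement.
    private static final Pattern PATTERN =
        Pattern.compile("(\\sPIC)|(\\sVALUE)|(^ *$)|(\\sCOPY\\s)");

    // True when any of the four alternatives occurs somewhere in the record.
    static boolean matchesRecord(String irecord) {
        return PATTERN.matcher(irecord).find();
    }

    public static void main(String[] args) {
        // Hypothetical copybook records.
        String[] records = {
            "   10 COURSE-TYPE PIC X(04).",
            "   05 COURSE-ID.",
            "",
            "   COPY CUSTREC."
        };
        for (String irecord : records) {
            System.out.println("[" + irecord + "] -> " + matchesRecord(irecord));
        }
    }
}
```

In this sketch a group item such as COURSE-ID is the only record that does not match, since it has no PIC, VALUE, or COPY and is not blank.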


Edgardo Burin works for ING Canada as a solution architect in integration projects using webMethods. He works in different projects integrating mainframe transactions, MQ services, and Oracle databases using webMethods. He has more than 10 years of experience managing infrastructure. His areas of expertise are in Oracle databases, integration, and service-oriented architecture.

