Generating XML Instances from Flat Files
A schema-based approach
Enterprise applications in sectors such as banking and healthcare still use flat files to import and export data between applications. Flat files contain machine-readable data that is typically encoded in printable characters. There is a growing need for these applications to interact with XML-aware applications and Web services, and to satisfy this need they must convert flat file data to an XML format. XML is well suited to data interchange because XML documents are tagged, easily parsed, and can represent complex data structures. Converting a flat file to XML requires that the data structure embedded in the flat file be described in some template form that can drive the conversion. Custom solutions exist that use XML templates and XML DTDs to capture the data structure of flat files; this article discusses a new, schema-based approach to parsing flat files into an XML instance.

Why XML Schema?
Several kinds of commercial software are available that convert flat files to XML instances based on proprietary templates and conversion routines. These solutions are tailored to meet specific needs and do not scale to fit the requirements of generic flat-file-to-XML-instance generation. The approach described here is based on open standards and open source components: W3C XML Schema, the schema API, and the Xerces XML parser's schema implementation. It is suitable for any Java project or custom XML instance-generation project that uses open source technologies.

Process
The following steps explain the conversion of a flat file to an XML instance using XML Schema.
A "," delimited flat file:
A fixed-length field flat file:
In these flat files, the data and the data structure remain the same, but the representation changes depending on the flat file type. Now let's go through the steps to convert these flat files to an XML instance. Figure 1 shows the approach in detail.

Data Representation
For all the examples, either of the two schema representations may be adopted. Both representations are flexible enough to describe the data structure using the W3C schema language. Our examples use the first schema definition. The examples here are simple, but the same approach applies to complex cases as well.

Parsing Logic Implementation
These control attributes are pivotal to parsing the flat file and are defined in the t2xml.IParseProperties interface. The class xerces.xs.instance.T2xmlInstance implements this interface and contains all the parsing logic that populates the XML instance with flat file data, as discussed in detail in the "Populating the Instance With Data" section. See Listing 2 for the control attributes.

Apart from these control attributes, the attributes minOccurs and maxOccurs play a crucial role in determining the repetition of containers (records): their values decide which containers are optional and which are required. For example, if minOccurs is "0," the container is optional; if it is more than "0," the container is mandatory. If maxOccurs is "unbounded," the number of containers is determined by the number of records in the actual flat file. However, if a number is specified in the schema, that number of records is expected in the flat file. Now let's see how these control attributes are used in the schema definition to mark parsing instructions for the flat file.

Delimited Flat File Case
777227878,Simi? D Roy,123000.00
This shows an employee record with three fields: Social Security Number (SSN), name, and salary.
The name field can be subdivided into first name and last name and separated by a " " delimiter. The "?" is considered an escape character in the name field. Figure 2 shows the basic mapping for a delimited record to a schema definition. In Figure 2 the full record is mapped to the Employee element. Since the record is a delimited one, the following control attributes are added to the Employee element:
For the ssn and salary fields the mapping is simple, as these are contained within the Employee container and do not have any additional contained objects inside. The following control attribute is added for them:
For the name field the mapping is a little complex as it contains the subfields first name and last name. Thus it's a container as well as a contained object itself. The following control attributes should be added for it.
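To make the role of the separator and escape character concrete, here is a minimal sketch in plain Java of how the sample record could be split into fields and subfields. It is not the T2xmlInstance implementation; the class name DelimitedFields and the method splitEscaped are hypothetical, and the sketch assumes that the "?" escape applies only inside the name field, as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's source jar.
final class DelimitedFields {

    /**
     * Splits text on the delimiter. An escape character makes the character
     * that follows it literal, so an escaped delimiter does not end a field.
     */
    static List<String> splitEscaped(String text, char delimiter, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == escape && i + 1 < text.length()) {
                // Drop the escape itself and keep the next character as data.
                current.append(text.charAt(++i));
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String record = "777227878,Simi? D Roy,123000.00";
        // Assumption: the record-level "," separator needs no escape handling.
        String[] fields = record.split(",", -1);
        // The name field uses " " as its separator and "?" as its escape.
        List<String> name = splitEscaped(fields[1], ' ', '?');
        System.out.println("ssn=" + fields[0]);
        System.out.println("first=" + name.get(0) + ", last=" + name.get(1));
        System.out.println("salary=" + fields[2]);
    }
}
```

With the sample record, the escaped space keeps "Simi D" together as the first name, and the unescaped space separates it from the last name "Roy".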
Fixed-length Flat File Case
777227878Simi? D Roy 123000.00
The fixed-length field case is almost the same as the delimited case; the only difference is that here the fields have fixed lengths and are not separated by any delimiters. Therefore, the control attributes for the Employee element are a little different from those of the delimited case.
For the contained objects there is an additional attribute to specify the object length. The respective lengths of the ssn, name, and salary fields were updated in the t2xml:object_len attribute. So, for contained objects here are the required attributes:
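To illustrate how per-field lengths drive the slicing of a fixed-length record, here is a minimal sketch in plain Java. The class FixedLengthFields and its slice method are hypothetical, and the lengths 9, 12, and 9 are assumptions chosen to fit the sample record as printed above; the actual values belong in the t2xml:object_len attributes of the schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's source jar.
final class FixedLengthFields {

    /** Cuts the record into consecutive fields of the given lengths and trims padding. */
    static List<String> slice(String record, int... lengths) {
        List<String> fields = new ArrayList<>();
        int offset = 0;
        for (int length : lengths) {
            if (offset + length > record.length()) {
                throw new IllegalArgumentException(
                        "Record too short: need " + (offset + length)
                        + " characters, got " + record.length());
            }
            fields.add(record.substring(offset, offset + length).trim());
            offset += length;
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "777227878Simi? D Roy 123000.00";
        // Assumed lengths: ssn = 9, name = 12 (padded), salary = 9.
        System.out.println(slice(record, 9, 12, 9));
    }
}
```

The name field that comes out of this step still contains the "?" escape, so it can be split into first and last name with the same subfield logic as in the delimited case.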
The complete schema definition for the fixed-length case can be found in the file fixedLength-sample.xsd in the source jar.

Default Instance Generation
The source jar contains the full class declaration for the SchemaInstance class. The handler methods in the class are called recursively to generate the XML instance. The sample code below demonstrates how to use this class to generate an XML instance from a schema definition:

SchemaInstance lSchemaInstance = new SchemaInstance(aSchemaFilePathOrURL);

The XML instance generated from the schema example above looks like this:

<root_element>

Populating the Instance With Data
The control attributes indicate whether an element is a container or a contained object, and also specify the container end token, the object separator, and so on. Depending on the control attributes, after the XML instance is generated for a particular schema element, the physical record is read from the flat file and the instance is populated with live data from it. This process is repeated for each record defined in the flat file. Only after the full schema definition has been traversed (starting from the root element) is a filled-in instance representing the full schema definition created. If the maxOccurs attribute is "unbounded" for a schema element, as many XML instances of this element are created as there are records available in the flat file; otherwise the number specified in the schema definition is used. Looking up the control attributes and handling them correctly is very important when filling the XML instance with data.

To start, the implementation class xerces.xs.instance.T2xmlInstance implements the t2xml.IParseProperties interface and extends the xerces.xs.instance.SchemaInstance class. The SchemaInstance class generates the bare-bones XML elements; the derived T2xmlInstance class, with the help of IParseProperties, fills these XML instances with data from the flat file.
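The occurrence rules described earlier (minOccurs deciding whether a container is optional, maxOccurs deciding how many instances are created) can be sketched in plain Java as follows. The class OccurrenceRule is hypothetical and is not part of the source jar. Capping the count at a numeric maxOccurs and rejecting input with fewer records than minOccurs are this sketch's interpretation of the rules, not confirmed behavior of T2xmlInstance.

```java
// Hypothetical helper, not part of the article's source jar.
final class OccurrenceRule {

    /** Marker for maxOccurs="unbounded". */
    static final int UNBOUNDED = -1;

    private final int minOccurs;
    private final int maxOccurs;

    OccurrenceRule(int minOccurs, int maxOccurs) {
        if (minOccurs < 0 || (maxOccurs != UNBOUNDED && maxOccurs < minOccurs)) {
            throw new IllegalArgumentException("Invalid occurrence range");
        }
        this.minOccurs = minOccurs;
        this.maxOccurs = maxOccurs;
    }

    /** A container is optional when minOccurs is 0, mandatory otherwise. */
    boolean isOptional() {
        return minOccurs == 0;
    }

    /** Number of XML instances to create for the records found in the flat file. */
    int instancesFor(int recordsAvailable) {
        if (recordsAvailable < minOccurs) {
            throw new IllegalStateException(
                    "Expected at least " + minOccurs + " records, found " + recordsAvailable);
        }
        if (maxOccurs == UNBOUNDED) {
            return recordsAvailable; // one instance per available record
        }
        return Math.min(recordsAvailable, maxOccurs); // assumption: cap at the schema value
    }

    public static void main(String[] args) {
        OccurrenceRule employees = new OccurrenceRule(0, UNBOUNDED);
        System.out.println(employees.isOptional());
        System.out.println(employees.instancesFor(5));
    }
}
```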
In Listing 3 you can see a skeleton representation of the T2xmlInstance class. The last three methods in the class skeleton override methods of the SchemaInstance class in order to fill the XML elements that SchemaInstance generates. The getRootSchemaElement(.) method is overridden to find the root element in accordance with the control attribute "t2xml:rootelem". The handleParticle(.) method is overridden to look up the control attributes for container and container type, so that the filler object that writes data into the generated XML instance is set up properly. The fillupData(.) method is overridden to fill data into the XML instance based on the control attribute for object marking and the type of filler object passed in. The other methods in the class are helper methods that support the fill-up mechanism. The full source code for T2xmlInstance may be found in the source jar. The T2xmlInstance class uses the control attributes explained in Table 1 to populate the default XML instance with data.

To run T2xmlInstance as an application, download source.jar, unzip the contents, and try the following commands for the delimited sample case and the fixed-length case, respectively.

{your jdk home}\bin\java -cp .;classes;lib\xerces\resolver.jar;lib\xerces\xercesImpl.jar;lib\xerces\xml-apis.jar;lib\xerces\xmlParserAPIs.jar xerces.xs.instance.T2xmlInstance test\delimited-sample.xsd test\delimited-input.txt test\delimited-output.xml

{your jdk home}\bin\java -cp .;classes;lib\xerces\resolver.jar;lib\xerces\xercesI

The XML instance for the delimited and fixed-length cases is shown in Listing 4. Because the data structure is exactly the same for these two cases, the generated XML is also identical.