XML Journal Feature: Transforming Large XML Documents, An Alternative to XSLT
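With the evolution of XML, the XSL standard also became very popular for transforming XML data to XML, text, PDF, etc. However, there are some limitations to XSLT transformation. Today's XSLT processors rely on holding input data in memory as a DOM tree while the transformation is taking place. The tree structure in memory can be as much as ten times the original data size, so in practice the limit on data size for an XSLT conversion is just a few megabytes. As a result, XSLT can only handle XML documents of moderate size, because the full input DOM needs to be in memory for any XSL transformation.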

As we know, classical XSLT loads the full input DOM into memory to perform the transformation, so there is a fixed limit on the number of "Person" elements an XSL transformation can handle before running out of memory. Whether a transformation succeeds depends on the available system resources; a very large document can exhaust them, and it is not feasible to add resources every time a transformation must be completed. With classical XSLT, therefore, every system has a practical upper limit on the size of document it can transform. Loading the input source incrementally seems a viable way around this shortfall, but the approach cannot be applied to a classical XSL transformation because the processor has no advance knowledge of the structure of the output file.

For XML-to-XML transformation, the output format is known in advance if a schema describes the output file. In enterprise applications that carry out XML-to-XML transformations, such a schema is often present so that the transformed file can be validated against it. In that case, a schema-based transformation can process the document incrementally, loading the input as directed by control attributes added to the schema definition.

Let's consider the example above to see how a schema-based approach can load the input DOM incrementally and discard each processed chunk of data after transformation, so that, ideally, an arbitrarily large document can be transformed using only resources comparable to those available to a classical XSL transformation.

The schema for the output in the example above can be found in the file "personsinfo.xsd." Additional attributes are added to some of the elements in the schema declaration to enable the transformation.

In Listing 1, the basic schema definition is pretty simple. To drive the transformation, instruction attributes are added from the namespace xmlns:saxTran="http://oracle.schemaTransform/saxTran"; these attributes carry the instructions used to carry out the actual transformation. All of the control attributes in the Listing 1 schema definition are shown in italics. Right now only the few attributes shown in Table 1 are used, but more will be needed as the transformations become more complex.
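As a rough illustration of how a processor might read such control attributes, the Python sketch below parses a small annotated schema and collects the saxTran attributes per element. The schema fragment is hypothetical and is not the article's Listing 1: the target element names (PersonInfo, FirstName, LastName), the value "true" for saxTran:streamNode, and the value syntax of saxTran:function are all assumptions made for the example. Only the namespace URI and the attribute names saxTran:streamNode, saxTran:match, and saxTran:function come from the article.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
SAXTRAN = "http://oracle.schemaTransform/saxTran"

# Hypothetical annotated schema fragment (NOT the article's Listing 1).
SCHEMA = f"""<xs:schema xmlns:xs="{XS}" xmlns:saxTran="{SAXTRAN}">
  <xs:element name="PersonInfo" saxTran:streamNode="true"
              saxTran:match="/OrgChart/Office/Department/Person">
    <xs:complexType><xs:sequence>
      <xs:element name="FirstName" type="xs:string" saxTran:match="./First"/>
      <xs:element name="LastName" type="xs:string" saxTran:match="./Last"/>
    </xs:sequence></xs:complexType>
  </xs:element>
  <xs:element name="TotalPersons" type="xs:integer" saxTran:function="count(.)"/>
</xs:schema>"""


def control_attributes(schema_text):
    """Return {element name: {control attribute: value}} for annotated elements."""
    found = {}
    for elem in ET.fromstring(schema_text).iter(f"{{{XS}}}element"):
        # ElementTree stores namespaced attributes as "{uri}localname".
        controls = {
            key.split("}", 1)[1]: value
            for key, value in elem.attrib.items()
            if key.startswith(f"{{{SAXTRAN}}}")
        }
        if controls:
            found[elem.get("name")] = controls
    return found


for name, controls in control_attributes(SCHEMA).items():
    print(name, controls)
```

Keeping the instructions in a separate namespace means the annotated schema can still be used as an ordinary schema for validating the output.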

Basic SAX-based Transformation Implementation
Now, let's go under the hood of a basic implementation and see how the input DOM can be loaded incrementally using the schema-driven transformation approach.

Figure 1 shows the approach. The schema that defines the target XML carries special attributes that match and map elements from the input to the target. First, a skeleton output document without any values is constructed from the schema definition, down to the element on which saxTran:streamNode is defined. The saxTran:match attribute of that element gives the XPath of the input node on which streaming is to be done. On each occurrence of that input node, a partial DOM is constructed in memory. All XPath references in the schema definition that the partially loaded DOM can satisfy are evaluated, and their values are placed in the skeleton DOM created from the schema. Some saxTran:function attributes may need the same XPath for their evaluation; in those cases the value of the XPath is calculated and added as an argument to the function-call expression. Once the XPath references in the schema definition are fully satisfied, the streamed node is unloaded from memory and the next one is loaded for processing. Once every occurrence of the matching XPath in the input source has been handled, the pending functions are evaluated and each final value is populated in the node containing that function.
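The load, map, and unload cycle described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the author's implementation: it uses the standard library's xml.etree.ElementTree.iterparse in place of a hand-written SAX handler, the field mapping is hard-coded rather than read from a schema, the target names PersonInfo, FirstName, and LastName are hypothetical, and the second Person record is made-up sample data.

```python
import io
import xml.etree.ElementTree as ET

# Small sample input; the second Person is invented for illustration.
SAMPLE = """<OrgChart>
  <Office>
    <Department>
      <Person><First>Vernon</First><Last>Callaby</Last><Shares>1500</Shares></Person>
      <Person><First>Frank</First><Last>Further</Last><Shares>0</Shares></Person>
    </Department>
  </Office>
</OrgChart>"""

# Hypothetical mapping: target element name -> XPath relative to the stream node.
FIELD_MAP = {"FirstName": "./First", "LastName": "./Last"}
STREAM_PATH = ["OrgChart", "Office", "Department", "Person"]


def stream_transform(source, stream_path=STREAM_PATH, field_map=FIELD_MAP):
    """Yield one target element per stream node, then unload that node."""
    stack = []  # currently open input elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        if [e.tag for e in stack] == stream_path:
            # The stream node is now fully loaded: evaluate the relative XPaths.
            target = ET.Element("PersonInfo")
            for name, xpath in field_map.items():
                ET.SubElement(target, name).text = elem.findtext(xpath, default="")
            yield target
            # Detach the processed node so it can be garbage-collected.
            if len(stack) > 1:
                stack[-2].remove(elem)
        stack.pop()


for person in stream_transform(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(ET.tostring(person, encoding="unicode"))
```

Because each processed Person is detached from its parent, the partial input DOM never holds more than the ancestor chain plus one stream node, however many records the input contains.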

In the OrgChart example, streaming is applied to the /OrgChart/Office/Department/Person node. So, at any time during the transformation process, the in-memory input DOM will look like the following:


<OrgChart>
  <Office>
    <Department>
      <Person>
        <First>Vernon</First>
        <Last>Callaby</Last>
        <Title>Office Manager</Title>
        <PhoneExt>582</PhoneExt>
        <EMail>v.callaby@nanonull.com</EMail>
        <Shares>1500</Shares>
      </Person>
    </Department>
  </Office>
</OrgChart>
For each Person node, the values of all matching XPath expressions, such as "./First" and "./Last", are populated in the target skeleton structure. For aggregate functions such as count, sum, and avg, the function expression is updated continuously with the actual XPath values.

So, midway through the transformation, when three Person nodes have been processed, the target nodes for the three aggregate functions will look like this:

<TotalPersons><![CDATA[saxTran:count(1,1,1)]]></TotalPersons>
<AvgSharePerPerson><![CDATA[saxTran:avg(1500,0,NaN)]]></AvgSharePerPerson>
<TotalSharesWithPersons><![CDATA[saxTran:sum(1500,0,NaN)]]></TotalSharesWithPersons>

So, with each repetition of the Person node, another argument populated with the actual XPath value is added to each function. All of the functions are evaluated once the input source has been fully read, which populates the final values of these elements.
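The deferred evaluation of aggregates can be sketched as follows. The DeferredAggregate class and its method names are hypothetical, and one behavior is a loud assumption: the article does not say how NaN arguments are treated at evaluation time, so this sketch simply ignores them, whereas a real implementation might propagate NaN instead.

```python
import math


def _fmt(value):
    """Render a number as in the placeholder text: 1500, 0, NaN."""
    return "NaN" if math.isnan(value) else format(value, "g")


class DeferredAggregate:
    """Collects one argument per stream node and evaluates at end of input."""

    def __init__(self, name):
        if name not in ("count", "sum", "avg"):
            raise ValueError(f"unsupported aggregate: {name}")
        self.name = name
        self.args = []

    def add(self, raw):
        """Append the value found in the current stream node (NaN if not numeric)."""
        try:
            self.args.append(float(raw))
        except (TypeError, ValueError):
            self.args.append(math.nan)

    def expression(self):
        """Placeholder text held in the target node while streaming."""
        args = ",".join(_fmt(a) for a in self.args)
        return f"saxTran:{self.name}({args})"

    def evaluate(self):
        """Final value. ASSUMPTION: NaN arguments are ignored."""
        values = [a for a in self.args if not math.isnan(a)]
        if self.name == "count":
            return len(values)
        if self.name == "sum":
            return sum(values)
        return sum(values) / len(values) if values else math.nan


total = DeferredAggregate("sum")
for shares in ("1500", "0", None):  # third Person has no usable Shares value
    total.add(shares)
print(total.expression())  # saxTran:sum(1500,0,NaN)
print(total.evaluate())    # 1500.0
```

Only the argument lists survive between stream nodes, which is why the aggregates do not require the Person nodes themselves to stay in memory.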

With this approach there is no inherent limit on the size of the input file that can be processed. Because memory is released after each stream node is processed, the transformation can handle arbitrarily large XML documents. Control attributes might also be specified for streaming the output DOM, so that it can be serialized as soon as a required chunk of data has been transformed. This gives the flexibility of using any large XML document in a transformation process, which was impractical with previous XSLT processors.
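Serializing the output chunk by chunk might look like the sketch below. It assumes the transformed chunks arrive as an iterable of ElementTree elements; the function name serialize_incrementally, the root element name PersonsInfo, and the demo data are all hypothetical, and the article's control attributes for output streaming are not modeled.

```python
import io
import xml.etree.ElementTree as ET


def serialize_incrementally(chunks, out, root_tag="PersonsInfo"):
    """Write each transformed chunk as soon as it is ready, then forget it."""
    out.write(f"<{root_tag}>\n")
    written = 0
    for chunk in chunks:
        out.write(ET.tostring(chunk, encoding="unicode"))
        out.write("\n")
        written += 1
    out.write(f"</{root_tag}>\n")
    return written


def demo_chunks():
    """Stand-in for the transformer: yields one target element at a time."""
    for first in ("Vernon", "Frank"):
        elem = ET.Element("PersonInfo")
        ET.SubElement(elem, "FirstName").text = first
        yield elem


buffer = io.StringIO()
count = serialize_incrementally(demo_chunks(), buffer)
print(count, "chunks written")
```

Since the chunks are consumed from a generator, only one output element exists in memory at a time; any file-like object opened for text writing could take the place of the StringIO buffer.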

Summary
Huge database dumps or XML produced from serialized data records can be transformed effectively with this approach. The problem it addresses arises whenever large XML files are transformed with XSLT: the bottleneck lies in loading the input XML as a DOM tree. In most cases the XML data repeats a particular element (a data record) thousands of times, and loading all of the records at once exhausts the transformer's memory. This approach can augment the classical XSLT engine by streaming the input source so that the transformation can run to completion. It can also store transient variables to pass information to the next set of XSLT transformations in a pipeline. This approach opens a new dimension in XML transformation and offers a solution to the previously intractable task of transforming very large XML files. Here I have discussed only the approach, so be sure to check back for the implementation in my next article.

About Indroniel Deb Roy
Indroniel Deb Roy works as a UI architect for Packeteer Inc. Previously he contributed to the development of Oracle XML Publisher as development manager and participated actively in developing Novell's exteNd XML integration server. He has a passion for innovation and works with various XML and J2EE technologies.
