YOUR FEEDBACK
John Portnov wrote: This code does not work for me. I created a new website and a C# console applic...
AJAXWorld RIA Conference
$300 Savings Expire August 22
Register Today and SAVE!


2008 East
DIAMOND SPONSOR:
Data Direct
Frontiers in Data Access: The Coming Wave in Data Services
PLATINUM SPONSORS:
Red Hat
The Opening of Virtualization
Intel
Virtualization – Path to Predictive Enterprise
Green Hills
IT Security in a Hostile World
JBoss / freedom oss
Practical SOA Approach
GOLD SPONSORS:
Software AG
The Art & Science of SOA: How Governance Enables Adoption
PlateSpin
Effective Planning for Virtual Infrastructure Growth
Fujitsu
Automated Business Process Discovery & Virtualization Service
Ceedo
Workspace Virtualization
Click For 2007 West
Event Webcasts

2008 East
PLATINUM SPONSORS:
Appcelerator
Think Fast: Accelerate AJAX Development with Appcelerator
GOLD SPONSORS:
DreamFace Interactive
The Ultimate Framework for Creating Personalized Web 2.0 Mashups
ICEsoft
AJAX and Social Computing for the Enterprise
Kaazing
Enterprise Comet: Real–Time, Real–Time, or Real–Time Web 2.0?
Nexaweb
Now Playing: Desktop Apps in the Browser!
Sun
jMaki as an AJAX Mashup Framework
POWER PANELS:
The Business Value
of RIAs
What Lies Beyond AJAX?
KEYNOTES:
Douglas Crockford
Can We Fix the Web?
Anthony Franco
2008: The Year of the RIA
Click For 2007 Event Webcasts
SYS-CON.TV
TODAY'S TOP SOA & WEBSERVICES LINKS


Storing XML in Relational Database Management Systems
Using the Edge table approach

When we need to store XML in a Relational Database Management System (RDBMS) we have to think of nonnative ways to store it, since the structure of XML does not fit the flat relations of the RDBMSs. Although some RDBMS vendors support XML storage and XQuery such as Oracle 10g (release 2), this support is not yet common. Some of the well-known RDBMSs are Oracle, DB2, Informix, Sybase IQ, SQL Server, MySQL, and PostgreSQL. If you have one of these products, or a similar product, you should be interested in reading this article.

There are several ways to store XML in an RDMS. One possibility is to store XML in big text columns such as CLOB (Character Large Object). Some advantages of storing XML as CLOB are: there is no preprocessing, and the originality of the XML document is preserved. Unfortunately, this method has some major disadvantages - neither XQuery nor SQL will work on the text columns. We would like to query the stored data, because reading the whole document and extracting needed parts manually is simply impractical. We would like to find ways to store XML in a database that would allow us to query the data easily. In this article I am going to show you a method that is known as shredding XML into the edge table. This is a very general method that will work on any XML document regardless of having a schema.

XML has a lot of bells and whistles, but it can be viewed simply as a tree. It's possible to represent the XML nodes as vertices of the tree and the relations between the XML nodes as the edges. Let's look at an example of how we can take an XML fragment and represent it as a tree. Figure 1 shows a simple XML document and Figure 2 shows its tree representation. The root of the XML document is the root of the tree too. Note that in XML order is very important, therefore when we create a tree of XML we have to capture the order, which is done in Figure 2. For example, the attribute isbn comes before the element title, which comes before the element authors.

Figure 2 shows the elements and attributes in ovals, and their corresponding values in rectangular boxes. The tree you see in Figure 2 is a nice representation of XML that makes it easier for us to gather the information presented in Figure 1 in a graphical way. The attribute isbn carries a flag of @, which indicates that it is an attribute. Unfortunately, storing Figure 2 as a graph into a database isn't of much use either. We will shred the XML tree into pieces, each of which can be stored in the rows of a database table. However, in order to do this we first have to number all of the vertices of the tree as shown in Figure 3. Numbering is essential for capturing the parent-child and ancestor-descendant relations, which will be used for answering XML queries.

Numbering an XML Tree
We will number the vertices in a top-down, left-to-right fashion using preorder traversal, which is numbering a vertex when we see it for the first time and then repeating this process until all vertices are numbered.

The red dashed line shows how to number the XML tree. It traverses the tree in a top-down and left-to-right fashion. Note that the root element book has the lowest number in the tree. The numbers of @isbn, title, and authors are 2, 4, and 6, which in fact capture the order of these nodes.

Edge Table
Now that we have numbered the XML tree, we can store it in the Edge table. An Edge table has id, parentid, name, value, and lastdesc attributes (columns) as shown in Figure 4.

id is the identification number of a vertex (element, attribute, text box, etc.). Identification numbers are shown in Figure 3. id is generally integer type.

parentid is the parent's identification number. For example, the parentid of the title element is 1 (same for @isbn and authors). The type integer is the usual choice.

name is the name of an element or an attribute. Text nodes (rectangular boxes) do not have names, but we name them as text for better performance. name can be char or varchar type. Data type char stands for character, and varchar stands for variable character. Element tag names and attribute names are generally short, usually not more than 50 characters.

value is the value of an element or an attribute. The content of this column is generally larger than the other columns. In most cases using varchar is the most efficient solution. For example in Figure 1, the value of the title element, <title>XML in Use</title>, has the largest character content: a total of 10 characters.

For simplicity, in this article I have been using a small XML sample (and I will continue using the same XML below). For interested readers, I have provided much larger XML samples at www.cs.umb.edu/~smimarog/xmlsample. For these examples I set the value type to varchar(550) in my database. varchar(550) is space saver compared to char(550). varchar(550) indicates that the maximum length of the characters will be no more than 550, but when it is less than 550, say 35, then the remaining 515 characters will not be wasted. However when you have the type char(550) and put content into it that has only 35 characters, unfortunately 515 characters will be wasted.

lastdesc is the number that holds the last descendant of a node. Figure 3 shows that lastdesc of book is 10, the lastdesc of title is 5, etc. We are going to use the last descendant information for answering XPath queries that contain // (double slash). Remember that // is the shortcut of descendant-or-self::node(). I will show how this is used in an example later in this article.

Parsing XML
Parsing XML for storage into an RDBMS is known as shredding, as I mentioned earlier. There are several dozens of XML parsers available, which can be categorized into two main groups: SAX (Simple API for XML) parsers, and DOM (Document Object Model) parsers. SAX is the gold standard of XML APIs. SAX implementations are fast and memory efficient. SAX is available for most of the popular programming languages such as Java, C++, Visual Basic, Python, and Perl. Java comes with a built-in SAX parser. Apache's Xerces (http://xerces.apache.org) provides a SAX parser for C++, Java, and Perl.

SAX is an event-driven API. It notifies the user of various events when parsing the XML document in top-down fashion, such as startDocument, endDocument, startElement, endElement, processingInstructions, etc. It's the programmer's responsibility to take action upon notification of these events. In Java this is done by implementing the ContentHandler interface or by extending the DefaultHandler class. SAX parsers check for well-formedness and optionally for validity. In Java the errors and the warnings are provided by the ErrorHandler interface. The parser will send a "Fatal Error" when it encounters a well-formedness error, and an "Error" for all other errors.

Let's look at one of the events closely:

public void startElement
      (String namespaceURI,
      String localName,
      String qName,
      Attributes atts)
throws SAXException

The SAX parser will invoke this event at the beginning of every element. The SAX parser provides the namespace URI, the local name, the qualified name (qName), and the list of attributes of this element. This event is accompanied with the endElement event when the parser reaches the end of this element, but before invoking the endElement the parser will provide the contents of the element in order.

About Selim Mimaroglu
Selim Mimaroglu is a PhD candidate in computer science at the University of Massachusetts in Boston. He holds an MS in computer science from that school and has a BS in electrical engineering.

YOUR FEEDBACK
XML Developer's Journal News Desk wrote: DocSoft says its new enterprise search technology, called 'Element' unleashes the power of structured data within a company's repository. With Element as a plug-n-play addition to a company's network, users will be able to search for stored data across the network 'smarter,' thus more efficiently, the company says. Element indexes XML-originated documents according to each tag or 'element,' which provides what the company calls 'context searching.' Element can even search metadata embedded into virtually any file format using Adobe's eXtensible Metadata Platform (XMP).
XML JOURNAL LATEST STORIES . . .
ISO said Friday that the appeals made by Brazil, India, South Africa and Venezuela protesting the standardization of Microsoft’s Office Open XML (OOXML) file format hadn’t gone anywhere – it was unclear whether any of them had any standing anyway – but since they “failed to g...
Red Hat CTO Brian Stevens, Citrix CTO Simon Crosby, Egenera CTO Pete Manca, Allen Stewart, Group Manager, Windows Virtualization at Microsoft, and Brian Duckering, Sr. Director of Products and Alliances at Symantec were the top industry executives who joined Jeremy Geelan in the 4th Fl...
Two of the biggest launches in Rich Internet Application history took place in 2007/2008 when Adobe launched AIR 1.0 in February '08 and Microsoft launched Silverlight (September '07). At the 6th International AJAXWorld RIA Conference & Expo in October SYS-CON Events is delighted to be...
Since its inception, XML has been criticized for the overhead it introduces into the enterprise infrastructure. Business data encoded in XML takes five to 10 times more bandwidth to transmit in the network and proportionally more disk space to store.
Vordel unveiled version 5.1 of its XML network infrastructure products, to accelerate, manage and protect XML applications. Vordel 5.1 addresses the need for lifecycle management of policy across the SOA. By combining the central management of SOA policies with distributed enforcement ...
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS
SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


SYS-CON FEATURED WHITEPAPERS


ADS BY GOOGLE
BREAKING XML NEWS
Avineon, Inc. (http://www.avineon.com), a successful provider of IT, geospatial, engineering and pro...