|
YOUR FEEDBACK
Did you read today's front page stories & breaking news?
SYS-CON.TV |
TODAY'S TOP SOA & WEBSERVICES LINKS Data Management Storing XML in Relational Database Management Systems
Using the Edge table approach
By: Selim Mimaroglu
Mar. 6, 2006 04:00 PM
When we need to store XML in a Relational Database Management System (RDBMS) we have to think of nonnative ways to store it, since the structure of XML does not fit the flat relations of the RDBMSs. Although some RDBMS vendors support XML storage and XQuery such as Oracle 10g (release 2), this support is not yet common. Some of the well-known RDBMSs are Oracle, DB2, Informix, Sybase IQ, SQL Server, MySQL, and PostgreSQL. If you have one of these products, or a similar product, you should be interested in reading this article.
XML has a lot of bells and whistles, but it can be viewed simply as a tree. It's possible to represent the XML nodes as vertices of the tree and the relations between the XML nodes as the edges. Let's look at an example of how we can take an XML fragment and represent it as a tree. Figure 1 shows a simple XML document and Figure 2 shows its tree representation. The root of the XML document is the root of the tree too. Note that in XML order is very important, therefore when we create a tree of XML we have to capture the order, which is done in Figure 2. For example, the attribute isbn comes before the element title, which comes before the element authors. Figure 2 shows the elements and attributes in ovals, and their corresponding values in rectangular boxes. The tree you see in Figure 2 is a nice representation of XML that makes it easier for us to gather the information presented in Figure 1 in a graphical way. The attribute isbn carries a flag of @, which indicates that it is an attribute. Unfortunately, storing Figure 2 as a graph into a database isn't of much use either. We will shred the XML tree into pieces, each of which can be stored in the rows of a database table. However, in order to do this we first have to number all of the vertices of the tree as shown in Figure 3. Numbering is essential for capturing the parent-child and ancestor-descendant relations, which will be used for answering XML queries.
Numbering an XML Tree The red dashed line shows how to number the XML tree. It traverses the tree in a top-down and left-to-right fashion. Note that the root element book has the lowest number in the tree. The numbers of @isbn, title, and authors are 2, 4, and 6, which in fact capture the order of these nodes.
Edge Table id is the identification number of a vertex (element, attribute, text box, etc.). Identification numbers are shown in Figure 3. id is generally integer type. parentid is the parent's identification number. For example, the parentid of the title element is 1 (same for @isbn and authors). The type integer is the usual choice. name is the name of an element or an attribute. Text nodes (rectangular boxes) do not have names, but we name them as text for better performance. name can be char or varchar type. Data type char stands for character, and varchar stands for variable character. Element tag names and attribute names are generally short, usually not more than 50 characters. value is the value of an element or an attribute. The content of this column is generally larger than the other columns. In most cases using varchar is the most efficient solution. For example in Figure 1, the value of the title element, <title>XML in Use</title>, has the largest character content: a total of 10 characters. For simplicity, in this article I have been using a small XML sample (and I will continue using the same XML below). For interested readers, I have provided much larger XML samples at www.cs.umb.edu/~smimarog/xmlsample. For these examples I set the value type to varchar(550) in my database. varchar(550) is space saver compared to char(550). varchar(550) indicates that the maximum length of the characters will be no more than 550, but when it is less than 550, say 35, then the remaining 515 characters will not be wasted. However when you have the type char(550) and put content into it that has only 35 characters, unfortunately 515 characters will be wasted. lastdesc is the number that holds the last descendant of a node. Figure 3 shows that lastdesc of book is 10, the lastdesc of title is 5, etc. We are going to use the last descendant information for answering XPath queries that contain // (double slash). Remember that // is the shortcut of descendant-or-self::node(). I will show how this is used in an example later in this article.
Parsing XML SAX is an event-driven API. It notifies the user of various events when parsing the XML document in top-down fashion, such as startDocument, endDocument, startElement, endElement, processingInstructions, etc. It's the programmer's responsibility to take action upon notification of these events. In Java this is done by implementing the ContentHandler interface or by extending the DefaultHandler class. SAX parsers check for well-formedness and optionally for validity. In Java the errors and the warnings are provided by the ErrorHandler interface. The parser will send a "Fatal Error" when it encounters a well-formedness error, and an "Error" for all other errors. Let's look at one of the events closely:
public void startElement The SAX parser will invoke this event at the beginning of every element. The SAX parser provides the namespace URI, the local name, the qualified name (qName), and the list of attributes of this element. This event is accompanied with the endElement event when the parser reaches the end of this element, but before invoking the endElement the parser will provide the contents of the element in order. YOUR FEEDBACK
XML JOURNAL LATEST STORIES . . .
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
|
SYS-CON FEATURED WHITEPAPERS MOST READ THIS WEEK BREAKING XML NEWS
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||