Hierarchy-Based Parsing

OmniMark is the granddaddy of XML programming languages, having begun life as XTRAN, an SGML translation language, in 1989. Designed specifically for filtering structured data, OmniMark has a streaming programming model, a rule-based program structure, and integrated XML and SGML parsers.

XML programmers are familiar with the tree-based parsing of the DOM and the event-based parsing of SAX. OmniMark uses a third model: hierarchy-based parsing. Where a SAX parser will treat the beginning of an element as one event and the end as a separate event, OmniMark treats the occurrence of an element as a single event that fires a single rule. Since elements can contain nested elements, this leads to the creation of a hierarchy of fired rules, which is an exact model of the hierarchy of the document.

OmniMark is a rule-based language. Program execution begins in a "process" rule. Within a rule you can initiate the parsing of an XML document with the do xml-parse construct:

process
do xml-parse instance scan file #args[1]
output "<HTML>%c</HTML>"
done

The do xml-parse construct sets up the parser for parsing, but the parser isn't started until the parse continuation operator "%c" occurs in a string. In the code fragment above, the output statement outputs "<HTML>", the parser is started by "%c", and the document is processed. When parsing is complete, "</HTML>" is output. This code forms the base of a program hierarchy that converts an XML document to HTML.

A sample XML document is shown in Listing 1. Its root element is "memo". To handle this element we write a "memo" element rule:

element "memo"
output "<BODY>%n<H1>MEMO</H1>%n%c</BODY>"

The "memo" element of the XML document corresponds to the "BODY" element of the HTML document I'm creating, so I wrap the "<BODY>" and "</BODY>" tags around the "%c". I also add "H1" markup and text appropriate to a memo: "<H1>MEMO</H1>". The "%n" in the output string creates a linefeed.

This rule is fired as a result of the parsing initiated by the "%c" in the do xml-parse statement. A stack of element rules is starting to build. As a result of the "%c" in this rule, the parser will resume parsing and fire the "header" element rule:

element "header"
output "<table>%c</table>"

This rule outputs the wrapper tags for a table I'll use to present the memo header information. Another call to "%c" fires up the parser again, resulting in the firing of another rule:

element "from"
output "<tr><td><b>From:</b></td><td>%c</td></tr>"

The program now has a three-deep element stack (memo, header, from) represented by a stack of fired rules, each suspended at the point the "%c" occurs.
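
For readers more at home in a general-purpose language, the rule stack can be imitated in Python. The following is only a rough sketch, not how OmniMark is implemented: it assumes the standard library's xml.dom.pulldom as the parser, and the names transform, content, emit, and passthrough are invented for illustration. Each rule is a function that calls content() where the OmniMark rule has "%c", so the Python call stack plays the part of the stack of suspended rules. Note that pulldom.parseString reads its input string in one go, so the sketch mirrors the control flow rather than the memory behavior.

```python
from xml.dom import pulldom


def transform(xml_text, rules):
    """Run element rules over xml_text; rules maps element names to functions."""
    events = iter(pulldom.parseString(xml_text))
    out = []

    def passthrough(node, emit, content):
        # Fallback for elements without a rule: just process the content.
        content()

    def content():
        # Stand-in for "%c": resume the parser until the current element ends.
        for event, node in events:
            if event == pulldom.START_ELEMENT:
                rule = rules.get(node.tagName, passthrough)
                rule(node, out.append, content)  # nested rule, one level deeper
            elif event == pulldom.CHARACTERS:
                out.append(node.data)  # data content goes straight to output
            elif event == pulldom.END_ELEMENT:
                return  # the suspended rule that called content() resumes

    content()
    return "".join(out)


# Every rule must call content() exactly once, as a rule has one "%c".
def memo(node, emit, content):
    emit("<BODY>\n<H1>MEMO</H1>\n")
    content()
    emit("</BODY>")


def header(node, emit, content):
    emit("<table>")
    content()
    emit("</table>")


def from_(node, emit, content):
    emit("<tr><td><b>From:</b></td><td>")
    content()
    emit("</td></tr>")


RULES = {"memo": memo, "header": header, "from": from_}
SAMPLE = "<memo><header><from>Barney Rubble</from></header></memo>"
result = transform(SAMPLE, RULES)
```

When the "from" rule is running, the calls to memo, header, and from_ are all suspended inside content(), which is the three-deep stack described above.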

If you're familiar with XSLT, you'll recognize that it copies the basic processing approach of OmniMark. XSLT's "apply-templates" command is similar structurally to OmniMark's "%c". However, XSLT implementations tend to use a DOM parser, meaning the whole document is parsed and in memory before processing begins.

Notice that in discussing "%c" I've said it restarts the parser every time it's called. OmniMark doesn't parse the whole document before processing it. Instead, the parser and the program work cooperatively. At the point the "from" element is fired, the parser hasn't gotten any further through the document than the opening tag of the "from" element. (You can confirm this by using the markup datascope of the OmniMark IDE to trace the execution of the program.) Yet the program has already generated part of the output document.

This is OmniMark's streaming approach to XML processing. Neither the input nor the output is held in memory. Instead, the XML input is parsed and processed as it streams into the program, and the output is streamed out directly.

Streamed Data Content
What happens to data content in this parsing model? The "from" element doesn't contain any other elements, only data content. When the parser is kicked into motion by the "%c" in the "from" element rule, it simply streams the data content to the program's current output: the same place the output generated by the program is going. As a result of the execution of this rule, the following occurs:

  1. The program outputs "<tr><td><b>From:</b></td><td>"
  2. The parser outputs "Barney Rubble"
  3. The program outputs "</td></tr>"

As soon as the parser outputs the data content, the parsing of the "from" element is complete, "%c" returns, and the "from" element rule is allowed to finish, popping one level off the element rule stack.

The "to" and "date" rules are then fired in turn. The latter shows how OmniMark handles attributes. The attributes of the current element are collected into an associative array (a "shelf" in OmniMark parlance) where the values are the attribute values and the keys are the attribute names. You can access the attributes using the "attribute" keyword or the "%v" escape sequence, shown here:

element "date"
output "<tr><td><b>Date:</b></td><td>"
|| "%v(year)-%v(month)-%v(day)%c</td></tr>"
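
As a point of comparison, here is a small Python sketch of the same idea: the attributes of the current element are read by name at the moment the element is encountered. It assumes the standard library's xml.dom.pulldom; the function name date_row and the sample attribute values are invented for illustration.

```python
from xml.dom import pulldom


def date_row(xml_text):
    """Build the table row for the first "date" element found in xml_text."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "date":
            # The attributes are already attached to the current element.
            return "<tr><td><b>Date:</b></td><td>%s-%s-%s</td></tr>" % (
                node.getAttribute("year"),
                node.getAttribute("month"),
                node.getAttribute("day"),
            )
    return ""


# Hypothetical sample values, not taken from Listing 1.
row = date_row('<date year="2000" month="01" day="15"/>')
```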

Reordering Data
A common reason for choosing a tree-based parser is the frequent need to output data in a different order than it occurs in the XML document. A DOM tree turns a linear data stream into a random access data structure that allows you to access any part of the document. The disadvantage, though, is that the whole document must be in memory - an expensive proposition if you're processing a lot of data. OmniMark provides a mechanism for reordering data without resorting to building the whole tree. It's called referents.

I discussed referents in my November article "XML encoding: A Streaming Approach" (XML-J, Vol. 1, issue 6); there I used them to reorder the contents of an ASCII table. Here I'll use them to build the header of my HTML document. Most HTML pages have a "title" element in the document header. That title is displayed by browsers and search engines, making it easier to identify a document. In many cases, the information you want to place in the title of an HTML page occurs somewhere in the XML document that's your source, in this case the subject line of the memo.

I need to output the header before the body of the HTML document. The perfect place is in the do xml-parse statement that outputs the "HTML" tags. But at this point the parse hasn't even started, and I certainly don't have access to the subject element of the document. The solution is to output a referent whose value I can set later. Here's a rewrite of the process rule that does this:

process
do xml-parse instance scan file #args[1]
output "<HTML>"
|| "<HEAD><TITLE>"
|| referent "subject"
|| "</TITLE></HEAD>"
|| "%c</HTML>"
done

The value of the referent is set when the subject is found, which is when the subject element rule is fired:

element "subject"
output "<tr><td><b>Subject:</b></td><td>"
|| referent "subject"
|| "</td></tr>"
set referent "subject" to "%c"

The element rule itself outputs the referent, since the subject must appear here as well as in the header. It then sets the value of the referent to "%c". As I said earlier, the parser outputs all data content to the current output destination. What the set statement does is make the variable whose value it's setting the current output scope for the duration of the set statement. Thus the parser's output is redirected to the referent "subject" while the "subject" element is being parsed.

The use of referents does, of course, involve buffering output. However, this is accomplished using temporary output files rather than memory-based data structures. The use and resolution of referents typically adds less than 10% to the overall execution time.
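
The referent idea can be illustrated in Python as a placeholder that's written into the output early and filled in later. This is a minimal sketch under stated assumptions: it keeps the pending output in an in-memory list, whereas OmniMark buffers in temporary files; the names Referent and resolve are invented; and the subject text is made-up sample data.

```python
class Referent:
    """Placeholder written to the output before its value is known."""

    def __init__(self, name):
        self.name = name


def resolve(parts, values):
    """Replace every placeholder with its final value and join the output."""
    return "".join(
        values[part.name] if isinstance(part, Referent) else part
        for part in parts
    )


# The header is written before parsing starts, with the title left open.
parts = ["<HTML><HEAD><TITLE>", Referent("subject"), "</TITLE></HEAD>"]
values = {}

# Later, when the "subject" element is reached, the value becomes known.
parts += ["<tr><td><b>Subject:</b></td><td>", Referent("subject"), "</td></tr>"]
values["subject"] = "Bowling League Standings"  # hypothetical subject line

parts.append("</HTML>")
html_out = resolve(parts, values)
```

Both uses of the placeholder resolve to the same text, although only the second one is written after the value is available.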

Querying the Element Context
At the end of the "subject" element the parser encounters the end of the "header" element, collapsing the element stack down to the "memo" level. Thus the header rule is allowed to complete and the "</table>" markup is output.

To process an element correctly, it's sometimes necessary to know the context in which it occurs. Although OmniMark doesn't maintain the complete parse tree, it always maintains the context of the current element through the hierarchical stack of element rules. Querying this environment generally provides all the context information you need to process an element correctly.

For instance, in this program I need to put "td" tags around team names (specified by the "team-name" element) when they occur in the standings table, but not when they occur in plain text. For variety I want to put bold tags around team names if they occur in the body of the memo. To accomplish this I write two "team-name" element rules, one for each of the two contexts in which a team name can appear:

element "team-name" when ancestor is "standings"
output "<td>%c</td>"
element "team-name" when ancestor isnt "standings"
output "<b>%c</b>"

Each rule is guarded by a test. The ancestor test looks down the current element stack to see if there's an element of the specified name. If there is, the first rule is fired; if not, the second. (Figure 1 shows two slices from the element stack of the program at different times in the processing of the file. The arrow shows the "ancestor" query being made and succeeding in the first case and not in the second.)
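
The guard can be pictured in Python as a lookup in an explicit stack of open elements. The sketch below is an illustration only: it assumes the standard library's xml.dom.pulldom, the function name render, the "p" element, and the team name are all invented, and elements other than "team-name" are simply passed through without markup.

```python
from xml.dom import pulldom


def render(xml_text):
    """Wrap team names in <td> inside "standings" and in <b> elsewhere."""
    out, stack, closers = [], [], []
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            if node.tagName == "team-name":
                # The "ancestor" query: is "standings" open above this element?
                tag = "td" if "standings" in stack else "b"
                out.append("<%s>" % tag)
                closers.append("</%s>" % tag)
            else:
                closers.append("")  # no markup for other elements here
            stack.append(node.tagName)
        elif event == pulldom.END_ELEMENT:
            stack.pop()
            out.append(closers.pop())
        elif event == pulldom.CHARACTERS:
            out.append(node.data)
    return "".join(out)


# Hypothetical input: the same element in two different contexts.
SAMPLE = (
    "<memo><standings><team-name>Rockheads</team-name></standings>"
    "<p><team-name>Rockheads</team-name> won</p></memo>"
)
rendered = render(SAMPLE)
```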

Another case of examining the parse state to determine what to do occurs when writing the rule to handle the "person" element. In this example I'm assuming there's a set of pages related to each person in the bowling league. The people pages are located in the directory "people". The page name is the person's name with ".html" appended. I want to make each person's name a link to his or her page.

The problem is that not every reference to a person contains the full proper name. The first paragraph after the standings table mentions "Fred" but doesn't spell out his full name. To fix this, the markup for the "person" element includes a "proper" attribute that provides the person's full proper name if it's not given in the text:

<person proper="Fred Flintstone">Fred</person>

When I create the link to the people pages, I have to check if this attribute has been specified for the particular "person" element I'm looking at. The test for this is simple:

do when attribute "proper" is specified

Here's the full element rule:

element "person"
local stream name
set name to "%c"
output ", " when occurrence > 1
output '<A HREF="people/'
do when attribute "proper" is specified
submit attribute "proper"
else
submit name
done
output '">'
|| name
|| "</A>"

Another environmental inquiry is the "occurrence" keyword that simply returns the number of times the element has occurred consecutively. Here I use it to insert commas between the team members' names. Since I don't know in advance which occurrence is the last one, I output the commas before every occurrence except the first.

Notice here how easily information is made available in context. I don't have to search a DOM tree for the attribute information I need because I deal with the attribute in the context in which it occurs. Since I process the element in the context in which I find it, references to its properties are all local. The context of the program has already isolated the attributes of this specific element for me. All I have to do is ask for them by name: attribute "proper". The ability to deal with data in context as the parser creates that context is one of the key advantages of the hierarchical parsing method.

Processing Data Content
I've said that the parser sends data content directly to output. However, this doesn't mean you can't process the data content if you need to. In this program I process the data content in two ways to achieve necessary encodings of the data being output.

In the "person" element rule above I submit the content of the person element (or the "proper" attribute, if appropriate). The submit action causes its argument to be scanned by find rules (see my November article for details on OmniMark find rules). In this case, a single rule does URL encoding. URL encoding is used to ensure that illegal characters don't occur in URLs. Since spaces aren't allowed in URLs and most proper names contain spaces, I URL-encode the names:

find [any except letter or digit or "$-_.!*'(),"] => char
output "%%" || "16ru2fzd" % binary char

This rule finds all URL-illegal characters and replaces each with an escape sequence consisting of a % sign followed by two hexadecimal digits representing the character's value. The expression '"16ru2fzd" % binary char' formats the captured character as two hex digits. (The format string "16ru2fzd" means radix 16, uppercase, field width of 2, fill with zeros.) Unmatched characters simply stream through to output.
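
For comparison, the same encoding can be sketched in Python. This is an illustration rather than a translation of the find rule: it assumes that "letter" means the ASCII letters, and it encodes anything else as UTF-8 bytes before formatting, which is a choice made here and not something the OmniMark rule specifies. The names SAFE and url_encode are invented.

```python
import string

# Characters passed through unchanged: letters, digits, and $-_.!*'(),
SAFE = set(string.ascii_letters + string.digits + "$-_.!*'(),")


def url_encode(text):
    """Replace every character outside SAFE with %XX escapes."""
    out = []
    for ch in text:
        if ch in SAFE:
            out.append(ch)  # unmatched characters stream through
        else:
            # Two uppercase hex digits per byte, zero-filled to width 2.
            out.extend("%%%02X" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


encoded = url_encode("Fred Flintstone")
```

The space in "Fred Flintstone" has the value 32, which is 20 in hexadecimal, so it comes out as "%20".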

The fact that find and element rules don't need to specify where their output is going is a key feature of OmniMark's streaming programming model. Data processing code can be kept simple and generic. The flow of data streams to their destinations can be handled by other code in the context in which a change of destination is required.

Notice also that this processing occurs directly on the XML data as it streams. Data content isn't copied to a function for encoding. The encoding is added to the stream directly as it flows to its destination.

By the way, the output's destination in this program as it stands is the default destination of all OmniMark programs - standard output. You can direct output to a specific destination in code, or you can specify an output file on the command line with the -of command line switch (or the "output" project option in the IDE).

Dealing with Entities
Another form of data content processing is required because the output created by the program is HTML, which, as an SGML/XML language, requires every occurrence of the markup characters "<", ">", and "&" to be escaped with the text entities "&lt;", "&gt;", and "&amp;". Of course, those characters were escaped exactly like this in the original XML document, but part of the parser's job is to resolve the entities and replace them with their text equivalents. Therefore the data content streaming from the parser may contain unescaped markup characters.

To fix this, the program must find these characters and replace them with their entity equivalents again. It's not necessary to submit all the data content to find rules to achieve this since OmniMark provides "translate" rules that act like "find" rules for parsed data content. The "translate" rules to do the markup escaping are:

translate "<"
output "&lt;"
translate ">"
output "&gt;"
translate "&"
output "&amp;"
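
The effect of these three rules can be checked against a Python equivalent. The sketch below uses html.escape from the standard library with quote=False, so that only "<", ">", and "&" are replaced; the function name escape_markup and the sample text are invented for illustration.

```python
import html


def escape_markup(text):
    """Re-escape the three markup characters in parsed data content."""
    return html.escape(text, quote=False)


escaped = escape_markup("Rubble & Sons <est. 1999>")
```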

Handling White Space Problems
One other translate rule in the program is active only for a single element (once again, processing is controlled by testing the current element context):

translate white-space* when element is "members"

This solves a problem that arises from the fact that although the "members" element doesn't contain any data content proper, the parser does see and must report the line feeds and spaces between the "<members>" tag, the first "<person>" tag, and between all the subsequent "<person>" tags. All this "white space" constitutes data content for the "members" element and is duly streamed to output.

This causes a problem because I'm trying to put commas between the names of the team members, and HTML takes any white space between characters and represents it as a space, thus giving me a string that looks like this on screen:

Fred Flintstone , Barney Rubble , Bam Bam Rubble , Dino

To get rid of the excess spaces, I must suppress the data content of the "members" element. I can do this using the suppress keyword in place of "%c", but that would suppress the content of its child elements as well. So I get rid of the white space by matching it in a translate rule and outputting nothing in its place. The final result, as it appears in the browser, can be seen in Figure 2.
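
The combination of the comma logic and the white space rule can be sketched in Python as follows. It's a simplified illustration that assumes the standard library's xml.dom.pulldom: the function name member_list is invented, and the occurrence counter here counts every "person" element in the input rather than consecutive occurrences within one parent.

```python
from xml.dom import pulldom


def member_list(xml_text):
    """Join person names with commas, dropping white space between them."""
    out = []
    occurrence = 0
    inside_person = False
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "person":
            occurrence += 1
            if occurrence > 1:
                out.append(", ")  # comma before every name except the first
            inside_person = True
        elif event == pulldom.END_ELEMENT and node.tagName == "person":
            inside_person = False
        elif event == pulldom.CHARACTERS:
            # Keep names; discard white-space-only content between the tags.
            if inside_person or node.data.strip():
                out.append(node.data)
    return "".join(out)


SAMPLE = (
    "<members>\n"
    "  <person>Fred Flintstone</person>\n"
    "  <person>Barney Rubble</person>\n"
    "</members>"
)
names = member_list(SAMPLE)
```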

Handling Errors
The program runs through to the end of the data, building up or tearing down the element stack in response to the element hierarchy of the data. But what happens if there are errors in the XML?

OmniMark validates as it parses. If it finds an error in the XML stream it fires a "markup-error" rule. The programmer can then handle the error in the appropriate way. The markup-error rule in this program just reports the error and halts the program:

markup-error
put #error "Markup error: "
|| #message
|| " on line "
|| "d" % #line-number
|| ".%n"
halt
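
A comparable report-and-stop handler in Python might look like the sketch below. It assumes the standard library's xml.dom.pulldom, which checks well-formedness only, and the function name report_markup_error is invented. The exact wording of the message comes from the underlying parser, so it isn't shown here.

```python
import xml.sax
from xml.dom import pulldom


def report_markup_error(xml_text):
    """Return an error report for malformed input, or None if it parses."""
    try:
        for _event, _node in pulldom.parseString(xml_text):
            pass  # normal processing of each event would go here
    except xml.sax.SAXParseException as err:
        # Stop at the first error and report its message and line number.
        return "Markup error: %s on line %d." % (
            err.getMessage(),
            err.getLineNumber(),
        )
    return None


ok = report_markup_error("<a><b></b></a>")
bad = report_markup_error("<a>\n<b></a>")
```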

Validation
The same procedure is used when you validate against a DTD. Validation occurs interactively as the document is parsed and processed. All that's needed to turn this program from well-formed to validating parsing is to change the line:

do xml-parse instance scan file #args[1]

to read as follows:

do xml-parse document scan file #args[1]

The "document" keyword activates DTD validation. Of course, you must now supply a DTD as part of the input. Any validation failures will be reported as before by firing the markup-error rule.

Utility of the Hierarchical Model
I've presented a simple processing example in this article, and simple examples tend to be simple in all languages. Where do you experience the real difference between OmniMark's hierarchical model and streaming approach, and the tree- and event-based models of DOM and SAX?

OmniMark's architecture was designed to provide an easy-to-use processing model for structured data and to support the requirements of long-document publishing for high-performance, low-overhead translation of massive data sets. The language is specifically designed to minimize data copying and memory usage and provide a high degree of scalability.

The model also simplifies coding of XML processing applications. Although the streaming approach may take a little getting used to for programmers taught to think in terms of data structures, writing code that responds in context to the data as it's processed is a clean and simple way to program XML applications. It lends itself to code that's process-oriented; that is, code that tends to describe the process the program implements in a way that's clear and easy to read.

The key feature of the streaming programming model and the hierarchical approach to parsing is that the execution state of a program models the hierarchy of the data, thus providing a natural context in which to interpret and respond to the data. It's also easy to access information on the current context, precisely because the information is in context and thus easy to address. It's notable that the entire program presented in this article (the full final program is in Listing 2) contains only a single variable: the local stream variable "name" in the "person" element rule. Apart from this, the entire document is processed without the use of a single data structure.
