Hierarchy-Based Parsing
By: Mark Baker
Dec. 21, 2000 12:00 AM
OmniMark is the granddaddy of XML programming languages, having begun life as XTRAN, an SGML translation language, in 1989. Designed specifically for filtering structured data, OmniMark has a streaming programming model, a rule-based program structure, and integrated XML and SGML parsers.
XML programmers are familiar with the tree-based parsing of the DOM and the event-based parsing of SAX. OmniMark uses a third model: hierarchy-based parsing. Where a SAX parser will treat the beginning of an element as one event and the end as a separate event, OmniMark treats the occurrence of an element as a single event that fires a single rule. Since elements can contain nested elements, this leads to the creation of a hierarchy of fired rules, which is an exact model of the hierarchy of the document.
OmniMark is a rule-based language. Program execution begins in a "process" rule. Within a rule you can initiate the parsing of an XML document with the do xml-parse construct:
process
The do xml-parse construct sets up the parser for parsing, but the parser isn't started until the parse continuation operator "%c" occurs in a string. In the code fragment above, the output statement outputs "<HTML>", the parser is started by "%c", and the document is processed. When parsing is complete, "</HTML>" is output.
This code forms the base of a program hierarchy that converts an XML document to HTML. A sample XML document is shown in Listing 1. Its root element is "memo". To handle this element we write a "memo" element rule:
element "memo"
The "memo" element of the XML document corresponds to the "BODY" element of the HTML document I'm creating, so I wrap the "<BODY>" and "</BODY>" tags around the "%c". I also add "H1" markup and text appropriate to a memo: "<H1>MEMO</H1>". The "%n" in the output string creates a linefeed.
This rule is fired as a result of the parsing initiated by the "%c" in the do xml-parse statement. A stack of element rules is starting to build. As a result of the "%c" in this rule, the parser will resume parsing and fire the "header" element rule:
element "header"
This rule outputs the wrapper tags for a table I'll use to present the memo header information. Another call to "%c" fires up the parser again, resulting in the firing of another rule:
element "from"
The program now has a three-deep element stack (memo, header, from) represented by a stack of fired rules, each suspended at the point the "%c" occurs.
If you're familiar with XSLT, you'll recognize that it copies the basic processing approach of OmniMark. XSLT's "apply-templates" command is similar structurally to OmniMark's "%c". However, XSLT implementations tend to use a DOM parser, meaning the whole document is parsed and in memory before processing begins.
Notice that in discussing "%c" I've said it restarts the parser every time it's called. OmniMark doesn't parse the whole document before processing it. Instead, the parser and the program work cooperatively. At the point the "from" element is fired, the parser hasn't gotten any further through the document than the opening tag of the "from" element. (You can confirm this by using the markup datascope of the OmniMark IDE to trace the execution of the program.) Yet the program has already generated part of the output document.
This is OmniMark's streaming approach to XML processing. Neither the input nor the output is held in memory. Instead, the XML input is parsed and processed as it streams into the program, and the output is streamed out directly.
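For readers more at home in a general-purpose language, the stack of suspended rules can be approximated in Python. This is only a sketch of the idea, not OmniMark code: the standard library's xml.dom.pulldom pull parser stands in for OmniMark's integrated parser, the content() callback plays the role of "%c", and the rule functions, the RULES table, and the table markup they emit are all hypothetical names and values chosen for illustration.

```python
from xml.dom import pulldom


def parse_content(events, out, rules):
    """Play the role of "%c": resume the parser until the current element ends."""
    for event, node in events:
        if event == pulldom.START_ELEMENT:
            # One element occurrence fires one rule; the rule decides
            # where, inside its own output, parsing is resumed.
            rule = rules.get(node.tagName, default_rule)
            rule(node, lambda: parse_content(events, out, rules), out)
        elif event == pulldom.CHARACTERS:
            # Data content goes straight to the current output.
            out.append(node.data)
        elif event == pulldom.END_ELEMENT:
            # The enclosing rule resumes after its content() call.
            return


def default_rule(node, content, out):
    content()


def memo_rule(node, content, out):
    out.append("<BODY><H1>MEMO</H1>\n")
    content()
    out.append("</BODY>")


def header_rule(node, content, out):
    out.append("<TABLE>")
    content()
    out.append("</TABLE>")


def from_rule(node, content, out):
    out.append("<TR><TD>From:</TD><TD>")
    content()
    out.append("</TD></TR>")


RULES = {"memo": memo_rule, "header": header_rule, "from": from_rule}


def convert(xml_text):
    out = []  # a list keeps the sketch easy to test; a file would also work
    events = pulldom.parseString(xml_text)
    out.append("<HTML>")
    parse_content(events, out, RULES)
    out.append("</HTML>")
    return "".join(out)
```

While from_rule is running, memo_rule and header_rule are both suspended inside their content() calls, so the Python call stack mirrors the memo, header, from element stack described above.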
Streamed Data Content
The "to" and "date" rules are then fired in turn. The latter shows how OmniMark handles attributes. The attributes of the current element are collected into an associative array (a "shelf" in OmniMark parlance) where the values are the attribute values and the keys are the attribute names. You can access the attributes using the "attribute" keyword or the "%v" escape sequence, shown here:
element "date"
Reordering Data
I discussed referents in my November article "XML encoding: A Streaming Approach" (XML-J, Vol. 1, issue 6); there I used them to reorder the contents of an ASCII table. Here I'll use them to build the header of my HTML document. Most HTML pages have a "title" element in the document header. That title is displayed by browsers and search engines, making it easier to identify a document. In many cases, the information you want to place in the title of an HTML page occurs somewhere in the XML document that's your source, in this case the subject line of the memo. I need to output the header before the body of the HTML document. The perfect place is in the do xml-parse statement that outputs the "HTML" tags. But at this point the parse hasn't even started, and I certainly don't have access to the subject element of the document. The solution is to output a referent whose value I can set later. Here's a rewrite of the process rule that does this:
process
The value of the referent is set when the subject is found, which is when the subject element rule is fired:
element "subject"
The element rule itself outputs the referent, since the subject must appear here as well as in the header. It then sets the value of the referent to "%c". As I said earlier, the parser outputs all data content to the current output destination. What the set statement does is make the variable whose value it is setting the current output scope for the duration of the set statement. Thus the parser's output is redirected to the referent "subject" while the "subject" element is being parsed.
The use of referents does, of course, involve buffering output. However, this is accomplished using temporary output files rather than memory-based data structures. The use and resolution of referents typically adds less than 10% to the overall execution time.
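The referent idea, a placeholder written to the output now and given its value later, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a picture of how OmniMark implements referents: the Referent class, resolve, and build_page are hypothetical names, the heading markup is invented, and the sketch buffers output in an in-memory list where OmniMark uses temporary files.

```python
class Referent:
    """A placeholder that can be output before its value is known."""

    def __init__(self, name):
        self.name = name
        self.value = None


def resolve(chunks):
    """Join the buffered output, substituting each referent's final value."""
    parts = []
    for chunk in chunks:
        if isinstance(chunk, Referent):
            if chunk.value is None:
                raise ValueError("referent %r was never set" % chunk.name)
            parts.append(chunk.value)
        else:
            parts.append(chunk)
    return "".join(parts)


def build_page(subject_text):
    subject = Referent("subject")
    # The header is output before parsing starts, so the title is a referent.
    out = ["<HTML><HEAD><TITLE>", subject, "</TITLE></HEAD>"]
    # Later, the rule for the subject element outputs the referent again...
    out += ["<BODY><H2>", subject, "</H2>"]
    # ...and only then gives it a value.
    subject.value = subject_text
    out.append("</BODY></HTML>")
    return resolve(out)
```

Both occurrences of the placeholder receive the same text when the output is resolved, which is why the title can appear ahead of the point in the input where the subject is actually found.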
Querying the Element Context
To process an element correctly, it's sometimes necessary to know the context in which it occurs. Although OmniMark doesn't maintain the complete parse tree, it always maintains the context of the current element through the hierarchical stack of element rules. Querying this environment generally provides all the context information you need to process an element correctly. For instance, in this program I need to put "td" tags around team names (specified by the "team-name" element) when they occur in the standings table, but not when they occur in plain text. For variety I want to put bold tags around team names if they occur in the body of the memo. To accomplish this I write two "team-name" element rules, one for each of the two contexts in which a team name can appear:
element "team-name" when ancestor is "standings"
Each rule is guarded by a test. The ancestor test looks down the current element stack to see if there's an element of the specified name. If there is, the first rule is fired; if not, the second. (Figure 1 shows two slices from the element stack of the program at different times in the processing of the file. The arrow shows the "ancestor" query being made and succeeding in the first case and not in the second.)
Another case of examining the parse state to determine what to do occurs when writing the rule to handle the "person" element. In this example I'm assuming there's a set of pages related to each person in the bowling league. The people pages are located in the directory "people". The page name is the person's name with ".html" appended. I want to make each person's name a link to his or her page. The problem is that not every reference to a person contains the full proper name. The first paragraph after the standings table mentions "Fred" but doesn't spell out his full name. To fix this, the markup for the "person" element includes a "proper" attribute that provides the person's full proper name if it's not given in the text:
<person proper="Fred Flintstone">Fred</person>
When I create the link to the people pages, I have to check if this attribute has been specified for the particular "person" element I'm looking at. The test for this is simple:
do when attribute "proper" is specified
Here's the full element rule:
element "person"
Another environmental inquiry is the "occurrence" keyword, which simply returns the number of times the element has occurred consecutively. Here I use it to insert commas between the team members' names. Since I don't know in advance which occurrence is the last one, I output the commas before every occurrence except the first.
Notice here how easily information is made available in context. I don't have to search a DOM tree for the attribute information I need because I deal with the attribute in the context in which it occurs. Since I process the element in the context in which I find it, references to its properties are all local. The context of the program has already isolated the attributes of this specific element for me. All I have to do is ask for them by name: attribute "proper". The ability to deal with data in context as the parser creates that context is one of the key advantages of the hierarchical parsing method.
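The decisions the "person" rule makes can be sketched in Python. This is a hypothetical helper, not a reconstruction of the OmniMark rule: person_link and its parameters are invented names, urllib.parse.quote stands in for whatever encoding the program applies to the name, and using the element's text as the link text is an assumption.

```python
from urllib.parse import quote


def person_link(text, proper=None, occurrence=1):
    """Build the link for one "person" element.

    text       -- the element's data content, e.g. "Fred"
    proper     -- value of the "proper" attribute, or None if not specified
    occurrence -- how many times the element has occurred consecutively
    """
    # Use the "proper" attribute for the page name when it is specified;
    # otherwise fall back to the element content.
    name = proper if proper is not None else text
    # The last occurrence isn't known in advance, so the comma goes
    # before every occurrence except the first.
    prefix = ", " if occurrence > 1 else ""
    return '%s<A HREF="people/%s.html">%s</A>' % (prefix, quote(name), text)
```

For example, person_link("Fred", proper="Fred Flintstone") links to people/Fred%20Flintstone.html while still displaying "Fred".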
Processing Data Content
In the "person" element rule above I submit the content of the person element (or the "proper" attribute, if appropriate). The submit action causes its argument to be scanned by find rules (see my November article for details on OmniMark find rules). In this case, a single rule does URL encoding. URL encoding is used to ensure that illegal characters don't occur in URLs. Since spaces aren't allowed in URLs and most proper names contain spaces, I URL-encode the names:
find [any except letter or digit or "$-_.!*'(),"] => char
This rule finds all URL-illegal characters and replaces each with an escape sequence consisting of a % sign followed by two hexadecimal digits representing the character. The expression '"16ru2fzd" % binary char' formats the captured character as two hex digits. (The format string "16ru2fzd" means radix 16, uppercase, field width of 2, fill with zeros.) Unmatched characters simply stream through to output.
The fact that find and element rules don't need to specify where their output is going is a key feature of OmniMark's streaming programming model. Data processing code can be kept simple and generic. The flow of data streams to their destinations can be handled by other code in the context in which a change of destination is required.
Notice also that this processing occurs directly on the XML data as it streams. Data content isn't copied to a function for encoding. The encoding is added to the stream directly as it flows to its destination.
By the way, the output's destination in this program as it stands is the default destination of all OmniMark programs: standard output. You can direct output to a specific destination in code, or you can specify an output file on the command line with the -of command line switch (or the "output" project option in the IDE).
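The effect of this find rule can be approximated with a regular expression in Python. The sketch assumes that "letter" and "digit" mean ASCII letters and digits, and it encodes characters outside ASCII byte by byte as UTF-8, which goes beyond anything the article states; url_encode is a hypothetical name.

```python
import re

# Everything except letters, digits, and the characters $-_.!*'(),
# is treated as illegal in a URL, mirroring the character class in the rule.
_UNSAFE = re.compile(r"[^A-Za-z0-9$\-_.!*'(),]")


def _escape(match):
    # A % sign followed by two uppercase hex digits per byte.
    return "".join("%%%02X" % byte for byte in match.group().encode("utf-8"))


def url_encode(text):
    """Replace each URL-illegal character; all others pass through unchanged."""
    return _UNSAFE.sub(_escape, text)
```

A space is character 32, or 20 in hexadecimal, so url_encode("Fred Flintstone") gives "Fred%20Flintstone".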
Dealing with Entities
To fix this, the program must find these characters and replace them with their entity equivalents again. It's not necessary to submit all the data content to find rules to achieve this since OmniMark provides "translate" rules that act like "find" rules for parsed data content. The "translate" rules to do the markup escaping are:
translate "<"
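The effect of such escaping rules can be sketched in Python as a character-by-character substitution. Only the "<" rule is visible above; the entries for ">" and "&" are assumptions added for illustration, and escape_markup is a hypothetical name.

```python
# One entry per markup character to escape. Only "<" appears in the text
# above; ">" and "&" are assumed here.
ENTITIES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}


def escape_markup(text):
    """Replace markup characters in data content with their entity equivalents."""
    return "".join(ENTITIES.get(ch, ch) for ch in text)
```

Characters without an entry pass through unchanged, much as unmatched data content does in the streaming model.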
Handling White Space Problems
translate white-space* when element is "members"
This solves a problem that arises from the fact that although the "members" element doesn't contain any data content proper, the parser does see and must report the line feeds and spaces between the "<members>" tag, the first "<person>" tag, and between all the subsequent "<person>" tags. All this "white space" constitutes data content for the "members" element and is duly streamed to output. This causes a problem because I'm trying to put commas between the names of the team members, and HTML takes any white space between characters and represents it as a space, thus giving me a string that looks like this on screen:
Fred Flintstone , Barney Rubble , Bam Bam Rubble , Dino
To get rid of the excess spaces, I must suppress the data content of the "members" element. I can do this using the suppress keyword in place of "%c", but that would suppress the content of its child elements as well. So I get rid of the white space by matching it in a translate rule and outputting nothing in its place. The final result, as it appears in the browser, can be seen in Figure 2.
Handling Errors
OmniMark validates as it parses. If it finds an error in the XML stream it fires a "markup-error" rule. The programmer can then handle the error in the appropriate way. The markup-error rule in this program just reports the error and halts the program:
markup-error
   put #error "Markup error: "
           || #message
           || " on line "
           || "d" % #line-number
           || ".%n"
   halt
Validation
To validate the document against a DTD, change the line
do xml-parse instance scan file #args[1]
to read as follows:
do xml-parse document scan file #args[1]
The "document" keyword activates DTD validation. Of course, you must now supply a DTD as part of the input. Any validation failures will be reported as before by firing the markup-error rule.
Utility of the Hierarchical Model
OmniMark's architecture was designed to provide an easy-to-use processing model for structured data and to support the requirements of long-document publishing for high-performance, low-overhead translation of massive data sets. The language is specifically designed to minimize data copying and memory usage and provide a high degree of scalability.
The model also simplifies coding of XML processing applications. Although the streaming approach may take a little getting used to for programmers taught to think in terms of data structures, writing code that responds in context to the data as it's processed is a clean and simple way to program XML applications. It lends itself to code that's process-oriented; that is, code that tends to describe the process the program implements in a way that's clear and easy to read.
The key feature of the streaming programming model and the hierarchical approach to parsing is that the execution state of a program models the hierarchy of the data, thus providing a natural context in which to interpret and respond to the data. It's also easy to access information on the current context, precisely because the information is in context and thus easy to address.
It's notable that the entire program presented in this article (the full final program is in Listing 2) contains only a single variable: the local stream variable "name" in the "person" element rule. Apart from this, the entire document is processed without the use of a single data structure.