Replace DTDs? Why?

Of all the standards to accompany XML that are currently in progress at the W3C, few are more anxiously awaited than the Schema standard - the specification that provides an alternative to XML 1.0 DTDs as a way to describe a document's structure. But what's wrong with XML 1.0 DTDs? How many alternatives have been proposed, and by whom? Why didn't the W3C address these concerns in the original XML 1.0 specification instead of waiting until now? I'll answer those questions in this column, and in my next column we'll look at the current state of the W3C Schema Working Group's unfinished proposal.

What Can They Do?
Just as a compiler can process the source code of a particular programming language more effectively if the program's data structures are declared up front, an XML processor is more efficient if it knows what kind of data structures to expect before it begins reading a document. XML 1.0 DTDs - which I will hereafter refer to as DTDs, although technically schemas express DTDs as well - can have five or six kinds of declarations, depending on whether you consider comments to be declarations (the XML spec is vague on this point):

  • Element type declarations: An element type is a named class of elements, such as h1, img or p in HTML or para or listitem in the DocBook DTD.
  • Attribute list declarations: An attribute list declaration lists the attributes for a given element type. The attribute list for HTML's img element type includes the src, alt and align attributes.
  • Entity declarations: Entities name collections of information that a DTD or document can reuse elsewhere. An entity may represent a single character of text, a string of text or a complete file sitting outside the DTD.
  • Notation declarations: When a DTD declares an external non-XML, or "unparsed" entity, it must identify the entity's format. A notation declaration tells the processor: "Here's a legal format for this document type's unparsed entities."
  • Comments: These look just as they do in HTML: <!-- like this -->. This is information for the parser to ignore.
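
To make the declaration kinds above concrete, here is a minimal sketch that feeds a tiny document with an internal DTD subset to Python's bundled expat parser and records each declaration it reports. The `order` document type, its element and entity names, and the `list_declarations` helper are all invented for illustration; expat only reports the declarations here, it does not validate the document against them.

```python
from xml.parsers import expat

# Hypothetical document type, made up for this example.
DOC = """<!DOCTYPE order [
  <!-- Comment: information for the parser to ignore -->
  <!ELEMENT order (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item sku CDATA #REQUIRED>
  <!ENTITY vendor "Example Corp.">
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
]>
<order><item sku="A1">&vendor;</item></order>"""


def list_declarations(xml_text):
    """Return (kind, name) pairs for every declaration expat reports."""
    found = []

    def on_element(name, model):
        found.append(("element", name))

    def on_attlist(elname, attname, type_, default, required):
        found.append(("attribute", f"{elname}/@{attname}"))

    def on_entity(name, is_param, value, base, system_id, public_id, notation):
        # notation is set only for unparsed (non-XML) entities such as "logo"
        found.append(("entity", name))

    def on_notation(name, base, system_id, public_id):
        found.append(("notation", name))

    def on_comment(data):
        found.append(("comment", data.strip()))

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element
    parser.AttlistDeclHandler = on_attlist
    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.CommentHandler = on_comment
    parser.Parse(xml_text, True)
    return found


for kind, name in list_declarations(DOC):
    print(kind, name)
```

Note that the declarations arrive through five separate callbacks, none of which look like the events an application receives for the document's own elements.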

    What's wrong with these?

    Weak Data Typing
    The most common complaint about XML from people who come to it from the database and programming worlds (as opposed to those coming from the SGML publishing and HTML Web design worlds) is the lack of data typing. When these developers declare or define a named piece of information - for example, a field in a database or a variable in some Java or C++ code - they're accustomed to naming its type and then assuming that the processing engine underneath their application will ensure that any information stuffed into that slot conforms to that type. Once they declare an XML Quantity or RetailPrice element type, they don't want to write extra application code to ensure that the strings between the start and end tags really are integers and currency figures. Extra error-checking code isn't just annoying to write; it adds fat to the thin clients that XML is supposed to be so great for.

    This wasn't a big deal in the SGML world because nearly every application was a publishing application. With XML's popularity in e-commerce development, data values like quantities and especially prices become more important. Although XML 1.0 offers a few types that help constrain attribute values, classic types such as integers, real numbers, Booleans and dates aren't among the choices, and application developers need them for element content as well as attribute values.
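
A short sketch of the extra error-checking code in question. The `Quantity` and `RetailPrice` element names come from the discussion above; the `Order` wrapper element, the `type_errors` helper and the choice of `int` and `Decimal` as the intended types are assumptions made for the example. Because a DTD can't express these types, the checks have to live in the application:

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation


def type_errors(xml_text):
    """Return messages for element content that fails hand-written type checks."""
    errors = []
    root = ET.fromstring(xml_text)

    # The parser hands back plain strings, so the application converts them itself.
    for el in root.iter("Quantity"):
        text = (el.text or "").strip()
        try:
            int(text)
        except ValueError:
            errors.append(f"Quantity is not an integer: {text!r}")

    for el in root.iter("RetailPrice"):
        text = (el.text or "").strip()
        try:
            Decimal(text)
        except InvalidOperation:
            errors.append(f"RetailPrice is not a decimal number: {text!r}")

    return errors


good = "<Order><Quantity>3</Quantity><RetailPrice>19.95</RetailPrice></Order>"
bad = "<Order><Quantity>three</Quantity><RetailPrice>cheap</RetailPrice></Order>"
print(type_errors(good))
print(type_errors(bad))
```

Both documents are equally well-formed, and a DTD that declares the two elements as character data would accept either one. Only the application code tells them apart.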

    Document Structure Not Stored in an XML Document
    DTD declarations have their own syntax that, despite using the "< >" angle brackets, is quite different from XML document syntax. Many newcomers to XML ask why XML isn't used to represent the structure of its own documents. The original answer was that XML was designed to be completely compatible with SGML, which had a larger base of applications and tools than most new XML users realize. These applications and tools played a big role in XML's initial jumpstart.

    Since then, a revision to the SGML standard allows for legal SGML documents without the DTD declarations used to specify document structure - that is, to have what the XML world calls "well-formed documents." If an XML document with no DTD can still be a legal SGML document, then the primary reason for using SGML DTD syntax no longer applies.

    Another argument against specifying DTD structure with XML elements was that it would be confusing to include elements that describe other elements right in there with the elements that they describe. As it turned out, no one does this anyway; schema documents are always kept separate from the documents they describe, and documents point to their schemas with a processing instruction, a namespace declaration or some other mechanism.

    Using XML elements to describe document structures has several benefits. It makes these structures much easier to develop because you can use any XML editor to edit and manipulate them - and I mean any XML editor, even the lame ones that merely dump your document to a visual tree and then write that tree back out when you save your document. (Paragraphs of text like the ones you're reading here are very cumbersome to edit on such an editor, but a schema document is naturally treelike.) Application development is also easier for documents whose structure is stored in a well-formed XML document, because applications have easier access to information about document structure. SAX and DOM, the two current XML API standards, offer very little to an application that wants to check DTD information such as an attribute's declared type or whether a particular element is optional. With document structure definitions stored in a DOM tree or triggering the same SAX events that the document's elements trigger, an application can find out all it wants about that structure.
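
As a rough illustration of that last point, the sketch below stores a structure definition as an ordinary XML document and queries it with the same tree API used for any other document. The `schema`, `elementType`, `attribute` and `child` vocabulary is invented for this example and is not the syntax of any of the proposals discussed in this column; the two helper functions are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-based structure definition (not a real schema language).
SCHEMA = """<schema>
  <elementType name="item">
    <attribute name="sku" type="string" required="yes"/>
    <attribute name="qty" type="integer" required="no"/>
    <child ref="note" optional="yes"/>
  </elementType>
  <elementType name="note"/>
</schema>"""


def declared_type(schema, element, attribute):
    """Return the declared type of an attribute, or None if it is not declared."""
    node = schema.find(
        f"elementType[@name='{element}']/attribute[@name='{attribute}']"
    )
    return None if node is None else node.get("type")


def is_optional(schema, parent, child):
    """Return True if the child element is declared optional inside the parent."""
    node = schema.find(f"elementType[@name='{parent}']/child[@ref='{child}']")
    return node is not None and node.get("optional") == "yes"


schema = ET.fromstring(SCHEMA)
print(declared_type(schema, "item", "qty"))
print(is_optional(schema, "item", "note"))
```

No special DTD-aware API is involved: the questions about structure are answered with the same parsing and path lookups an application already uses on its data.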

    No Inheritance
    A key reason for XML's popularity among system developers is its ability to easily describe fairly complex data structures. You don't have to squeeze everything into tables; if you like, you can represent a data structure as a hierarchical tree or, with the help of ID and IDREF attributes, as a directed graph.
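
Here is a small sketch of the directed-graph idea. It assumes a made-up `workflow` document in which each `step` has an `id` attribute (treated as type ID) and a `next` attribute holding space-separated references (treated as type IDREFS). Without a DTD the parser has no way to know those types, so the code hard-codes the attribute names.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "id" plays the role of an ID attribute,
# "next" the role of an IDREFS attribute.
DOC = """<workflow>
  <step id="draft" next="review"/>
  <step id="review" next="publish draft"/>
  <step id="publish"/>
</workflow>"""


def build_graph(xml_text):
    """Return a dict that maps each step id to the list of ids it points to."""
    root = ET.fromstring(xml_text)
    graph = {
        step.get("id"): step.get("next", "").split()
        for step in root.iter("step")
    }
    # A validating parser would reject a reference with no matching ID;
    # here the application has to check for dangling references itself.
    for source, targets in graph.items():
        for target in targets:
            if target not in graph:
                raise ValueError(f"{source} refers to unknown id {target!r}")
    return graph


print(build_graph(DOC))
```

The element nesting gives the tree; the ID and IDREF pairs add the extra edges, including the cycle from `review` back to `draft`, that a strict tree can't express.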

    One of the great features of the object-oriented world is the ability to define data structures as extensions of existing structures. With a well-designed hierarchy of object classes inheriting from each other, simple changes can affect as much or as little of this hierarchy as you wish.

    Developers with object-oriented experience appreciate XML's ability to define and manipulate complex data structures, but they know that specifying every detail of every data structure from the ground up isn't the most efficient way to develop a system. They want a way to base a new element type on an existing one.
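
A sketch of what "basing a new element type on an existing one" could look like. The `extends` attribute, the `elementType` vocabulary and the `attributes_of` helper are all hypothetical; DTDs offer nothing equivalent, and this is not the syntax of any particular schema proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure definition with an invented "extends" attribute.
SCHEMA = """<schema>
  <elementType name="address">
    <attribute name="street"/>
    <attribute name="city"/>
  </elementType>
  <elementType name="intlAddress" extends="address">
    <attribute name="country"/>
  </elementType>
</schema>"""


def attributes_of(schema, type_name):
    """Return the attribute names of a type, inherited ones first."""
    types = {t.get("name"): t for t in schema.iter("elementType")}
    names = []
    seen = set()
    while type_name is not None:
        if type_name in seen:
            raise ValueError(f"cyclic inheritance at {type_name!r}")
        seen.add(type_name)
        current = types[type_name]
        own = [a.get("name") for a in current.findall("attribute")]
        names = own + names  # put the base type's attributes in front
        type_name = current.get("extends")  # None once the chain ends
    return names


schema = ET.fromstring(SCHEMA)
print(attributes_of(schema, "intlAddress"))
```

A change to `address`, such as adding a postal code attribute, would reach `intlAddress` automatically, which is the kind of leverage object-oriented developers are used to.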

    Potential Messiness of Parameter Entities
    A parameter entity is a string of text or an external file that's been named so that it can be easily plugged into DTDs. The former, an "internal parameter entity," may contain a few attribute declarations that you can reuse in the attribute list declarations of several element types; an external parameter entity could be a file whose declarations will be used in multiple DTDs.

    To keep the design of complex DTDs modular and maintainable, internal parameter entities sometimes build on each other in multiple layers, leaving you with references to parameter entities that have parameter entity references themselves - and those may refer in turn to parameter entities that contain more parameter entity references. Because it's all implemented using string substitution, it can get messy quickly.
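
The layering is easy to picture with a toy expander. This sketch imitates parameter entity replacement with plain string substitution; the entity names and the `expand` helper are invented for the example, and a real XML processor applies additional rules (such as padding replacement text with spaces) that are left out here.

```python
import re

# Hypothetical internal parameter entities that build on each other.
PARAM_ENTITIES = {
    "core.attrs": "id ID #IMPLIED class CDATA #IMPLIED",
    "i18n.attrs": "lang NMTOKEN #IMPLIED",
    "common.attrs": "%core.attrs; %i18n.attrs;",
    "para.attrs": '%common.attrs; align (left|right) "left"',
}

# Matches a parameter entity reference such as %common.attrs;
REFERENCE = re.compile(r"%([A-Za-z_][\w.\-]*);")


def expand(text, entities, depth=0):
    """Replace parameter entity references in text, layer by layer."""
    if depth > 20:
        raise RecursionError("parameter entities appear to refer to themselves")

    def replace(match):
        # The replacement text may itself contain more references.
        return expand(entities[match.group(1)], entities, depth + 1)

    return REFERENCE.sub(replace, text)


print(expand("<!ATTLIST para %para.attrs;>", PARAM_ENTITIES))
```

Three layers of indirection sit behind that single reference, and nothing in the declaration itself says so. To learn what attributes `para` really has, a reader must do the same expansion by hand.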

    Specialized data structures suited to each of these purposes would give developers more robust components to mix and match when building a document type's structure.

    Weak Self-Documentation Facilities
    As with XML documents - and, for that matter, HTML documents - you can put comments in DTDs that the processor will ignore by putting them between the <!-- and --> delimiters. Like anyone else defining data structures, DTD authors have been encouraged to use these comments to explain the use of these data structures, but in keeping with tradition, they often skimp on this duty. Utilities do exist that compile reports on DTDs by examining the sibling and parent relationships of the various element types, but automated documentation generation can only go so far because a comment carries no clues about its purpose. Java, on the other hand, offers the @fieldname notation to identify specific fields of information in the comment that heads a class or method's source code, so an automated utility such as javadoc can create useful documentation with no human intervention.

    It's ironic that Java is better than XML at allowing automated documentation generation, for two reasons. First, a big factor in the popularity of SGML was the way it easily let developers create systems that automated the creation of print, Web, WinHelp and CD documentation. Second, the original idea for XML, like Java, came from Sun; it was Sun's Online Information Technology Architect Jon Bosak who put together the W3C Working Group that devised a simpler version of SGML that would work more easily over the Web.

    Replacement Candidates
    Three groups of W3C member companies and a mailing list devoted to cutting-edge XML issues each assembled alternatives to XML 1.0 DTDs and submitted them to the W3C. Each proposal addresses some or all of the problems described here. Just as schemas express DTDs as much as the SGML-like XML 1.0 style does, XML 1.0 DTDs also qualify as "schemas," but in common practice people refer to the XML 1.0 way as "DTDs" and the new ways as "schemas." In addition to the W3C's Schema proposal, you may have heard of eight other schema proposals, but really only four were submitted - other names refer to earlier names or subsets of these four.

    A group of eight authors, five of whom worked for Microsoft or DataChannel (a Redmond company that's done a lot of XML work with Microsoft), submitted the XML-Data proposal to the W3C on January 5, 1998, making it the only proposal to predate XML's ascent to Recommendation status. A simplified version of XML-Data known as XML-Data Reduced, or XDR, was submitted to the W3C on July 3, 1998. On Microsoft's Web site XDR is also known simply as "schemas," with no mention of its full name, greatly adding to the confusion over schemas. Just remember that when Microsoft literature describes the use of schemas with IE5 or BizTalk, it means XDR.

    Microsoft, IBM and independent consultant Tim Bray submitted the Document Content Description (DCD) schema proposal on July 31, 1998. It expresses document structure using the XML-based Resource Description Framework (RDF). While neither Microsoft nor IBM has shown any interest in following up with DCD or even RDF since then, the eXcelon product from Object Design (now eXcelon Corporation) still uses the DCD format to store its own schemas.

    Before e-commerce software developer Commerce One acquired Veo Systems, developers at Veo submitted "Schema for Object-Oriented XML" (SOX) to the W3C on September 9, 1998. True to its full name, SOX makes mapping between element type declarations and object-oriented data structure definitions simpler and more straightforward than its predecessors do. The SOX proposal's frequent use of the term "electronic commerce" gives another clue about what kind of application development concerns drove its design.

    Finally, the xml-dev mailing list that gave the world the Simple API for XML (SAX, the standard event-driven API to XML documents) also submitted the Document Definition Markup Language, or DDML (also known as "XSchema" and "XSD" along the way), on January 19, 1999. Although no one ever implemented it, DDML indicated to the W3C where an important group of XML developers saw the priorities in schema language development.

    After receiving these proposals, the W3C took authors and editors from each of them and assembled a working group to put together its own schema proposal. After publishing a requirements document in February 1999, the group released the first draft of its two-part proposal in May and the most recent draft in December. In my next article we'll take a look at some of the features in the W3C's proposal.

    About Bob DuCharme
    Bob DuCharme is an assistant vice president at Moody's Investors Service, where he oversees the implementation of SGML and XML systems. The author of XML: The Annotated Specification published by Prentice Hall, Bob received his master's degree in computer science from New York University.
