Replace DTDs? Why?

Of all the standards to accompany XML that are currently in progress at the W3C, few are more anxiously awaited than the Schema standard - the specification that provides an alternative to XML 1.0 DTDs as a way to describe a document's structure. But what's wrong with XML 1.0 DTDs? How many alternatives have been proposed, and by whom? Why didn't the W3C address these concerns in the original XML 1.0 specification instead of waiting until now? I'll answer those questions in this column, and in my next column we'll look at the current state of the W3C Schema Working Group's unfinished proposal.

What Can They Do?
Just as a compiler can process the source code of a particular programming language more effectively if the program's data structures are declared up front, an XML processor is more efficient if it knows what kind of data structures to expect before it begins reading a document. XML 1.0 DTDs - which I will hereafter refer to as DTDs, although technically schemas express DTDs as well - can have four or five kinds of declarations, depending on whether you consider comments to be declarations (the XML spec is vague on this point):

  • Element type declarations: An element type is a named class of elements, such as h1, img or p in HTML or para or listitem in the DocBook DTD.
  • Attribute list declarations: An attribute list declaration lists the attributes for a given element type. The attribute list for HTML's img element type includes the src, alt and align attributes.
  • Entity declarations: Entities name collections of information that a DTD or document can reuse elsewhere. An entity may represent a single character of text, a string of text or a complete file sitting outside the DTD.
  • Notation declarations: When a DTD declares an external non-XML, or "unparsed," entity, it must identify the entity's format. A notation declaration tells the processor: "Here's a legal format for this document type's unparsed entities."
  • Comments: These look just like they look in HTML: <!-- like this -->. This is information for the parser to ignore.
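
To make the five kinds of declarations concrete, here is a minimal Python sketch that feeds a small document with an internal DTD subset to the standard library's expat bindings and records each declaration the parser reports. The memo document type, its priority attribute, the co entity and the gif notation are all invented for this illustration.

```python
import xml.parsers.expat

# Hypothetical document type, invented for illustration only.
DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!-- The memo element holds plain text. -->
  <!ELEMENT memo (#PCDATA)>
  <!ATTLIST memo priority CDATA #IMPLIED>
  <!ENTITY co "Example Corp">
  <!NOTATION gif SYSTEM "image/gif">
]>
<memo priority="high">Greetings from &co;</memo>"""


def list_declarations(xml_text):
    """Return (kind, name) pairs for each DTD declaration expat reports."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()

    def on_element(name, model):
        seen.append(("element", name))

    def on_attlist(elname, attname, type_, default, required):
        seen.append(("attlist", elname + "/@" + attname))

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        seen.append(("entity", name))

    def on_notation(name, base, system_id, public_id):
        seen.append(("notation", name))

    def on_comment(data):
        seen.append(("comment", data.strip()))

    parser.ElementDeclHandler = on_element
    parser.AttlistDeclHandler = on_attlist
    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.CommentHandler = on_comment
    parser.Parse(xml_text, True)
    return seen


for kind, name in list_declarations(DOC):
    print(kind, name)
```

Each handler corresponds to one bullet in the list; the comment handler fires for comments anywhere in the document, not only those inside the DTD.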

    What's wrong with these?

    Weak Data Typing
    The most common complaint about XML from people who come to it from the database and programming worlds (as opposed to those coming from the SGML publishing and HTML Web design worlds) is the lack of data typing. When these developers declare or define a named piece of information - for example, a field in a database or a variable in some Java or C++ code - they're accustomed to naming its type and then assuming that the processing engine underneath their application will ensure that any information stuffed into that slot conforms to that type. Once they declare an XML Quantity or RetailPrice element type, they don't want to write extra application code to ensure that the strings between the start and end tags really are integers and currency figures. Extra error-checking code isn't just annoying to write; it adds fat to the thin clients that XML is supposed to be so great for.

    This wasn't a big deal in the SGML world because nearly every application was a publishing application. With XML's popularity in e-commerce development, data values like quantities and especially prices become more important. Although XML 1.0 offers a few types that help constrain attribute values, classic types such as integers, real numbers, Booleans and dates aren't among the choices, and application developers need them for element content as well as attribute values.
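
The following Python sketch suggests what that extra error-checking code tends to look like when the DTD cannot state the types itself. The Order wrapper element and the sample values are assumptions made for the example; only the Quantity and RetailPrice names come from the discussion above.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

# Hypothetical order document; a DTD can say these elements hold text,
# but not that the text must be an integer or a currency figure.
ORDER = """<Order>
  <Quantity>12</Quantity>
  <RetailPrice>19.95</RetailPrice>
</Order>"""


def check_order(xml_text):
    """Return a list of type errors the application has to find by itself."""
    root = ET.fromstring(xml_text)
    errors = []

    quantity = (root.findtext("Quantity") or "").strip()
    try:
        int(quantity)
    except ValueError:
        errors.append("Quantity is not an integer: %r" % quantity)

    price = (root.findtext("RetailPrice") or "").strip()
    try:
        Decimal(price)
    except InvalidOperation:
        errors.append("RetailPrice is not a decimal number: %r" % price)

    return errors


print(check_order(ORDER))
print(check_order(ORDER.replace("12", "a dozen")))
```

With typed element content declared in a schema, checks of this kind could move out of every application and into the processor.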

    Document Structure Not Stored in an XML Document
    DTD declarations have their own syntax that, despite using the "< >" angle brackets, is quite different from XML document syntax. Many newcomers to XML ask why XML isn't used to represent the structure of its own documents. The original answer was that XML was designed to be completely compatible with SGML, which had a larger base of applications and tools than most new XML users realize. These applications and tools played a big role in XML's initial jumpstart.

    Since then, a revision to the SGML standard allows for legal SGML documents without the DTD declarations used to specify document structure - that is, to have what the XML world calls "well-formed documents." If an XML document with no DTD can still be a legal SGML document, then the primary reason for using SGML DTD syntax no longer applies.

    Another argument against specifying DTD structure with XML elements was that it would be confusing to include elements that describe other elements right in there with the elements that they describe. As it turned out, no one does this anyway; schema documents are always kept separate from the documents they describe, and documents point to their schemas with a processing instruction, a namespace declaration or some other mechanism.

    Using XML elements to describe document structures has several benefits. It makes these structures much easier to develop because you can use any XML editor to edit and manipulate them - and I mean any XML editor, even the lame ones that merely dump your document to a visual tree and then write that tree back out when you save your document. (Paragraphs of text like the ones you're reading here are very cumbersome to edit on such an editor, but a schema document is naturally treelike.) Application development is also easier for documents whose structure is stored in a well-formed XML document, because applications have easier access to information about document structure. SAX and DOM, the two current XML API standards, offer very little to an application that wants to check DTD information such as an attribute's declared type or whether a particular element is optional. With document structure definitions stored in a DOM tree or triggering the same SAX events that the document's elements trigger, an application can find out all it wants about that structure.
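
As a rough sketch of that last point, suppose a document type's structure were stored in a small XML vocabulary. The schema, elementType, child and attribute names below are invented for this example and are not the syntax of any actual proposal. An application could then answer structural questions with the same tree API it uses for ordinary documents:

```python
import xml.etree.ElementTree as ET

# Hypothetical schema vocabulary, invented for this sketch.
SCHEMA = """<schema>
  <elementType name="order">
    <child name="quantity" occurs="required"/>
    <child name="note" occurs="optional"/>
    <attribute name="currency" type="enumeration" values="USD EUR"/>
  </elementType>
</schema>"""


def find_element_type(schema_root, name):
    """Locate the definition of one element type in the schema tree."""
    for element_type in schema_root.iter("elementType"):
        if element_type.get("name") == name:
            return element_type
    raise KeyError(name)


def is_optional(schema_root, parent, child):
    """Report whether a child element is optional inside its parent."""
    for entry in find_element_type(schema_root, parent).findall("child"):
        if entry.get("name") == child:
            return entry.get("occurs") == "optional"
    raise KeyError(child)


def attribute_type(schema_root, element, attribute):
    """Return the declared type of an attribute."""
    for entry in find_element_type(schema_root, element).findall("attribute"):
        if entry.get("name") == attribute:
            return entry.get("type")
    raise KeyError(attribute)


schema_root = ET.fromstring(SCHEMA)
print(is_optional(schema_root, "order", "note"))
print(attribute_type(schema_root, "order", "currency"))
```

No special DTD-aware API is involved; the structure definition is just another XML tree.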

    No Inheritance
    A key reason for XML's popularity among system developers is its ability to easily describe fairly complex data structures. You don't have to squeeze everything into tables; if you like, you can represent a data structure as a hierarchical tree or, with the help of ID and IDREF attributes, as a directed graph.

    One of the great features of the object-oriented world is the ability to define data structures as extensions of existing structures. With a well-designed hierarchy of object classes inheriting from each other, simple changes can affect as much or as little of this hierarchy as you wish.

    Developers with object-oriented experience appreciate XML's ability to define and manipulate complex data structures, but they know that specifying every detail of every data structure from the ground up isn't the most efficient way to develop a system. They want a way to base a new element type on an existing one.
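
A minimal Python sketch of the idea, assuming element type definitions are held in a plain dictionary: the "extends" key, the address and usAddress types and their children are all hypothetical, chosen only to show a new type picking up the content of an existing one.

```python
# Hypothetical element type definitions; "extends" is an invented key
# naming the existing type that a new type is based on.
ELEMENT_TYPES = {
    "address": {"extends": None, "children": ["street", "city"]},
    "usAddress": {"extends": "address", "children": ["state", "zip"]},
}


def content_model(name, types=ELEMENT_TYPES):
    """Return the full child list for an element type, base children first."""
    definition = types[name]
    base = definition["extends"]
    inherited = content_model(base, types) if base else []
    return inherited + definition["children"]


print(content_model("usAddress"))
```

A change to the address definition would flow through to usAddress automatically, which is exactly what DTDs cannot express without copying declarations or resorting to parameter entities.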

    Potential Messiness of Parameter Entities
    A parameter entity is a string of text or an external file that's been named so that it can be easily plugged into DTDs. The former, an "internal parameter entity," may contain a few attribute declarations that you can reuse in the attribute list declarations of several element types; an external parameter entity could be a file whose declarations will be used in multiple DTDs.

    To keep the design of complex DTDs modular and maintainable, internal parameter entities sometimes build on each other in multiple layers, leaving you with references to parameter entities that have parameter entity references themselves - and those may refer in turn to parameter entities that contain more parameter entity references. Because it's all implemented using string substitution, it can get messy quickly.

    Specialized data structures suited to each of these purposes would give developers more robust components to mix and match when building a document type's structure.
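
To see how the layering plays out, the Python sketch below imitates parameter entity expansion as plain string substitution. It is an illustration under stated assumptions, not a conforming XML processor, and the entity names (id.att, lang.att, core.atts, common.atts) are invented for the example.

```python
import re

# Hypothetical parameter entities layered on one another, as in a modular DTD.
PARAMETER_ENTITIES = {
    "id.att": "id ID #IMPLIED",
    "lang.att": "xml:lang NMTOKEN #IMPLIED",
    "core.atts": "%id.att; %lang.att;",
    "common.atts": "%core.atts; class CDATA #IMPLIED",
}

# A parameter entity reference looks like %name;
REFERENCE = re.compile(r"%([A-Za-z_][\w.\-]*);")


def expand(text, entities, depth=0, max_depth=20):
    """Replace %name; references by string substitution until none remain."""
    if depth > max_depth:
        raise ValueError("parameter entities nested too deeply (possible cycle)")

    def substitute(match):
        # Expand the replacement text too, since it may hold more references.
        return expand(entities[match.group(1)], entities, depth + 1, max_depth)

    return REFERENCE.sub(substitute, text)


print(expand("<!ATTLIST para %common.atts;>", PARAMETER_ENTITIES))
```

Three layers of indirection sit behind the single reference in the attribute list declaration, and nothing in the syntax says which entities hold attribute definitions and which hold something else.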

    Weak Self-Documentation Facilities
    As with XML documents - and, for that matter, HTML documents - you can put comments in DTDs that the processor will ignore by putting them between the <!-- and --> delimiters. Like anyone else defining data structures, DTD authors have been encouraged to use these comments to explain the use of these data structures, but in keeping with tradition, they often skimp on this duty. Utilities do exist that compile reports on DTDs by examining the sibling and parent relationships of the various element types, but serious automation of documentation generation can only go so far because of the lack of clues about each comment's purpose. Java, on the other hand, offers the @fieldname notation to identify specific fields of information in the header of a class or method's source code, making it easier for an automated utility such as javadoc to create useful documentation with no human intervention.

    It's ironic that Java is better than XML at allowing automated documentation generation, for two reasons. First, a big factor in the popularity of SGML was the way it easily let developers create systems that automated the creation of print, Web, WinHelp and CD documentation. Second, the original idea for XML, like Java, came from Sun; it was Sun's Online Information Technology Architect Jon Bosak who put together the W3C Working Group that devised a simpler version of SGML that would work more easily over the Web.

    Replacement Candidates
    Three groups of W3C member companies and a mailing list devoted to cutting-edge XML issues each assembled alternatives to XML 1.0 DTDs and submitted them to the W3C. Each proposal addresses some or all of the problems described here. Just as schemas express DTDs as much as the SGML-like XML 1.0 style does, XML 1.0 DTDs also qualify as "schemas," but in common practice people refer to the XML 1.0 way as "DTDs" and the new ways as "schemas." In addition to the W3C's Schema proposal, you may have heard of eight other schema proposals, but really only four were submitted - other names refer to earlier names or subsets of these four.

    A group of eight authors, five of whom worked for Microsoft or DataChannel (a Redmond company that's done a lot of XML work with Microsoft), submitted the XML-Data proposal to the W3C on January 5, 1998, making it the only proposal to predate XML's ascent to Recommendation status. A simplified version of XML-Data known as XML-Data Reduced, or XDR, was submitted to the W3C on July 3, 1998. On Microsoft's Web site XDR is also known simply as "schemas," with no mention of its full name, greatly adding to the confusion over schemas. Just remember that when Microsoft literature describes the use of schemas with IE5 or BizTalk, it means XDR.

    Microsoft, IBM and independent consultant Tim Bray submitted the Document Content Description (DCD) schema proposal on July 31, 1998. It expresses document structure using the XML-based Resource Description Framework (RDF). While neither Microsoft nor IBM has shown any interest in following up with DCD or even RDF since then, Object Design's (now eXcelon Corporation) eXcelon product still uses the DCD format to store its own schemas.

    Before e-commerce software developer Commerce One acquired Veo Systems, developers at Veo submitted "Schema for Object-Oriented XML" (SOX) to the W3C on September 9, 1998. True to its full name, SOX makes mapping between element type declarations and object-oriented data structure definitions simpler and more straightforward than its predecessors do. The SOX proposal's frequent use of the term "electronic commerce" gives another clue about what kind of application development concerns drove its design.

    Finally, the xml-dev mailing list that gave the world the Simple API for XML (SAX, the standard event-driven API to XML documents) also submitted the Document Definition Markup Language, or DDML (also known as "XSchema" and "XSD" along the way), on January 19, 1999. Although no one ever implemented it, DDML indicated to the W3C where an important group of XML developers saw the priorities in schema language development.

    After receiving these proposals, the W3C took authors and editors from each of them and assembled a working group to put together their own schema proposal. After publishing a requirements document in February 1999, they released the first draft of their two-part proposal in May and the most recent in December. In my next article we'll take a look at some of the features in the W3C's proposal.

    About Bob DuCharme
    Bob DuCharme is an assistant vice president at Moody's Investors Service, where he oversees the implementation of SGML and XML systems. The author of XML: The Annotated Specification published by Prentice Hall, Bob received his master's degree in computer science from New York University.
