Maximizing the Usefulness of XML
By: Chris Brandin
Is XML documents or data? Should XML be managed with a database or a document management system? Should XML be managed at all, or is it simply a data interchange standard? These are among the most common questions people ask about XML when they're trying to get a grip on what XML is. These questions get at what we do with XML, not what it is. Another question: Is a computer a word processor or a video game? Neither? Both? Actually, a computer is more. Word processors are computers, as are video games; but it doesn't work the other way around. We cannot define a computer in terms of a single application, or even a multitude of applications for that matter. Similarly, we cannot restrict our definition of XML in terms of its uses.
XML is a standard for expressing information in a complete form. It contains not only data, but also context and attributes. XML is, in other words, informationally complete. That's why it's useful for so many different things. XML plays an important role in unifying information from disparate applications. It can likewise play a role in unifying disparate functions within applications. Does this mean that we can use XML for every aspect of what an application does - gather, present, store, manage, transport, transform, and process information? In a word - yes. There are a number of important advantages in doing this, among them:
As long as XML was used as a container for data it was sufficient to consider only syntax when building documents. To do more, we must consider grammar and style as well. Obviously, proper syntax is necessary for XML to be usable at all. Good grammar ensures that once XML information has been created, it can be subsequently interpreted without an inordinate need for specific (and redundant) domain knowledge on the part of application program components. Good style ensures good application performance, particularly when it comes to storing, retrieving, and managing information.
Most application programs encompass the same basic functions - input, presentation, communication, processing, and information management. Although the same information underlies each of these functions, different data models have traditionally been employed to accommodate application components that accomplish these tasks. Even with XML, remodeling the same information in different ways for each component of an application is inefficient and yields programs that are buggy and difficult to maintain and change. A better way is to create a central, unified model for information that can adequately accommodate all functions of an application (see Figure 1).
To create XML information models that can do more than accommodate single functions, we have to look at XML in a different way - as an information domain rather than simply a transport standard. For simple data transport applications we can use XML any way we like, as long as it is syntactically correct. To do more, complete, provable, and unambiguous XML models must be built.
Creating robust universal information models in XML is easier than it may seem at first. The first step is to understand the basic patterns inherent to information expressed in XML. XML consists primarily of tags, attributes, and data elements. Tags provide context; in other words, tags describe what data elements in their scope are. Attributes provide information about or indicate how to interpret data elements in their scope. Data elements represent data in the traditional sense. The structure of XML also provides information - about hierarchy, groupings, relationships, etc.
It is possible to create meaningless XML. For example, you could create perfectly correct XML by taking the entire text from a telephone book and simply putting "<Telephone_Book>" at the beginning and "</Telephone_Book>" at the end. It would be perfectly correct XML, but not useful XML. In this case XML is being used solely as a container for a block of data, and provides no context for the information contained therein. At the other extreme we have XML where all information is expressed in a semantically meaningful way. For example, consider the following XML fragment:
<Telephone_Book_Listing>
The explicit patterns in the example above are:
Explicit patterns are typically converted to database columns with some or all of them being available as query terms. At least one XML information management system (NeoCore XMS) automatically organizes itself around these natural XML patterns and indexes them all without the need for any database design. Implicit patterns are used to determine groupings, relationships, and sometimes convergence points for query set intersections.
Much like a Web browser can determine how to display HTML information based on presentation metadata embedded in the HTML, application components can determine how to treat and interpret XML information based on semantic metadata embedded in it. XML can contain any kind of metadata. This is where XML differs from HTML. Where HTML was targeted to a single function - the presentation of information - XML fulfills a more universal purpose - the complete characterization of information.
Key to creating useful XML is to create semantically meaningful XML first. The easiest way to do this is to simply create XML representations that are easy for humans to read and understand, which is what we did in the sample XML fragment. If you were to create a manual entry form for the XML fragment it would probably look something like Figure 2. Note that the XML fragment is a direct and obvious analog of how you would represent the information on a manual entry form. All we have done is represent the preprinted parts as tags and the filled-in parts as data elements. The hierarchy we created is likewise obvious.
This method of creating grammatically valid XML may seem so simple as to hardly be worth mentioning, but the fact is that grammatically valid XML is relatively rare. To illustrate why, we will examine some common mistakes programmers make when creating XML.
These examples will use an application dealing with colorimeter readings. A colorimeter is a device that measures color using tristimulus readings utilizing a number of color models. For example, a colorimeter can be used to measure computer monitor colors using red-green-blue components of light (this being the "RGB" color model). This particular colorimeter can provide readings in multiple resolutions. A manual form for transcribing colorimeter readings might look something like Figure 3 after being filled in. A typical way to characterize this information in XML follows:
<colorimeter_reading>
Here, the information modeler has made a couple of optimizations in the interest of saving space. First, the "Color Model: RGB" field has been collapsed into the single tag, "<RGB>". This is a reasonable thing to do as the color model information arguably (and unambiguously) indicates what the readings are for, and therefore qualifies as true context. Second, the readings themselves have been collapsed into three attributes: "red=0", "green=255", and "blue=255". Expressing data elements as attributes is a common practice. Although syntactically correct, this approach brings a number of problems:
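As a rough illustration of why attribute-packed readings can be awkward for generic tooling, the Python sketch below compares two forms of the reading. The attribute form follows the description above; the element form, the "resolution" attribute, and the describe() helper are assumptions made for this sketch and are not the article's original listing.

```python
import xml.etree.ElementTree as ET

# Illustrative fragments only. The attribute form mirrors the description in
# the text; the element form and the "resolution" attribute are assumptions.
AS_ATTRIBUTES = (
    '<colorimeter_reading>'
    '<RGB resolution="8" red="0" green="255" blue="255"/>'
    '</colorimeter_reading>'
)
AS_ELEMENTS = (
    '<colorimeter_reading>'
    '<RGB resolution="8"><red>0</red><green>255</green><blue>255</blue></RGB>'
    '</colorimeter_reading>'
)


def describe(xml_text):
    """Return (data_paths, attribute_paths) as a generic reader would see them."""
    root = ET.fromstring(xml_text)
    data_paths, attribute_paths = [], []

    def walk(node, path):
        here = f"{path}/{node.tag}"
        # Every attribute looks the same to a generic reader, whether it
        # carries a reading or says how to interpret one.
        attribute_paths.extend(f"{here}/@{name}" for name in node.attrib)
        text = (node.text or "").strip()
        if text:
            data_paths.append((here, text))
        for child in node:
            walk(child, here)

    walk(root, "")
    return data_paths, attribute_paths


if __name__ == "__main__":
    for label, doc in (("attributes", AS_ATTRIBUTES), ("elements", AS_ELEMENTS)):
        data, attrs = describe(doc)
        print(label, "data:", data, "attributes:", attrs)
```

For the attribute form, describe() finds no data paths and four attributes, so the three readings cannot be told apart from the interpretive "resolution" attribute without extra domain knowledge. For the element form it finds three data paths and a single attribute.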
Another common way this form might be modeled is shown in Listing 1. Here, the entire contents of the form have been flattened into one level of hierarchy. There are several problems with this model:
Listing 2 shows the most literal XML representation of the colorimeter readings form.
Although this listing is grammatically correct, it is not optimal - nor is it particularly good style. This model represents another common practice: expressing context as data elements in name/value pairs rather than as tags. The weaknesses of this model include:
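The difference between carrying context in data elements and carrying it in tags can be sketched in Python as follows. Both fragments are hypothetical stand-ins rather than the article's listings, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical name/value-pair model: context ("green") is stored as data.
NAME_VALUE = """
<colorimeter_reading>
  <reading><name>red</name><value>0</value></reading>
  <reading><name>green</name><value>255</value></reading>
  <reading><name>blue</name><value>255</value></reading>
</colorimeter_reading>
"""

# Hypothetical tagged model: context is expressed by the tag itself.
TAGGED = """
<colorimeter_reading>
  <red>0</red>
  <green>255</green>
  <blue>255</blue>
</colorimeter_reading>
"""


def green_from_name_value(xml_text):
    """Scan every pair; the caller must know <name> is context and <value> is data."""
    root = ET.fromstring(xml_text)
    for reading in root.findall("reading"):
        if reading.findtext("name") == "green":
            return int(reading.findtext("value"))
    return None


def green_from_tags(xml_text):
    """A plain path lookup is enough because the tag carries the context."""
    return int(ET.fromstring(xml_text).findtext("green"))


if __name__ == "__main__":
    print(green_from_name_value(NAME_VALUE))
    print(green_from_tags(TAGGED))
```

Both functions return the same reading, but the first needs application-specific knowledge of the pair convention and a scan over all pairs, while the second relies only on the path to the data element.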
There's no one "perfect" XML information model for this application. By applying a few simple techniques, however, a model can be built that is unambiguous, performs well, and will serve all components of the application program. First we examine each information item in order to determine what it really is - data, context, or attribute. Going down the list of items in the original colorimeter reading form, the following interpretations and actions would certainly be reasonable:
Converting all this into an XML fragment yields the following:
<colorimeter_reading>
This model fulfills all the requirements of good grammar and good style, yet remains relatively terse:
So far, we have discussed creating semantically valid XML and how it applies to how application components interpret information. The next step involves how applications can be architected to leverage XML as a central information model in an unambiguous and provable environment. To accomplish this, XML must be used in a somewhat more stringent way than many application developers are accustomed to. From the discussions above we can establish a series of rules for modeling information in XML:
The last rule listed is very important and bears some explanation. XML is very flexible; in fact, it is too flexible in the way it is often used. The one practice most responsible for making XML information models difficult to prove and control is the abuse of attributes. Developers regularly use tag/data element pairs and attributes interchangeably in the interest of avoiding data-bloat (an unnecessary practice because even trivial compression techniques can eliminate that problem). This brings an unfortunate ambiguity to information expressed in XML.
Attributes are intended to provide information about, or indicate how to interpret, items in their scope. If an application program encounters a tag that it does not understand, it will ignore everything in its scope, including all attributes. If an application program recognizes all the tags leading up to a data element, for example, it knows what the data element is; but if an attribute is encountered that is not understood, the application program may or may not know how to interpret the data element, and this situation is ambiguous.
In order to enforce information integrity controls we need a provable, unambiguous mechanism. Attributes and XML Schema definitions provide suitable mechanisms to do this, as long as an exception is thrown whenever an attribute that is not understood is encountered. To control XML information we need to use a combination of attribute interpretation and schema validation. This gives us an arbitrary and fine-grained degree of control unachievable with traditional databases. This should be done once, with a centrally controlled mechanism.
To accomplish this we need to architect enterprise applications in a new way. Traditionally, application programs serve as user interfaces and database systems manage and control data. In the XML world, a three-component architecture should be employed: the user-facing application, the XML information management system, and an information integrity enforcer (schema validator and attribute interpreter) (see Figure 4).
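A minimal Python sketch of what an information integrity enforcer might do is shown here. It is an illustration under stated assumptions: the allowed paths, the known attribute, and the names enforce() and IntegrityError are hypothetical, and a simple set of permitted tag paths stands in for real XML Schema validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET


class IntegrityError(Exception):
    """Raised when a document fails the enforcer's checks."""


# Hypothetical rules for this sketch; a real enforcer would load a schema.
ALLOWED_PATHS = {
    "colorimeter_reading",
    "colorimeter_reading/RGB",
    "colorimeter_reading/RGB/red",
    "colorimeter_reading/RGB/green",
    "colorimeter_reading/RGB/blue",
}
KNOWN_ATTRIBUTES = {"resolution"}


def enforce(xml_text):
    """Return the parsed root if the document passes, else raise IntegrityError."""
    root = ET.fromstring(xml_text)

    def walk(node, path):
        here = f"{path}/{node.tag}" if path else node.tag
        # Stand-in for schema validation: only known tag paths are accepted.
        if here not in ALLOWED_PATHS:
            raise IntegrityError(f"unexpected element: {here}")
        # Attribute interpretation: an attribute that is not understood is an
        # error, never something to be silently ignored.
        for name in node.attrib:
            if name not in KNOWN_ATTRIBUTES:
                raise IntegrityError(f"attribute not understood: {name} on {here}")
        for child in node:
            walk(child, here)

    walk(root, "")
    return root


if __name__ == "__main__":
    ok = '<colorimeter_reading><RGB resolution="8"><red>0</red></RGB></colorimeter_reading>'
    bad = '<colorimeter_reading><RGB units="nm"><red>0</red></RGB></colorimeter_reading>'
    print(enforce(ok).tag)
    try:
        enforce(bad)
    except IntegrityError as err:
        print("rejected:", err)
```

Keeping both checks in one function reflects the idea that integrity rules are applied once, in a single centrally controlled place, rather than separately in each application component.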
The information integrity enforcer can be implemented in a number of ways: as a server-side extension, as a standard application program component, or as a Web service. The trick is to have a common information integrity enforcer for each category of application. In order to have applications interact with XML information in a consistent and provable manner, the following model can be adopted:
Following these XML information modeling guidelines is not substantially more difficult than building semantically weak XML, and the payback can be considerable. Not only can XML be used to integrate disparate data sources, but it can be used as a unified means to express the information underlying functional components of application programs as well. This results in better application programs that can be built faster and maintained with less effort. One thing is inevitable: as information technology progresses into things like the semantic Web and more and more computer systems need to be integrated, information expressed in XML will have to become increasingly semantically valid in order for us to be able to keep up. XML already contains all the elements necessary to do this - we just have to be a little more thoughtful about the ways we use it.