XML in Transit: Encoding Data

I just came back from the first face-to-face meeting of the W3C working group on XML Protocol (is it just me, or is the name somewhat odd-sounding?), and I'm wondering what topics to exclude from this column. Yes, that's right - exclude. Encoding data in XML is a difficult topic for many reasons. First, it's one of those technical subjects in which you need to look at lots of XML instance/schema/DTD snippets. Second, the devil is very much in the details, and there are lots of them. Last but not least, there are as many ways to encode data in XML as there are data encoding needs. With this caveat, let's dive in. In keeping with the spirit of the column, we'll touch on the issues that are most relevant to XML protocols.

Have Protocol, Will Move Data
Imagine that some good people have developed a flexible and extensible XML protocol that can work with arbitrary data encoding styles. For example, SOAP defines an attribute in the envelope namespace - SOAP-ENV:encodingStyle - whose value is a URI identifying a particular encoding style. The encoding style applies to the element associated with the attribute as well as its content, excluding any child elements decorated with an encoding style specifier. For a quick refresher, peek at the following code:

<x:UpdateStock xmlns:x="Some URI">
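
The scoping rule for SOAP-ENV:encodingStyle can be sketched in a few lines of Python. This is only an illustration of the inheritance rule described above, not part of any SOAP toolkit: the function name effective_encoding_styles, the sample envelope, and the urn:example:... URIs are assumptions made for the example. The two schemas.xmlsoap.org URIs are the SOAP 1.1 envelope and encoding namespaces.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
ENCODING_ATTR = f"{{{SOAP_ENV}}}encodingStyle"


def effective_encoding_styles(element, inherited=None):
    """Yield (element, style) pairs for an element and all its descendants.

    An element's own encodingStyle attribute wins; otherwise the value in
    scope on its parent applies.
    """
    style = element.get(ENCODING_ATTR, inherited)
    yield element, style
    for child in element:
        yield from effective_encoding_styles(child, style)


# Hypothetical message: the namespaces and element names under x: are made up.
DOC = """
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <x:UpdateStock xmlns:x="urn:example:stock">
      <x:item SOAP-ENV:encodingStyle="urn:example:custom-encoding">widget</x:item>
      <x:count>3</x:count>
    </x:UpdateStock>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
"""

# Map each element's local name to the encoding style in effect for it.
styles = {
    el.tag.split("}")[1]: style
    for el, style in effective_encoding_styles(ET.fromstring(DOC))
}
```

Here x:item carries its own specifier and so opts out of the envelope-level style, while x:count inherits it.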

There are many data transport scenarios and many possible data encoding styles that can be used with them. To put some structure to the discussion, think of the decision space as a choice tree. A choice tree has yes/no questions at its nodes and outcomes at its leaves (see Figure 1).

XML Data
Probably the most common choice involves whether the data is already in (or can easily be put into) an XML format. If we can represent the data as XML, we need only to decide how to include it in the XML instance document that will represent a message in the protocol. Ideally, we could just mix it in amid the protocol-specific XML, but under a different namespace (as shown in the previous code snippet). There are several benefits to this approach:

  1. The message is easy to construct and process using standard XML tools.
  2. Its contents can be queried using XQuery.
  3. If need be, it can be transformed using XSLT.

There's a catch.... The problem has to do with a seldom-considered but important aspect of XML - the uniqueness rule for ID attributes. The values of attributes of type ID must be unique in an XML instance so that the elements with these attributes can be conveniently referred to using attributes of type IDREF (following code snippet). (For more information on the uses of ID/IDREF read "Eliminating Redundancy in XML Using ID/IDREF" [XML-J, Vol. 1, issue 4].)

<Target id="mainTarget"/>
<Reference href="#mainTarget"/>

If your data doesn't use ID attributes you can include it inline (textually) in the XML protocol message under a separate namespace. However, if you do use ID attributes you'll run the risk of violating the uniqueness rule. For example, in the following code both message elements have the same id. This makes the document invalid XML. And no, namespaces do not address the issue. In fact, the problems are so serious that nothing short of a change in the core XML specification and in most XML processing tools can change the status quo. Don't wait for this to happen.

<message id="msg-1">
  A message with an attached <a href="#msg-1">message</a>.
  <attachment id="attachment-1">
    <!-- ID conflict right here -->
    <message id="msg-1">
      This is a textually included message.
    </message>
  </attachment>
</message>
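
A non-validating parser will accept a document like the one above without complaint, so a toolset may want to look for the conflict itself before including content. The following Python sketch does that under one loud assumption: every attribute named id is treated as being of type ID, whereas in real documents that is declared by the DTD or schema. The function name duplicate_ids is made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text, id_attr="id"):
    """Return the sorted ID values that occur more than once in the document.

    Assumption: any attribute called `id_attr` is an ID-typed attribute.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


DOC = """<message id="msg-1">
  A message with an attached <a href="#msg-1">message</a>.
  <attachment id="attachment-1">
    <message id="msg-1">This is a textually included message.</message>
  </attachment>
</message>"""

conflicts = duplicate_ids(DOC)  # the clashing ID values, if any
```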

There are two ways to work around the problem. If no one ever externally references specific IDs within the protocol message data, your XML protocol toolset can automatically rewrite the IDs and references to them as you include the XML inside the message (see code below). This will give you the benefits described above at the cost of some extra processing and a slight deterioration in readability due to the machine-generated IDs.

<message id="msg-1">
  A message with an attached <a href="#id-9137">message</a>.
  <attachment id="attachment-1">
    <!-- ID has been changed -->
    <message id="id-9137">
      This is a textually included message.
    </message>
  </attachment>
</message>
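
The rewriting step might look roughly like this in Python. It is a minimal sketch, not the behavior of any particular toolset: rewrite_ids is a hypothetical helper, it assumes IDs live in id attributes and references are href="#..." attributes, and it fixes only references inside the included fragment. References from the enclosing message into the fragment would need the same old-to-new mapping applied.

```python
import itertools
import xml.etree.ElementTree as ET


def rewrite_ids(fragment, taken, counter):
    """Rename IDs in `fragment` that clash with `taken`; return {old: new}.

    `taken` is the set of IDs already used by the enclosing message and
    `counter` supplies numbers for the machine-generated replacements.
    """
    fragment_ids = {
        el.get("id") for el in fragment.iter() if el.get("id") is not None
    }
    reserved = set(taken) | fragment_ids
    renamed = {}
    for old in sorted(fragment_ids & set(taken)):
        new = f"id-{next(counter)}"
        while new in reserved:          # never generate an ID that is in use
            new = f"id-{next(counter)}"
        reserved.add(new)
        renamed[old] = new
    for el in fragment.iter():
        if el.get("id") in renamed:
            el.set("id", renamed[el.get("id")])
        ref = el.get("href", "")
        if ref.startswith("#") and ref[1:] in renamed:
            el.set("href", "#" + renamed[ref[1:]])
    return renamed


outer = ET.fromstring('<message id="msg-1"><attachment id="attachment-1"/></message>')
inner = ET.fromstring('<message id="msg-1">See <a href="#msg-1">this message</a>.</message>')

taken = {el.get("id") for el in outer.iter() if el.get("id")}
renamed = rewrite_ids(inner, taken, itertools.count(9137))
outer.find("attachment").append(inner)  # now safe to include inline
```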

However, if you can't do this, you'll have to include the XML as an opaque chunk of text inside your protocol message (see the following code). In this case we've escaped all pointy brackets, but we could have included the whole message in a CDATA section. The benefit of this approach is that it's easy and works for any XML content. But you don't get any of the benefits of XML either. You can't validate, query, or transform the data directly and you can't reference pieces of it from other parts of the message.

<message id="msg-1">
  A message with an attached message that we can no longer refer to directly.
  <attachment id="attachment-1">
    <!-- Message included as text -->
    &lt;message id="id-9137"&gt;
      This is a textually included message.
    &lt;/message&gt;
  </attachment>
</message>
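
Both ways of carrying XML as opaque text can be sketched with the Python standard library. The attachment wrapper and the cdata helper are illustrative assumptions. The one subtlety is that the sequence ]]> cannot appear inside a CDATA section, so the helper splits it across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

inner = '<message id="msg-1">This is a textually included message.</message>'

# Option 1: escape the markup characters (&, <, >) so the content is plain text.
escaped = f'<attachment id="attachment-1">{escape(inner)}</attachment>'


def cdata(text):
    """Wrap text in a CDATA section, splitting any embedded ']]>'."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Option 2: wrap the content in a CDATA section instead.
wrapped = f'<attachment id="attachment-1">{cdata(inner)}</attachment>'

# Either way the receiver sees one text node and no child elements.
from_escaped = ET.fromstring(escaped)
from_cdata = ET.fromstring(wrapped)
```

To work with the attached message, the receiver has to run from_escaped.text through an XML parser a second time, which is exactly the loss of direct validation, querying, and referencing described above.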

Binary Data
So far we've covered encoding options for preexisting XML data. But what if you're not dealing with XML data? What if you want to transport binary data as part of your message instead? The commonly used solution is good old base-64 encoding (see Listing 1). On the positive side, base-64 data is easy to encode and decode, and the character set of base-64 encoded data is valid XML element content. On the negative side, base-64 encoding takes up about 33% more space than the pure binary representation, because every 3 bytes of input become 4 characters of output. If you need to move a lot of binary data and space/time efficiency is a concern, you might have to look for alternatives. More on this in a bit.
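
The size cost is easy to measure. This short Python sketch encodes a made-up 3,072-byte payload and compares lengths; the payload itself is arbitrary sample data.

```python
import base64

payload = bytes(range(256)) * 12      # 3,072 bytes of sample binary data
encoded = base64.b64encode(payload)   # ASCII letters, digits, '+', '/', '='
overhead = len(encoded) / len(payload) - 1

print(len(payload), len(encoded), f"{overhead:.1%}")  # prints: 3072 4096 33.3%

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == payload
```

Inputs whose length isn't a multiple of 3 are padded with =, so small payloads fare slightly worse than one third.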

You may want to consider using base-64 encoding even when you want to move some plain text as part of a message because XML's document-centric SGML origin led to several awkward restrictions on the textual content of XML instances. For example, an XML document can't include any control characters (ASCII codes 0-31) except tabs, carriage returns, and line feeds. This covers both the straight occurrences of the characters and their encoded form as character references (e.g., &#x04;). (This caused me a lot of pain when I was creating WDDX; I still haven't gotten over it.) Further, carriage returns are always converted to line feeds by XML processors. It's important to keep in mind that not all characters you can put in a string variable in a programming language can be represented in XML documents.
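
One way to act on this is to test a string before serializing it and fall back to base-64 only when it contains something XML 1.0 can't carry. The sketch below is an assumption-laden illustration: encode_text and its "plain"/"base64" labels are invented for the example, and the check covers only the control characters discussed above plus U+FFFE and U+FFFF, not every restriction in the XML specification.

```python
import base64
import re

# C0 control characters other than tab (0x09), line feed (0x0A), and
# carriage return (0x0D), plus the two noncharacters XML 1.0 also excludes.
_NOT_XML_SAFE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def encode_text(value):
    """Return (encoding, text), where encoding is 'plain' or 'base64'."""
    if _NOT_XML_SAFE.search(value):
        data = value.encode("utf-8")
        return "base64", base64.b64encode(data).decode("ascii")
    return "plain", value


def decode_text(encoding, text):
    """Invert encode_text."""
    if encoding == "base64":
        return base64.b64decode(text).decode("utf-8")
    return text


kind, text = encode_text("end of transmission: \x04")
```

A string containing a bare carriage return passes this check but still won't survive a round trip unchanged, because of the line-feed conversion mentioned above.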

Abstract Data Models
If you're not dealing with plain text, XML, or binary data, you probably have some form of structured data represented via an abstract data model. (Both the SOAP specification and the XML Protocol materials use the term nonsyntactic to mean abstract; don't let this nondescript use of language scare you.) Usually abstract data models are ultimately instantiated as programming language data structures. A commonly used abstract data model is the directed labeled graph (DLG). A DLG consists of named nodes and directed named edges that connect source nodes with destination nodes. A node may have more than one edge with the same name. Nodes can have any number of useful properties - such as type - that don't fundamentally change the data model as they themselves can be expressed via nodes and edges.

All programming language and database data structures can be expressed as DLGs. Therefore, if we have a good way to represent DLGs in XML, we have a generic mechanism for handling abstract data models. We need three things:

  1. Given metadata about an abstract data model, we should have a way to map the model to a DLG model and construct an XML schema from it.
  2. Given an instance graph of the data model, we can generate XML that conforms to the schema. This is the serialization operation.
  3. Given XML that conforms to the schema, we can create an instance graph that conforms to the abstract data model's schema. This is the deserialization operation. Further, if we follow serialization by deserialization, we should obtain an identical instance graph to the one we started with.
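
To make the serialize/deserialize round trip concrete, here is a minimal Python sketch of one possible DLG encoding. It is not the SOAP encoding and not XMI: the graph/node/edge vocabulary, the Node class, and the n0, n1, ... identifiers are all invented for illustration. Every node is written once with an id, and every edge refers to its target by href, so shared nodes and cycles survive the trip.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    value: str = ""
    edges: list = field(default_factory=list)  # list of (label, Node) pairs


def serialize(root):
    """Write every reachable node once; edges point at targets via href."""
    ids, order, stack = {}, [], [root]
    while stack:
        node = stack.pop()
        if id(node) in ids:
            continue
        ids[id(node)] = f"n{len(order)}"
        order.append(node)
        stack.extend(target for _, target in reversed(node.edges))
    graph = ET.Element("graph", root="#n0")
    for node in order:
        el = ET.SubElement(graph, "node", id=ids[id(node)], value=node.value)
        for label, target in node.edges:
            ET.SubElement(el, "edge", name=label, href="#" + ids[id(target)])
    return ET.tostring(graph, encoding="unicode")


def deserialize(xml_text):
    """Rebuild the node objects first, then wire up the edges."""
    graph = ET.fromstring(xml_text)
    nodes = {el.get("id"): Node(el.get("value")) for el in graph.findall("node")}
    for el in graph.findall("node"):
        for edge in el.findall("edge"):
            target = nodes[edge.get("href")[1:]]
            nodes[el.get("id")].edges.append((edge.get("name"), target))
    return nodes[graph.get("root")[1:]]


# Sample graph: one node is referenced twice, and it points back (a cycle).
comment = Node("The one true XML guru.")
person = Node("XML Guru", [("comment", comment), ("favoriteQuote", comment)])
comment.edges.append(("about", person))

xml_text = serialize(person)
copy = deserialize(xml_text)
```

Serializing copy again yields the same XML, which is the round-trip property the third requirement asks for.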

As with many things in the XML industry, several specifications address this space. XMI, described in the XML-J article "UML, MOF, and XMI" (Vol. 1, issue 3), offers one mechanism. SOAP defines its own set of encoding rules that are fairly detailed and rather complex. In fact, they take up about 50% of the volume of the specification. The other 50% covers the envelope framework, header/body structure, extensibility mechanisms, intermediaries, error handling, RPC conventions, and HTTP bindings. We won't go into the details; there are too many of them. Suffice it to say that in many cases you'll never have to worry about the mechanics of the serialization/deserialization processes. The following code gives you a taste of how the instance data looks, while Listing 2 shows you a possible schema for the data. The instance data markup can appear inside both the headers and the body of a SOAP message.

<name>XML Guru</name>
<comment href="#comment-1"/>
<contactNumbers SOAP-ENC:arrayType="x:phoneNumber[2]">
  <phoneNumber>617.555.1212</phoneNumber>
  <phoneNumber>415.555.1212</phoneNumber>
</contactNumbers>
<x:comment id="comment-1" xsi:type="SOAP-ENC:string"> The one true XML guru. </x:comment>

As you can see, a lot is going on here. First, it's clear that the SOAP encoding model depends heavily on XML Schema. ID/IDREF attributes are used to handle multiple references to the same piece of data. The xsi:type attribute can be used to provide type information to the XML processor in the absence of a schema. For some types, notably sequences/arrays, you need to subclass predefined data types. In addition, array content information (SOAP-ENC:arrayType) must be stored in the instance data; pity the array structure syntax is not XML.
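
Because the arrayType value is a small language of its own, a consumer has to parse the string by hand. The Python sketch below handles the common shapes (a qualified name, optional rank brackets, and a size list), which is my reading of the SOAP 1.1 syntax rather than a complete implementation. parse_array_type and the keys of its result are names made up for the example.

```python
import re

# qualified name, optional ranks such as "[]" or "[,]", then a size "[2]" or "[2,3]"
_ARRAY_TYPE = re.compile(
    r"^(?P<type>[^\[\]]+?)(?P<rank>(?:\[,*\])*)\[(?P<size>[\d,]*)\]$"
)


def parse_array_type(value):
    """Split a SOAP-ENC:arrayType value into its parts."""
    match = _ARRAY_TYPE.match(value.strip())
    if match is None:
        raise ValueError(f"not an arrayType value: {value!r}")
    prefix, _, local = match["type"].rpartition(":")
    dimensions = [int(d) for d in match["size"].split(",") if d]
    return {
        "prefix": prefix,          # still has to be resolved to a namespace URI
        "local": local,
        "rank": match["rank"],     # non-empty for arrays of arrays
        "dimensions": dimensions,  # empty when the size is left unspecified
    }


phone_numbers = parse_array_type("x:phoneNumber[2]")
```

Note that the prefix is only meaningful against the namespace declarations in scope where the attribute appears, which is one more thing a plain string value pushes onto the application.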

Pretty much any data can be encoded; there are no limits on the types of objects that can be represented. The schema fragment could have been autogenerated by introspecting some Java classes, for example. There are also ways to encode data without having to worry about the schema at all, using self-describing element names.

Linking Data
So far we've only considered scenarios in which the encoded data is part of the XML document describing a protocol message. This may create some problems for including preexisting XML content and waste space in the case of base-64 encoded binary objects. The alternative would be keeping the data outside the message and somehow bringing it in at the right time.

There are two general mechanisms for doing this. The first comes straight out of XML 1.0: external entity references, which allow content external to an XML document to be brought in during processing. The second uses explicit link elements that comply with the XLink specification; many people in the industry prefer pure markup approaches and therefore favor it. Both methods could work, and both require extensions to the existing XML protocol toolsets.

Of course, there are purely application-based methods for linking. You could pass a URI known to mean "get the actual content here." However, this approach doesn't scale to generic data-encoding mechanisms because it requires application-level knowledge.

External content can be kept on a separate server to be delivered on demand. It can also be packaged together with the protocol message in a MIME envelope. In this case the links to it should probably use the MIME unique-content IDs (CIDs) for identification purposes. Traditionally, SOAP has steered clear of anything having to do with MIME. On the other hand, the ebXML Transport/Routing and Packaging working group is looking very seriously at multipart MIME messages. This historic difference is understandable when we consider that SOAP grew out of RPC work and the ebXML folks are focused on business messaging where, for example, an auto insurance claim might carry along several accident pictures. MIME offers a mechanism to combine the XML protocol message with the external content in a single package.
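
The MIME packaging idea can be tried out with Python's standard email package. Everything specific here is an assumption for illustration: the claim/photo vocabulary, the content ID, and the placeholder bytes standing in for a JPEG. The message refers to the picture with a cid: URL, and the receiver resolves it against the Content-ID headers of the other parts.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

CID = "claim-photo-1@example.com"  # hypothetical content ID
xml_message = f'<claim><photo href="cid:{CID}"/></claim>'
image_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # placeholder, not a real JPEG


def package(xml_text, image, cid):
    """Bundle the XML message and one binary part in a multipart/related body."""
    root = MIMEMultipart("related", type="text/xml")
    root.attach(MIMEText(xml_text, "xml", "utf-8"))
    image_part = MIMEImage(image, "jpeg")
    image_part.add_header("Content-ID", f"<{cid}>")
    root.attach(image_part)
    return root.as_bytes()


wire = package(xml_message, image_bytes, CID)

# Receiving side: index the parts by Content-ID and resolve the link.
parsed = message_from_bytes(wire)
parts = {p["Content-ID"]: p for p in parsed.walk() if p["Content-ID"]}
photo = parts[f"<{CID}>"].get_payload(decode=True)
```

With the default transfer encoding the image part is still base-64 on the wire; the transport saving comes only with bindings that can carry the part as raw binary.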

Choose Wisely
There are many ways to encode data in XML, and well-designed XML protocols will let you plug in any encoding style you choose. How should you make this important decision? First, of course, keep it simple. If possible, choose standards-based and well-deployed technology. Then consider your needs and match them against some of the important facets of XML data encoding:

  • Time efficiency: how fast can you serialize/deserialize data? This becomes particularly important in transaction-intensive systems. In some cases, if you know certain things about your data, you can use much higher performance encoding/decoding modules. For example, WDDX doesn't support directed labeled graphs; it only supports tree structures. However, because of this simplification, serialization and deserialization can fly.
  • Memory efficiency: how much memory do you need during serialization/deserialization? You may not care about this on an application server with 2GB of RAM, but do you expect handheld devices to be able to make requests to your server? In general this is a bigger problem during deserialization. DOM-based deserializers are the biggest offenders because they need to instantiate so many objects in memory. SAX-based deserializers can do a much better job. High-performance XML protocol frameworks, such as the Apache SOAP Project, are developing innovative approaches to combine the speed of SAX with the ease of access that the DOM provides.
  • Transport efficiency: how do the sizes of the generated XML compare between encoding styles? Packing multimegabyte JPEGs as base-64 strings inside XML documents may not be the best way to use bits on the wire. Explore external linking mechanisms when bandwidth is of concern. Also, consider protocol bindings that allow for compression.
  • Flexibility: can you encode abstract data models? Is there a limit on the types of data that can be represented? Can you link external content? Is the encoding format introspectable, that is, can someone do something meaningful with the data without having previously looked at its schema? This is important for service-discovery-type applications.
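
To illustrate the memory-efficiency point, here is a small SAX-based deserializer in Python. It keeps only the values it is asked for instead of building a tree of the whole document. The handler class, the function name, and the choice of phoneNumber as the element of interest are assumptions for the example, not code from any framework mentioned above.

```python
import io
import xml.sax


class PhoneNumberHandler(xml.sax.ContentHandler):
    """Collect the text of each phoneNumber element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.numbers = []
        self._buffer = None  # a list while inside a phoneNumber element

    def startElement(self, name, attrs):
        if name == "phoneNumber":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "phoneNumber":
            self.numbers.append("".join(self._buffer))
            self._buffer = None


def extract_phone_numbers(stream):
    handler = PhoneNumberHandler()
    xml.sax.parse(stream, handler)
    return handler.numbers


DOC = (
    b"<contactNumbers>"
    b"<phoneNumber>617.555.1212</phoneNumber>"
    b"<phoneNumber>415.555.1212</phoneNumber>"
    b"</contactNumbers>"
)
numbers = extract_phone_numbers(io.BytesIO(DOC))
```

The trade-off is visible even at this size: the handler has to track where it is in the document by hand, which is the ease of access a DOM gives you for free.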

This space is evolving quite rapidly and the pending release of the XML Schema specification will add fuel to the fires of innovation. Fasten your seatbelts - there's little standardization in this space right now and there will be some turmoil before we emerge with sensible ways to approach the common data-encoding scenarios described here.

Although there's lots more ground to cover on this subject, I think I should move on quickly to try to stay on top of innovation in the XML protocol space. In the next XML in Transit column I'll take a look at the Web Services Description Language (WSDL), another hallmark joint effort by Microsoft and IBM. Keep it coming, guys.

More Stories By Simeon Simeonov

Simeon Simeonov is CEO of FastIgnite, where he invests in and advises startups. He was chief architect or CTO at companies such as Allaire, Macromedia, Better Advertising and Thing Labs. He blogs at blog.simeonov.com, tweets as @simeons and lives in the Greater Boston area with his wife, son and an adopted dog named Tye.
