Process SOAP with VTD-XML

Discover the benefits

SOAP is an XML-based data protocol standardized by the W3C to enable inter-application data exchange over the Internet. In a typical Web Services scenario, a SOAP message delivered via HTTP must be parsed before anything else can happen. The two popular SOAP processing methods, DOM and SAX/Pull, force application developers to choose between performance and memory efficiency on one hand and ease of use on the other. VTD-XML is the latest open-source, "non-extractive" XML processing API written in Java that overcomes many problems of the status quo. The combination of its high performance, low memory footprint, random access, incremental update, and inherent persistence means this: with VTD-XML, application developers can finally unleash the full power of SOAP.

To many application developers, Web Services are usually synonymous with SOAP over HTTP. While HTTP (Hypertext Transfer Protocol) has been around for over a decade, the real excitement of Web Services lies in the use of SOAP (Simple Object Access Protocol). Effectively an application of XML, SOAP possesses some unique attributes that set Web Services apart from previous distributed computing technologies such as DCOM and CORBA. For one, SOAP is open and human-readable, which makes programming with SOAP simpler and easier to grasp. Equally important, the SOAP representation of data is loosely encoded. Applications communicating using SOAP are no longer restrained by the rigidity of a schema, making possible a true decoupling of application logic from the wire format of data.

Current SOAP Processing Overview
Due to its textual nature, a SOAP message must be parsed into machine-readable form before it can be understood by software applications. There are two types of SOAP processing models widely in use today:

  • DOM (Document Object Model) is a tree-based XML processing API specification. Because DOM creates an in-memory data structure that precisely models the data represented in XML and allows random access, it is generally considered an easy and natural way of working with XML. But building a DOM tree consumes 5 to 10 times the memory of the XML itself and incurs a non-trivial processing cost, making DOM ill-suited for most high-performance XML applications.

  • SAX and Pull were created specifically to tackle the memory and processing inefficiency of DOM: both expose low-level tokenizer interfaces and, by default, never keep the entire document in memory. As a result, SAX/Pull-based XML processing incurs less memory overhead and can potentially process very large XML files. Unfortunately, they are also more difficult to use than DOM for precisely the same reason. Unless users build their own custom object model, SAX/Pull don't offer random access and force users to scan the XML document multiple times, making performance improvements over DOM insignificant. What's more, SAX/Pull programming interweaves application logic with XML processing, resulting in awkward, bulky application code that is hard to maintain.

    So in a way, with current XML processing methods, it is difficult to get both high processing/memory efficiency and ease of use. But there is more to think about.

    Right now, parsing SOAP messages, whether the application uses DOM or SAX, is pretty much inevitable, even when the same messages are parsed over and over. Wouldn't it be nice if there were a pre-parsed form of XML that could be reused directly, without the overhead of parsing every time?

    Also consider modifying the text content of the following XML file.

    <color> red </color>

    Using DOM, it would require at least the following three steps: build the DOM tree, navigate to and then update the text node, and write the updated structure back into XML. So no matter how trivial the modification is, there is a round-trip penalty of parsing the document and writing it back out. What if the change is only a snippet buried within a big document? Wouldn't it be nice to be able to surgically remove the old content and insert the update "in place"?
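
    The three DOM steps can be sketched with the XML APIs that ship with the JDK. This is only an illustration of the round trip; the class and method names (DomRoundTrip, replaceRootText) are made up for this example, and the replacement text " blue " is an arbitrary choice.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only to illustrate the DOM round trip.
class DomRoundTrip {
    /** Replaces the text of the root element, paying the full parse-and-serialize cost. */
    static String replaceRootText(String xml, String newText) {
        try {
            // Step 1: build the DOM tree from the serialized document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: navigate to the element and update its text content.
            doc.getDocumentElement().setTextContent(newText);

            // Step 3: write the whole structure back out as XML.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceRootText("<color> red </color>", " blue "));
    }
}
```

    Even for a one-word change, every node of the document is parsed into objects and serialized again, which is the round-trip penalty described above.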

    VTD: A Simple Solution
    Historically, the first step of text processing is usually to tokenize the input file into many little null-terminated strings. But there is another way to tokenize. Rather than extracting the token content out of the input, one can instead keep the original document intact in memory and use offsets and lengths to describe tokens. In other words, tokenization can be done "non-extractively." We can look at how this "non-extractive" tokenization approach works in practice and compare it with the traditional "extractive" view of tokens in the context of some common usage scenarios.

    1. String comparison- Under the traditional text-processing framework, C's "strcmp" function (in <string.h>) compares an "extractive" token against a known string. In our new "non-extractive" approach, one can simply use C's "strncmp" function (also in <string.h>), passing the token's starting address and length.
    2. String to numerical data conversion- C's "atof" and "atoi" convert strings into numerical data types. One can introduce new functions or macros to convert "non-extractive" tokens into integers or floats. For example, the new "atof_ne" would have to take three inputs: the character pointer, the starting offset, and the length. Notice that the character pointer points at the memory buffer in which the entire document resides.
    3. Trim- To remove leading and trailing white spaces of "non-extractive" tokens, we only need to re-compute the offsets and lengths based on their older values. To do the same thing to extractive tokens usually involves creating new tokens.
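
    The three scenarios above can be sketched in Java, the language of VTD-XML itself. This is a minimal illustration rather than VTD-XML's own code: the class NonExtractiveToken and its methods are hypothetical, tokens are assumed to be ASCII, and parseInt handles only non-negative decimal integers without overflow checking.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: every operation works on (buffer, offset, length)
// and never copies the token out of the document buffer.
class NonExtractiveToken {
    /** Compares the token buf[offset, offset+length) with an ASCII string. */
    static boolean matches(byte[] buf, int offset, int length, String expected) {
        if (expected.length() != length) return false;
        for (int i = 0; i < length; i++) {
            if (buf[offset + i] != (byte) expected.charAt(i)) return false;
        }
        return true;
    }

    /** Parses a non-negative decimal integer directly from the buffer. */
    static int parseInt(byte[] buf, int offset, int length) {
        if (length == 0) throw new NumberFormatException("empty token");
        int value = 0;
        for (int i = offset; i < offset + length; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) throw new NumberFormatException("not a digit at " + i);
            value = value * 10 + digit;
        }
        return value;
    }

    /** Trims ASCII white space by recomputing the bounds; returns {newOffset, newLength}. */
    static int[] trim(byte[] buf, int offset, int length) {
        int start = offset, end = offset + length;
        while (start < end && isSpace(buf[start])) start++;
        while (end > start && isSpace(buf[end - 1])) end--;
        return new int[] { start, end - start };
    }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    public static void main(String[] args) {
        byte[] doc = "<qty> 42 </qty>".getBytes(StandardCharsets.US_ASCII);
        int[] t = trim(doc, 5, 4);   // the text " 42 " starts at offset 5
        System.out.println(matches(doc, t[0], t[1], "42"));
        System.out.println(parseInt(doc, t[0], t[1]));
    }
}
```

    Note that trim allocates no new token content; it only returns a new offset and length into the same buffer.
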
    How to store offsets and lengths is the next question to think about. The handy way is to store them as member variables of objects. In a way, objects are nothing more than small memory blocks filled with bits, also known as member variables. But in the strictest sense, small memory blocks filled with bits aren't necessarily objects. Consider a MIPS instruction, which uses 32 bits to encode both op-code and operands. Likewise, segment descriptors in the x86 architecture encode many parameters in 64 bits.

    These considerations have led to the design of a "non-extractive" token encoding specification called Virtual Token Descriptor (VTD). A VTD record is a 64-bit integer that encodes the length, the starting offset, the token type, and the nesting depth of a token in XML. For certain types of tokens, the length field further encodes the prefix length and the qualified name length, since both share the same offset.
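
    To make the idea of a 64-bit record concrete, here is a minimal sketch of packing four fields into one long. The field widths and bit positions below (4-bit type, 8-bit depth, 20-bit length, 32-bit offset) are assumptions chosen for illustration; they are not taken from the VTD specification, and the class TokenRecord is hypothetical.

```java
// Hypothetical record layout, for illustration only:
//   bits 60-63: token type   (4 bits)
//   bits 52-59: nesting depth (8 bits)
//   bits 32-51: token length (20 bits)
//   bits  0-31: starting offset (assumed non-negative)
class TokenRecord {
    static long pack(int type, int depth, int length, int offset) {
        return ((long) (type & 0xF) << 60)
             | ((long) (depth & 0xFF) << 52)
             | ((long) (length & 0xFFFFF) << 32)
             | (offset & 0xFFFFFFFFL);
    }

    static int type(long r)   { return (int) (r >>> 60) & 0xF; }
    static int depth(long r)  { return (int) (r >>> 52) & 0xFF; }
    static int length(long r) { return (int) (r >>> 32) & 0xFFFFF; }
    static int offset(long r) { return (int) r; }

    public static void main(String[] args) {
        long r = pack(2, 3, 5, 1234);
        System.out.println(type(r) + " " + depth(r) + " " + length(r) + " " + offset(r));
    }
}
```

    Because each record is a primitive long rather than an object, many records can sit side by side in a single long[] array.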

    One immediate benefit of VTD's non-extractive tokenization is that, because the document is kept intact, VTD allows applications to surgically insert and remove XML content in much the same way as manipulating a byte array. For example, removing or changing an attribute value or text content amounts to skipping the segment, marked by its offset and length, that contains the "unwanted" text. VTD also makes it possible to remove an entire element by simply skipping it according to its offset and length.
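
    A byte-level sketch of this kind of edit follows. It is not VTD-XML's update API; the class InPlaceEdit and the method replaceSegment are hypothetical, and the offset and length are written out by hand where a real application would read them from a token record. The untouched bytes are copied verbatim, so nothing is re-parsed or re-serialized.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating an edit driven purely by offset and length.
class InPlaceEdit {
    /** Returns a copy of doc in which the segment [offset, offset+length) is replaced. */
    static byte[] replaceSegment(byte[] doc, int offset, int length, byte[] replacement) {
        byte[] out = new byte[doc.length - length + replacement.length];
        // Bytes before the token are copied unchanged.
        System.arraycopy(doc, 0, out, 0, offset);
        // The new content takes the place of the old token.
        System.arraycopy(replacement, 0, out, offset, replacement.length);
        // Bytes after the token are copied unchanged.
        System.arraycopy(doc, offset + length, out, offset + replacement.length,
                         doc.length - offset - length);
        return out;
    }

    public static void main(String[] args) {
        byte[] doc = "<color> red </color>".getBytes(StandardCharsets.US_ASCII);
        // The text " red " starts at offset 7 and is 5 bytes long.
        byte[] updated = replaceSegment(doc, 7, 5, " blue ".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(updated, StandardCharsets.US_ASCII));
    }
}
```

    Passing an empty replacement removes the segment altogether, which is the "skipping" described above.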

    Introduce VTD-XML
    Built on the concept of VTD, VTD-XML is the latest open-source, "non-extractive," Java-based XML processing API ideally suited for SOAP processing. Currently it supports only the five built-in entities (&amp; &lt; &gt; &apos; &quot;). The latest VTD-XML is version 0.8, which can be downloaded from http://vtd-xml.sf.net. Aside from keeping the XML file intact in memory and exclusively using VTD to describe tokens, VTD-XML also introduces the concept of location caches that provide efficient random access. Unlike DOM, VTD-XML's notion of hierarchy consists exclusively of elements, which essentially correspond to VTD records for starting tags. Resembling the index section of a book, location caches again make extensive use of 64-bit integers. The project web site (http://vtd-xml.sf.net) has an in-depth description of how VTD-XML achieves random access with location caches.

    VTD-XML should exhibit the following characteristics when used in a Web Services project. First, it parses SOAP messages at a performance level equal to, if not better than, that of SAX with a null content handler. On a 1.5 GHz Athlon processor, VTD-XML processes SOAP messages at around 25 to 35 MB/sec. Second, unlike SAX, VTD-XML offers full random access because the entire parsed XML is resident in memory. Furthermore, if you are one of the developers who find DOM's node-based API verbose and difficult to use, you should find VTD-XML's API clean and easy to comprehend. VTD-XML's memory requirement is about 1.3x to 1.5x the size of the XML, where 1x is the document itself, which is part of VTD-XML's internal representation. Finally, incremental, dynamic updates to the XML content are much more efficient than with either DOM or SAX.

    Why does VTD-XML consume less memory than DOM? In many VM-based object-oriented programming languages, per-object allocation incurs a small amount of memory overhead. VTD records are immune to this overhead because they are not objects. Also, VTD records are constant in length and can be stored in large memory blocks, which are more efficient to allocate and garbage collect. For example, by allocating a large array for 4096 VTD records, one incurs the per-array overhead (16 bytes in JDK 1.4) only once across 4096 records, thus reducing per-record overhead to very little.
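
    The arithmetic behind that example is worked out below. The 16-byte array overhead is the JDK 1.4 figure quoted above and will differ on other JVMs; the class RecordOverhead and its method names are hypothetical.

```java
// Hypothetical helper working through the per-record overhead arithmetic.
class RecordOverhead {
    static final int ARRAY_HEADER_BYTES = 16; // per-array overhead quoted for JDK 1.4
    static final int RECORD_BYTES = 8;        // one 64-bit VTD record

    /** Array overhead amortized over the records stored in one array. */
    static double overheadPerRecord(int recordsPerArray) {
        return (double) ARRAY_HEADER_BYTES / recordsPerArray;
    }

    /** Total bytes for one array holding the given number of records. */
    static long totalBytes(int recordsPerArray) {
        return ARRAY_HEADER_BYTES + (long) RECORD_BYTES * recordsPerArray;
    }

    public static void main(String[] args) {
        // 16 / 4096 = 0.00390625 bytes of overhead per record
        System.out.println(overheadPerRecord(4096));
        // 16 + 8 * 4096 = 32784 bytes in total
        System.out.println(totalBytes(4096));
    }
}
```

    By contrast, giving each token its own object would add a per-object overhead to every one of the 4096 tokens instead of paying the array overhead once.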

    More importantly, VTD's efficient memory usage has strong implications for its performance. DOM is slow in large part because it is resource intensive. The spirit of VTD is this: one simply doesn't have to, and has every incentive not to, create string objects, because they are slow to create. Even worse, they eventually need to be garbage collected. VTD-XML is able to reach SAX's performance level because VTD uses far less memory than DOM, which saves on both object creation and garbage collection.

    At the top level, VTD-XML provides three essential classes: VTDGen, VTDNav, and AutoPilot.

    • VTDGen parses the XML/SOAP messages into VTD records and location caches.
    • VTDNav is a cursor-based API allowing for DOM-like random access of the XML structure.
    • AutoPilot works with VTDNav and emulates the behavior of DOM's node iterator.
    The rest of this article demonstrates how to use VTD-XML to process a sample SOAP message.

    A Sample Project
    To process SOAP with VTD-XML, the starting point is a memory buffer filled with the content of the XML/SOAP message. The sample message contains a purchase order (shown below) in the body section of the SOAP envelope. For simplicity, the project assumes the message resides on disk. In real life, one is more likely to read the message off HTTP. (See Listing 1.)

    At the top level, this project has a single main method (shown below) that wraps all code in a single try-catch block, which takes care of the various exception conditions for IO operations, parsing, and navigation. (See Listing 2.)

    The following code parses the SOAP message. It first allocates a byte array and reads into it the byte content of the SOAP message. Then it instantiates VTDGen and passes it the byte array. Next, it calls VTDGen's member method "parse()" to generate the internal, parsed representation of the SOAP message. Notice that "parse()" accepts the boolean value "true" to indicate that parsing is namespace-aware. (See Listing 3.)

    After parsing, the sample code obtains an instance of VTDNav and uses the namespace-aware "toElementNS()" to move the cursor to various positions in the element hierarchy, then prints out the corresponding text values or selectively pulls out the XML fragment at the cursor position. (See Listing 4.)

    The code above concerning VTDNav has several points worth mentioning.

    1. There is one and only one cursor available, which can be moved using "toElement()" or "toElementNS()." Those methods return a boolean indicating the status of the movement. If true, the cursor is repositioned; otherwise, the cursor does not move.
    2. Several member methods, such as "getAttrVal()" and "getText()", return an integer corresponding to the index value of the VTD record if there is one. -1 is returned if no such record is found.
    3. VTDNav performs string to VTD record comparison directly, avoiding the round trip of creating and deallocating string objects.
    4. VTDNav also converts VTD records to numerical data types directly, for the same purpose.
    5. There is a global stack available so one can save, then quickly restore, the cursor location.
    6. VTDNav also allows one to convert a VTD record into a string object. Use this carefully, for the reasons given in point 3.
    The final part of the project composes an invoice for the purchase order (shown below with changes in bold). Because the invoice looks quite similar to the PO, VTD-XML can compose it by cutting and pasting XML. (See Listing 5.)

    The code that composes the invoice is shown in Listing 6.

    The Road Map and a Quick Recap
    The other property of VTD-XML is that its internal representation is inherently persistent, making it possible to avoid parsing for repetitive read-only XML processing. This also makes possible an XML upgrade path that improves XML processing performance without losing human readability.

    As readers can see, VTD-XML, the new, non-extractive, Java-based XML processing API based on VTD, offers a number of benefits not found in existing XML processing APIs. The most significant one is that it simultaneously offers high performance, low memory usage, and user-friendliness. It also introduces the notion of incremental update. As XML makes inroads into IT and becomes increasingly indispensable in our lives, VTD-XML should find its way into more places and hopefully enable exciting new XML applications.
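
    One way to picture this persistence: because records are plain 64-bit integers holding offsets rather than object references, an array of them can be written out and read back without touching the XML parser. The sketch below is only an illustration of that idea, not VTD-XML's actual storage format; the class RecordStore and its length-prefixed layout are assumptions.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: saves and reloads an array of 64-bit records.
class RecordStore {
    /** Serializes the records as a 4-byte count followed by 8 bytes per record. */
    static byte[] save(long[] records) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + records.length * Long.BYTES);
        buf.putInt(records.length);
        for (long r : records) buf.putLong(r);
        return buf.array();
    }

    /** Rebuilds the record array from the bytes produced by save(). */
    static long[] load(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] records = new long[buf.getInt()];
        for (int i = 0; i < records.length; i++) records[i] = buf.getLong();
        return records;
    }

    public static void main(String[] args) {
        long[] reloaded = load(save(new long[] { 1L, 42L, Long.MAX_VALUE }));
        System.out.println(reloaded.length + " records reloaded without re-parsing");
    }
}
```

    Stored next to the unchanged XML text, such records would let a later read-only pass skip parsing while the document itself stays human readable.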

  • More Stories By Jimmy Zhang

    Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.
