Structuring Documents With XML

This month's tutorial, the second in a series, picks up where last month's left off - on the path toward publishing your résumé on the Internet as an XML document. Last month (XML-J, Vol. 2, issue 5) I presented an overview of XML, described its basic building blocks, and demonstrated how to create a simple XML document.

This month, after reviewing XML's fundamental components, I'll guide you through the process of marking up a résumé with XML. In doing so the column touches on the fundamentals of structuring and marking up data as well as some of the concepts - such as hierarchical trees, nodes, and parent-child relationships - that underlie XML documents.

My objective is to help you learn how to structure a document using XML. Toward that end I'll compare three approaches to tagging based on presentation, structure, and content, thus laying the groundwork for developing a tagging strategy for résumés. Next we'll turn to a quick discussion of hierarchical trees that will supply the terminology needed to address the abstract structure of XML documents. Then we'll review last month's hands-on work and expand on it while marking up a résumé in XML. Throughout the column I'll introduce you to several new language constructs, building on last month's tutorial.

XML's Building Blocks
First we'll review the XML fundamentals covered last month peppered with a few new constructs.

Remember: a simple XML document should begin with an XML declaration and must contain one or more elements, all encased in angle brackets. Processing instructions start with <? and end with ?>. The XML declaration is written the same way, although, strictly speaking, the specification does not classify it as a processing instruction: <?xml version="1.0" standalone="yes"?>. The standalone attribute, which is optional and takes yes or no as its value, specifies whether the document depends on external markup declarations such as an external DTD. If the value is yes, an external DTD is not required.

The XML declaration on the first line of your document may also name the character encoding used, and it's generally a good idea to include it. By default XML processors assume the UTF-8 encoding of Unicode. But you may use the encoding keyword to declare that encoding or another one, as in the following example:

<?xml version="1.0" encoding="UTF-16"
standalone="yes"?>

All the gritty details about the available character sets, including which to use when, can be found in XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means (O'Reilly). Additional technical details about character encoding can be found at www.w3.org/TR/REC-xml#charencoding.
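As a rough illustration of how the declared encoding and the actual bytes of a document work together, here is a minimal sketch. It assumes Python and its standard xml.etree.ElementTree module, neither of which belongs to this column's toolset; the decode_title helper and the title element are hypothetical names introduced only for the example.

```python
import xml.etree.ElementTree as ET


def decode_title(declared_encoding, codec):
    """Build a tiny document, encode it to bytes, and parse it back."""
    text = (
        f'<?xml version="1.0" encoding="{declared_encoding}" standalone="yes"?>'
        "<resume><title>Résumé</title></resume>"
    )
    # The bytes handed to the parser must really use the encoding
    # that the declaration names.
    raw_bytes = text.encode(codec)
    return ET.fromstring(raw_bytes).findtext("title")


utf8_title = decode_title("UTF-8", "utf-8")
latin1_title = decode_title("ISO-8859-1", "iso-8859-1")
```

Both calls should recover the same accented text, because in each case the declaration matches the bytes the parser receives.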

XML documents also typically include a document type declaration that begins with <!DOCTYPE and performs two main functions:

  1. References a document type definition, or DTD
  2. Identifies the document's root element
The document type declaration may also contain an internal DTD subset, which we'll talk about in a later column.

In the following declaration the root element is the word after <!DOCTYPE:

<!DOCTYPE resume>
The document type declaration, however, is not required for the document to be well formed. A well-formed document is one that adheres to the rules of XML syntax. A valid XML document, in contrast, is one that conforms to the constraints of a DTD.

Even though the root element has been specified in the declaration, it must still appear as the document's first element:

<resume>
Besides a root element, XML documents typically contain a hierarchy of nested elements. However, there are a few restrictions on the characters that may be used in element names, especially as their first symbol. In particular, element names must begin with an underscore or a letter in either upper- or lowercase, but never a number. The tag <2001Resume> isn't permitted. After starting with a letter or underscore, a name may contain numbers as well as other letters, hyphens, underscores, and periods.

Colons are reserved for namespaces, which will be addressed in a later column, so avoid them unless you're specifying one. Names that begin with the letter combination xml, in any variation of upper- and lowercase, are likewise reserved. And don't forget that the names used in your opening and closing tags must be exactly the same.
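The naming rules above can be approximated with a short check. This is only a sketch, assuming Python and restricting itself to ASCII characters, whereas the XML specification admits many more letters; is_safe_element_name is a hypothetical helper, and it conservatively rejects colons altogether.

```python
import re

# ASCII-only approximation: a letter or underscore first, then letters,
# digits, hyphens, underscores, or periods. Colons are rejected here.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_safe_element_name(name):
    """Return True if name follows the conservative rules described above."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" in any case are avoided
    return NAME_PATTERN.fullmatch(name) is not None
```

Under these assumptions, is_safe_element_name("resume2001") passes while is_safe_element_name("2001Resume") does not.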

For more information about legal and illegal characters in element and attribute names, see Robert Eckstein's XML Pocket Reference (O'Reilly). The technical details about valid XML characters are available at www.w3.org/TR/REC-xml#charsets.

All elements, including the root, may optionally take one or more attribute-value pairs. XML documents may also contain comments; they begin with <!-- and end with -->.

Remember, too, that an XML document must adhere to certain markup and syntax rules to be considered well formed. First, XML is case sensitive, and the name of an opening tag must match the name of its closing tag. Second, an empty element - one that contains no other elements or text - must have a closing tag that may be combined with the starting tag. Thus, an empty element can be marked up either as <phone></phone> or as <phone/>. Third, every nonempty element's opening tag must have a corresponding closing tag. If you open a nonempty element with <resume>, it must have a corresponding closing tag of </resume> that's properly nested, which brings us to our fourth rule - XML documents may not contain any overlapping tags. Whereas <h2><i>Headline</h2></i> might work in HTML, it won't in XML. Finally, all attribute values must be enclosed in either single or double quotation marks.
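One way to try these rules out is to hand small fragments to a parser and see which ones it rejects. The sketch below assumes Python's standard xml.etree.ElementTree module rather than the browser-based debugging used in this column; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Each fragment exercises one of the rules discussed above.
samples = {
    "<phone/>": is_well_formed("<phone/>"),
    "<phone></phone>": is_well_formed("<phone></phone>"),
    "<Resume></resume>": is_well_formed("<Resume></resume>"),
    "<resume>": is_well_formed("<resume>"),
    "<h2><i>Headline</h2></i>": is_well_formed("<h2><i>Headline</h2></i>"),
    "<address type=Home/>": is_well_formed("<address type=Home/>"),
}
```

Only the two forms of the empty phone element should come back as well formed; the case mismatch, the missing closing tag, the overlapping tags, and the unquoted attribute value should each be rejected.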

When we delve into the tutorial below and the review of last month's work, you'll see how the components and rules above are used to create XML documents.

If you need more information about what I just reviewed, you may want to spend a few minutes reading up on the basics of XML. If you don't have a copy of last month's XML-Journal handy, I suggest you read the following references: the first half of Chapter 1 in Brett McLaughlin's book, Java and XML (O'Reilly); and Chapter 1 plus pages 11-16 of Chapter 2 in XML in a Nutshell. More about constructing well-formed XML can be found in the XML Pocket Reference. Taken together these readings should bring you up to speed.

Tag Talk
Before marking up an isolated document with tags, there are several key markup-related decisions to make: (1) choosing a convention for tag names, (2) deciding what information to capture in attributes as opposed to child elements, and (3) choosing an approach to markup.

Besides the obvious - consistency - choosing a convention for tag names should be guided by the following criteria:

  • Ease of reading: One of the W3C's stated goals for XML documents is that they're legible to humans (as opposed to machines) and reasonably clear. Your tagging scheme should reinforce XML's self-documenting capacity and not undermine its legibility. (For more on the goals of XML see www.w3.org/TR/REC-xml#sec-origin-goals.)
  • Simplicity and ease of usability and re-creation: In general, the simpler your naming convention, the easier it'll be to apply. An easy-to-remember naming format will make writing stylesheets and DTDs easier, too.
  • Compatibility with XHTML
  • The potential for reuse with or incorporation into preexisting document type definitions (DTDs), XSL stylesheets, and tag lists
  • Ease of use of tag content, especially attribute values, in target output
Since XML doesn't restrict you to a particular case or format (other than those outlined above), you're free to choose. But in the face of the above criteria, the four main possibilities - lowercase, uppercase, initial caps, and mixed case - are not equal. Consider the following possibilities:
1. <ELEMENT>
2. <Element>
3. <element>
Compatibility with XHTML rules out option 1. In XHTML all HTML tags must be lowercase. Besides, anything written in all capital letters, even tag names, is hard to read. Option 2 is a bit easier to read; however, if you mix your tags with XHTML tags, which is useful to do at lower levels in the hierarchy of traditional documents such as software manuals, especially those destined for publication on the Web, you'll also find your tags becoming inconsistent: the tags you define begin with an uppercase letter while the XHTML tags begin with a lowercase letter.

Option 3 then seems to be the choice that would ensure the greatest consistency, especially if you're considering using XHTML in your XML markup. Using lowercase tag names also increases the potential for reusing preexisting tag lists, XSL stylesheets, and DTDs, as most XML programmers seem to prefer lowercase tags. For instance, UltraEdit, a text editor, comes with an XML tag list containing tags in lowercase.

Complex element names and the addition of attributes force us to make more decisions. Consider these tags:

4. <elementname property="hard to read">
5. <element_name property="Easy to Read">
6. <elementName property="Easy to Read">
Option 4 is difficult to read, ruling it out. Options 5 and 6 are equal in readability and the potential for reuse with existing DTDs and stylesheets. Some XML programmers use option 5 while others use 6. Others use a hyphen instead of the underscore in option 5. For its tag list UltraEdit uses option 6. Still others use option 4, as can be seen by viewing the XML markup behind the XML specification itself. It's an interesting case study in XML markup. Take a moment to study it (in Internet Explorer version 5.0 or greater, go to www.w3.org/TR/2000/REC-xml-20001006.xml).
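Whichever convention you settle on, consistency is easy to check mechanically. The following sketch assumes Python's standard library and, purely for illustration, takes option 6 (a lowercase first letter with interior capitals) as the chosen convention; tags_breaking_convention and the sample markup are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Option 6: a lowercase first letter, then letters or digits only.
LOWER_CAMEL = re.compile(r"[a-z][A-Za-z0-9]*")


def tags_breaking_convention(xml_text):
    """Return the sorted element names that do not follow option 6."""
    root = ET.fromstring(xml_text)
    names = {element.tag for element in root.iter()}
    return sorted(name for name in names if not LOWER_CAMEL.fullmatch(name))


sample = "<resume><contactInfo/><Address/><zip_code/></resume>"
offenders = tags_breaking_convention(sample)
```

Here the initial-cap name and the underscore name are reported, while resume and contactInfo pass.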

While elements are the principal means for structuring data, attributes are typically used to capture properties of elements, and their values further modify or set a value for the property, as this tag demonstrates: <desk color="blue">.

If you choose to set your element names in lowercase, it's best, I believe, to set attribute names in lowercase, too, fostering consistency. The case of attribute values, however, is a bit trickier. The deciding factor is how they'll be used. In narrative-oriented documents I often use attribute values to contain metacontent about elements: <section type="Introduction">. Besides making the value easier to read amid other coding, capitalizing it fosters its reuse as a headline when the document is output through an XSL stylesheet.

Before coding your document you'll also need to decide what information to capture in attributes as opposed to child elements. The approach I use for traditional documents is to capture metainformation but not content in attributes. XML in a Nutshell addresses the question of using attributes versus child elements in Chapter 2, "XML Fundamentals." This chapter reinforces and expands on the concepts discussed in this column.
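To make the distinction concrete, here is a small sketch of how a program might read content from an element and metainformation from an attribute. It assumes Python's standard xml.etree.ElementTree module; the degree element with a year attribute is used only as sample markup, and the honors attribute is a hypothetical name that deliberately does not exist in it.

```python
import xml.etree.ElementTree as ET

degree = ET.fromstring('<degree year="1997">MA</degree>')

content = degree.text            # the content itself lives in the element
year = degree.get("year")        # metainformation lives in the attribute
missing = degree.get("honors")   # an absent attribute simply yields None
```

Keeping the year in an attribute lets a stylesheet or program show or suppress it without touching the element's text.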

Markup Strategies
Choosing an approach to markup is another decision you should make before you begin. The three principal markup strategies are:

  1. Presentation-based tagging
  2. Structure- or publication-oriented tagging
  3. Content- or information-based tagging
The three approaches form a continuum, with structure-oriented tagging hugging the middle ground, as illustrated by these examples:
Presentation: <ital>damn</ital>
Structure: <emphasis>damn</emphasis>
Content: <expletive>damn</expletive>
The markup strategy behind HTML is based almost entirely on presentation. Tags such as <h1>, <i>, and <b> indicate how content should be presented through a browser. The motivation behind using XML, however, is that it allows you to separate content from presentation and to structure data based on meaning, resulting in data and documents that are easier to reuse, manipulate, and search. Using a presentation-oriented approach exclusively defies the purpose of XML. It's better to use either a structure- or content-based approach.
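The claim that meaning-based markup is easier to search can be sketched in a few lines. The example assumes Python's standard xml.etree.ElementTree module; the sample sentence and the p and title elements are hypothetical additions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Presentation-based: both words are merely italic.
presentation = ET.fromstring(
    "<p>Well, <ital>damn</ital>, I loved <ital>Hamlet</ital>.</p>"
)
# Content-based: the markup says what each word is.
content = ET.fromstring(
    "<p>Well, <expletive>damn</expletive>, I loved <title>Hamlet</title>.</p>"
)

italic_words = [element.text for element in presentation.iter("ital")]
expletives = [element.text for element in content.iter("expletive")]
```

A search of the presentation-based version can only report that two words are italic, whereas the content-based version lets a program pick out the expletive alone.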

Structure-based tagging is a generic, flexible approach with a wide scope, most useful when exchanging documents within a discipline or across industries. Employing a loosely structured DTD, it emphasizes elements such as <section>, <subsection>, and <paragraph>. Additional information about content is often delegated to attributes: <section id="Introduction">.

Content-based tagging is a less flexible, custom approach with a narrow scope that's most useful when modeling content around clearly defined user needs, a unique class of documents, or both. Using a tightly structured DTD, it emphasizes the use of elements such as <introduction> and <explanation>.

In reality, however, most documents intended for publication on the Internet or intranet combine all three approaches. The higher levels of the hierarchy use content-based tags, the middle levels use structure-oriented ones, and the lower levels, especially at the clausal level, may, for expedience, use some presentational tags from XHTML.

Since the focus of this column is on creating XML documents for publication on the Internet, we'll use a combination of all three approaches and learn a bit about XHTML as we do so. But as you mark up documents in XML, you'll have to evaluate the structure of your documents and how they'll be used before you decide on your own approach. Just be sure you do some planning and design before beginning the markup process. David Megginson's book, Structuring XML Documents (Prentice Hall), offers a plethora of information and good advice about choosing an approach to tagging that works best for your document or project.

Hierarchical Trees
Last month I asked you to start getting your hands dirty with XML by using a text editor such as Notepad or UltraEdit to mark up your résumé, or part of it. I suggested that you use not only elements but also attributes and that you try to create tags describing the structure or content of your résumé. I also suggested that you think about what aspects of your résumé should be captured in attributes as opposed to elements. Marking up part of your résumé in XML and debugging it in Microsoft Internet Explorer 5.0 or greater should have resulted in the document being displayed like the hypothetical résumé fragment in Figure 1.

Notice how Internet Explorer's default view for an XML document reveals its hierarchical structure. The minus and plus signs allow the document to be displayed as a collapsible outline, indeed, as a collapsible tree. In XML, various concepts, most of which spring from the way we speak about trees or families, are used to express relationships within an XML document's hierarchical structure. At the base of its hierarchy each tree has a root element, which can be seen in Internet Explorer by clicking on the first minus sign of an XML document. From the root node stems a hierarchy of other branches and leaves. Leaves are terminal elements since they don't contain child elements.

Each element in a tree structure is called a node. Relationships among nodes are expressed using metaphors borrowed from families. The root node, for instance, is the parent of the nodes directly beneath it, called its children, and the ancestor of every other node in the document; together a parent and its children enter into a parent-child relationship. Although a parent element can have multiple children, each child node has exactly one parent node. XML and its accompanying specifications such as XPointer use such constructs as parent, child, sibling, ancestor, and descendant in keywords, expressions, and functions. For more information on parent-child relationships and related constructs, see XML in a Nutshell and XML Pocket Reference.
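The family vocabulary can be explored with a short program. The following is a sketch only, assuming Python's standard xml.etree.ElementTree module; because that module does not store a link from child to parent, the parent_of dictionary and the ancestors helper are hypothetical conveniences built for the example, and the markup is a cut-down résumé fragment.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<resume>"
    "<header><name>Jane Doe</name><contactInfo><phone/></contactInfo></header>"
    "<section id='Qualifications'><head>Summary of Qualifications</head></section>"
    "</resume>"
)

# Map every child node to its single parent node.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the tag names from node's parent up to the root."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node.tag)
    return chain


name = root.find("header/name")
children_of_root = [child.tag for child in root]
leaves = [node.tag for node in root.iter() if len(node) == 0]
siblings_of_name = [node.tag for node in parent_of[name] if node is not name]
name_ancestors = ancestors(name)
```

The root has no entry in parent_of, which is what stops the walk in ancestors; every other node appears there exactly once, reflecting the one-parent rule.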

Tutorial
Bringing to bear the XML constructs and strategies discussed above, let's step through the coding of part of an actual résumé. It is, of course, a bit more difficult than the coding behind the hypothetical one shown in Figure 1, which, perhaps somewhat naively, uses content-based tagging exclusively.

The way we intend to use the résumé determines, to a certain extent, how we should mark it up. If you're building a Web site that collects and presents résumés from job seekers, you'd probably want to employ a different markup strategy from one used to code a résumé for isolated publication on the Web. My primary objective here is to structure the content of the résumé in such a way that I can use it as the source for different output formats, not only HTML but also plain ASCII text, Portable Document Format (PDF), and Wireless Markup Language (WML). I'd also like to keep the résumé's structure somewhat flexible in case I decide to add additional material, such as a listing of computer skills or references. To mark up the résumé I'll blend all three markup approaches using both content- and structure-oriented tagging complemented by a smattering of presentation-based tagging at the lowest levels. Throughout, I'm careful to avoid duplicating information.

I begin with the usual XML declaration and include a character encoding declaration. The standalone value of "yes" indicates that an external DTD is not required.

<?xml version="1.0" encoding="UTF-8"
standalone="yes"?>
Next comes the document type declaration, which specifies that the document's root element is resume.
<!DOCTYPE resume>
The markup starts with the root element and branches into two high-level structure-oriented elements: header and section. I decide to use a repeatable structure-oriented section element instead of a set of nonrepeatable content-based tags such as experience and education since it enables me to add additional sections later.

To label the content of each section, I use the value of the id attribute.

<resume>
<header>
<name>Jane Doe</name>
<contactInfo>
<email>[email protected]</email>
<phone/>
<addresses>
<address type="Home">
<street>10 First Avenue</street>
<city>New York</city>
<state>New York</state>
<zip>10101</zip>
</address>
</addresses>
</contactInfo>
<portfolio>Online portfolio available
at <a href="http://www.JaneDoeportfolio.com">www.JaneDoeportfolio.com</a></portfolio>
<objective>To obtain a position as a <emph>content
author.</emph></objective>
</header>

<section id="Qualifications">
<head>Summary of Qualifications</head>
Instead of using a content-based tag like qualification for each item in the list of qualifications, I decide to simply borrow HTML's unordered list elements:
<ul>
<li>In-depth knowledge of multimedia design.</li>
</ul>
</section>
<section id="Positions">
<head>Experience</head>
<position>Content Author</position>
<employer>XYZ Multimedia Inc.</employer>
<duration>March 1998 through February 2001</duration>
<duties>
<ul>
<li>Created multimedia content for the company's Web site.</li>
<li>Used Photoshop to refine graphics created by other authors.</li>
</ul>
</duties>
</section>

<section id="Training">
<head>Education</head>

<education>

Because the year of graduation is not always displayed in a résumé but may still be useful information to have, it is encoded using an attribute:

<degree year="1997">MA</degree>
<subject>Photography</subject>
<school>University of Washington</school>
</education>
</section>
</resume>
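To show how markup of this kind can feed more than one output format, here is a minimal sketch that renders part of such a résumé as plain text. It assumes Python's standard xml.etree.ElementTree module instead of the XSL stylesheets mentioned in this column; to_plain_text, its show_year option, and the layout it produces are hypothetical choices, and RESUME is an abridged copy of the markup.

```python
import xml.etree.ElementTree as ET

RESUME = """<resume>
<header><name>Jane Doe</name></header>
<section id="Qualifications">
<head>Summary of Qualifications</head>
<ul><li>In-depth knowledge of multimedia design.</li></ul>
</section>
<section id="Training">
<head>Education</head>
<education>
<degree year="1997">MA</degree>
<subject>Photography</subject>
<school>University of Washington</school>
</education>
</section>
</resume>"""


def to_plain_text(xml_text, show_year=False):
    """Render the name, section heads, list items, and education entries."""
    root = ET.fromstring(xml_text)
    lines = [root.findtext("header/name").upper(), ""]
    for section in root.iter("section"):
        lines.append(section.findtext("head"))
        for item in section.iter("li"):
            lines.append("  * " + item.text)
        for edu in section.iter("education"):
            degree = edu.find("degree")
            entry = f"  {degree.text}, {edu.findtext('subject')}, {edu.findtext('school')}"
            if show_year:
                # The year is an attribute, so it appears only on request.
                entry += f" ({degree.get('year')})"
            lines.append(entry)
        lines.append("")
    return "\n".join(lines).rstrip()


plain = to_plain_text(RESUME)
plain_with_year = to_plain_text(RESUME, show_year=True)
```

Because the sections repeat the same structure, the loop does not need to know in advance which sections a résumé contains, and the graduation year stays out of the output unless it is asked for.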
Hands-on Work
This is, of course, just one possible way to mark up a résumé, not necessarily the best way. Deciding on how to build a data structure for a large set of résumés that will be made available on a Web site is a complicated task requiring consideration of a number of factors, including the wholesale avoidance of duplicate information, the flexibility to accommodate résumés written in different styles and with different content, and the capability to conduct specialized searches. Mark Wilson and Tracey Wilson, in Chapters 1 and 2 of their book, XML Programming with VB and ASP (Manning), provide additional examples about how to mark up a résumé or a collection of them in XML, but more important, they also explain the motivations for wanting to do so. W. Scott Means' article "Converting Unstructured Documents to XML," at http://xml.oreilly.com/news/xmlnut3_0301.html, demonstrates how to isolate elements to reveal a document's underlying structure. I recommend it.

To prepare for next month's column, analyze the way in which I structured the data in this résumé and identify what, in your opinion, I should have done differently. E-mail me with your point of view and the justification for it. But don't stop there. First, revisit the way you structured your résumé after reading last month's column; finish marking it up and debugging it in Internet Explorer 5.0 or greater if you haven't done so already. Second, conceptualize the rules that should constrain a résumé's data. To spur you down this path, I suggest you read pages 89-108 of Chapter 4, "Constraining XML," in Java and XML, written by Brett McLaughlin (O'Reilly).

Next month we'll dive headfirst into constraining XML data with document type definitions, or DTDs.

More Stories By Steve Hoenisch

Steve Hoenisch is a technical writer (consultant) with Verizon
Wireless. Before becoming a technical writer and a Web developer, he
worked as a journalist and teacher. Steve has been developing Web
sites since 1996.
