By Keith Thomas, Michel Vulpe | July 10, 2001 12:00 AM EDT
If XML is to have the same transforming effect on document-oriented applications as it has on transactional, business-to-business applications, it will have to be easily applied by authors who are experts in the subject matter, but have little knowledge of markup and less desire to learn it. This article describes an XML editor that does exactly that: it allows the rest of us to create valid XML in a familiar context without extensive knowledge of the rules of XML or the DTD behind the document.
Think back to the last time you used your word processor. Whether it was a few minutes or a few months ago, you probably focused on finding the right words to express your ideas. What you probably took for granted was the application of markup, or the instructions embedded in the sentences and paragraphs you wrote, that tell the word processor how to present and process the characters you typed.
The current thinking behind XML content creation tools is that users should know about markup and the rules for applying it: the marketing materials for mainstream XML editors place more emphasis on the structuring process than on content creation. We have already seen XML prove itself by making it easier for businesses to automate the exchange of information during transactions. But the application of markup during these processes is often automated when data is extracted from relational databases or concealed from users through a forms interface.
With documents, most organizations are locked into practices established by SGML and HTML: markup is either applied by a few specially trained writers as content is created, or applied by a few specially trained editors after creation. This makes some sense if the markup controls only the rendering of finished content in Web browsers or other publication tools. But XML offers to document-oriented business processes the same kinds of efficiencies it has delivered in transactional applications. It can provide extra dimensions of semantics and structure that can be used by computer programs to organize and manage the content, to make it more easily accessible and easily reused.
Documents that carry the bulk of high-value organizational knowledge are often created in a collaborative fashion and developed through multiple drafts. Examples include technical specifications; operating and maintenance manuals; market, financial and business case analyses; legislation and public policy; contracts; patent applications; regulatory and court filings; clinical practice parameters and case histories; and educational and training materials. The content in these documents is generally reused to create new versions, or for different purposes. Capturing this content in XML greatly aids in automating the management of reuse over its entire life cycle, leading to a tighter, collaborative integration of information repositories, business applications, and processes between an enterprise, its partners, suppliers, and customers.
This requires a large-scale deployment of XML content creation tools to the people who create the material. The problem is their time is expensive and their output valuable. Turning these people into "markup technicians" is akin to expecting them to add markup when using a word processor. Instead, it's critical to deliver XML to them in a way that supports their work, in language they understand.
'Tagless XML'
i4i's Tagless Editor addresses this problem. Tagless XML is, of course, from a technical perspective, an illusion because XML is all about tags and the rules for applying them.
Tagless XML is for humans. Tags provide computers with directives about content. When computers "talk" to computers, they don't care about visual clutter. Precision takes precedence. Humans care about visual clutter and processes that, to a computer, are imprecise. Tagless XML recognizes that if humans are to interact with computer systems using XML, then XML had better learn to present itself in a people-friendly manner.
Tagless Editor is an application developed with i4i's S4/TEXT, a developers' platform for the creation of custom XML applications in Microsoft Word. Feedback from many organizations indicates users are highly resistant to new software that requires them to structure content in XML. Therefore, Microsoft Word is the most logical application to deploy XML en masse because it already sits on approximately 80% of desktops worldwide.
This said, the solution is not as simple as outputting Microsoft Word files in well-formed XML. The value of XML to an enterprise is in how easily it transports data between departments and companies. Key to this is the ability of the receiving databases, workflow systems, etc., to interpret XML data. For this to happen reliably, the XML content has to be valid, adhering to the requisite document type definition (DTD) that specifies a document's grammar and semantics, which take the form of tags inserted into the content.
The DTD is developed with the "system" in mind. For example, a DTD specifies how content is to be organized so when it is delivered to another information system the content can be processed. Compliance with the DTD ensures these systems work as programmed. The problem is that writers, managers, lawyers, clerks, customers, partners, suppliers - the nontechnical users creating content for an enterprise - have little immediate interest in improving enterprise efficiency by adhering to a DTD. And, despite the fact that DTDs are essential to the dynamic information-sharing promise of XML, they are primarily architectural exercises focusing on the structural problem of XML tag application. They don't address the core human interface problem - people simply don't want to apply tags.
Tagless XML in the Real World
Making DTD-specific documents behave in a people-friendly way is precisely the problem the United States Patent and Trademark Office (USPTO) faced when it implemented the Patent Application Specification Authoring Tool (PASAT), a customized Tagless Editor.
To download and try out this application, you should visit www.uspto.gov/ebc/efs/downloads/downloadndx.htm.
The USPTO needed thousands of inventors and patent lawyers to submit their applications electronically in XML. It needed a solution that:
- Didn't require patent authors to purchase and learn special-purpose technologies to submit patent applications, but instead allowed them to use Microsoft Word.
- Met the USPTO's need for valid XML and the patent authors' need for the Microsoft Word interface, thereby saving the agency more than $25 million.
- Masked the XML tags by changing the behavior of certain interface components to support the business processes involved in authoring a patent application.
User-Friendly Structural Tags
PASAT ties USPTO-specific presentation tags to the appropriate Microsoft Word event. For instance, clicking the "B" button in Microsoft Word causes selected content to be bolded and the USPTO tags <emphasis> and </emphasis> to be inserted. In other areas, this formatting is context-sensitive. For example, in a patent specification the structure for a <specification-block> includes the tags <cross-reference-to-related-applications> followed by <federal-research-statement>, while the structure for <claims> includes <heading> followed by <claim>. The "Enter" key is mapped differently in each case: in the first, to automatically enter the <cross-reference-to-related-applications> tags; in the second, to automatically enter the <heading> tags. The role of these tags is as predictable as the role of a paragraph marker so there is no reason why the system can't insert them for the author.
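The context-sensitive mapping described above can be pictured as a simple lookup table. The following Python sketch is an illustration only, not PASAT's implementation: the element names are taken from the example in the text, while the table layout and the function names `tags_for_event` and `wrap_selection` are assumptions made for this sketch.

```python
# Hypothetical lookup: (enclosing element, key event) -> element to open.
# Element names come from the article's example; the table layout and
# function names are assumptions, not the PASAT implementation.
KEY_MAP = {
    ("specification-block", "Enter"): "cross-reference-to-related-applications",
    ("claims", "Enter"): "heading",
}

# Toolbar events that map to a presentation tag regardless of context.
TOOLBAR_MAP = {"Bold": "emphasis"}


def tags_for_event(context, event):
    """Return the (open, close) tag pair for an event, or None if unmapped."""
    name = KEY_MAP.get((context, event)) or TOOLBAR_MAP.get(event)
    if name is None:
        return None
    return f"<{name}>", f"</{name}>"


def wrap_selection(text, context, event):
    """Wrap the selected text in the mapped tags; leave it unchanged if unmapped."""
    pair = tags_for_event(context, event)
    if pair is None:
        return text
    open_tag, close_tag = pair
    return f"{open_tag}{text}{close_tag}"
```

In this sketch the same "Enter" event yields different tags depending on the enclosing element, which is the behavior the author experiences as an ordinary keystroke.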
User-Friendly Semantic Tags
What remain are the semantic tags requiring input from the user before they can be applied. These tags generally describe the nature of a piece of content. The solution is to leverage these tags to create a guide to assist the user in the authoring process.
The key is to frame the dialog about the XML tags as business questions. If the business problem is that graphics are required, the dialog behavior of the system should be structured around selecting an appropriate graphic, not around the XML tags. PASAT uses Microsoft Word's "Paperclip" Office Assistant to make suggestions. The user's selection tells the software how to tag the content.
In the case of a patent application, artwork to substantiate a claim is, according to editorial rules, always at the end of a document and the XML DTD is structured to reflect that structure. However, from the author's perspective artwork is part of an individual claim. The XML DTD is consistent with the editorial rule, but inconsistent with the author's process. To overcome this, the business dialog presented to patent authors when they are working in the claims section of the document includes an option to add artwork. If this option is selected, PASAT creates the valid XML and opens the "File/Find" dialog box. Once the user selects the appropriate item, PASAT automatically inserts the artwork, specified by the author, into the section specified by the DTD with the appropriate tags. At no point in this process is the user presented with <artwork> tags or the XML logic.
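As a rough illustration of that separation between where the author works and where the DTD wants the content, the Python sketch below appends artwork at the end of the document no matter which claim the author is editing. It uses the standard library's ElementTree as a stand-in; the root element name `application`, the `file` attribute, and the function name `insert_artwork` are assumptions for this sketch, and only `<claims>`, `<claim>`, `<heading>`, and `<artwork>` are names used in the text.

```python
import xml.etree.ElementTree as ET


def insert_artwork(root, file_name):
    """Append an <artwork> element at the end of the document.

    The author triggers this while working inside a claim, but the element
    is placed where the editorial rule requires it: after everything else.
    The "file" attribute is a hypothetical way to reference the graphic.
    """
    artwork = ET.SubElement(root, "artwork")  # SubElement appends as last child
    artwork.set("file", file_name)
    return artwork


# Hypothetical document skeleton; the root name is an assumption.
doc = ET.fromstring(
    "<application>"
    "<claims><heading>Claims</heading><claim>A widget.</claim></claims>"
    "</application>"
)
insert_artwork(doc, "figure1.tif")
```

After the call, the claims section is untouched and the new element is the last child of the root, so the author never has to know where the artwork physically lives in the markup.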
The business interface frames the user interaction with XML as business decision support. It is not restricted to XML; rather, it combines XML with the business processes inherent in the document structure that is driven by the XML (see Figure 1). Further, it does not slavishly follow the XML rule set. It recognizes that business processes can involve data identified by XML tags, which, for editorial or other reasons, are not serial, and that those reasons are inconsequential to the user. Most important, it recognizes that the content creation system, not the user, should bear the burden of the XML.
The Developer's Point of View
For the user, the essence of the Tagless Editor is "Familiar Interface, Simple Paradigm." For a document process administrator, however, the essence is easy customization. If the markup is to be concealed behind a façade representing the document in user terms, then it must be easy for an administrator to create new document types for specific business processes. This is built into the architecture of the product.
Figure 2 outlines the architecture of the S4/TEXT Tagless Editor.
Microsoft Word provides the user interface and the basic word processing functionality, but when operating with S4/TEXT Microsoft Word is under the complete control of the S4/TEXT Tagless Editor.
The data services layer comprises the parser and the S4 markup engine. The parser is James Clark's SP, which is particularly fast and reliable and handles both SGML and XML, wrapped to interface with our S4 markup engine. S4 provides an efficient in-memory representation of an instance of content in markup for searching and manipulation. The reader familiar with XML standards will ask, "Isn't that what the Document Object Model (DOM) does?" Yes, but S4 is not limited to XML; it also handles SGML. It was developed before any work was done on the DOM standard, and offers a richer set of capabilities. It is based on i4i's patented markup management software and includes all the features needed to search and manipulate the elements and attributes of an instance in markup, plus full dynamic validation against a DTD. It also enables an instance in markup to be processed as both a conventional tree structure and as a linear sequence of content data without embedded markup.
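To make the idea of a dual view concrete, here is a minimal Python sketch that derives a markup-free text stream and a list of element spans from an ordinary XML tree. It is not S4 and does not reflect its API or internal data structures; it uses the standard library's ElementTree, and the function name `linearize` and the span layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def linearize(root):
    """Return (text, spans) for an element tree.

    text  -- the content as one string with no embedded markup
    spans -- (start, end, tag) tuples locating each element in that string,
             listed in the order the elements close
    """
    parts = []
    spans = []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text follows the child but belongs to the parent's content.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        spans.append((start, pos, elem.tag))

    walk(root)
    return "".join(parts), spans


tree = ET.fromstring("<claim>A <emphasis>novel</emphasis> widget.</claim>")
text, spans = linearize(tree)
```

Here `text` holds only the characters an author would see, while `spans` records where each element begins and ends in that character stream, so either view can be used to locate the other.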
The DataPipe layer holds S4/TEXT. DataPipe is i4i's trademarked name for its software, based on its techniques for manipulating markup and content separately, that enables existing applications, in this case Microsoft Word, to use and produce data in markup. It creates and maintains a data structure that maps the data held in the host application to the data and markup held in S4, accommodating the fact that the host application may include content that is not intended to be included in the XML, and vice versa. A DataPipe is not intended to be used on its own, but rather to serve as a platform for developers to create custom applications through its COM API.
The primary function of S4/TEXT is to maintain a map relating the rendering of the document in Word to the XML instance in S4. It traps all keyboard and mouse events and determines which actions it should take and which to instruct Word to perform.
S4/TEXT uses a Word format template to determine which styles are to be applied based on the element name in the markup context of its specific enclosing elements. For example, a title in a section is in a different context than a title in a subsection, and can be treated differently. The styles are determined by the XML. The user never selects a style; styles are applied automatically. Styles can also be set up for element prefix and postfix text to supply captions in the Microsoft Word rendering that are not part of the content, and that cannot be edited, but which guide the user. The format template can also be used to conceal the content of various elements from the user, and to identify external viewers (ActiveX controls) to be used to display nontext inclusions (graphics, sound, etc.) referenced in the XML.
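One way to picture context-dependent style selection is a table keyed by element paths, where the most specific matching context wins. The Python sketch below is an assumption-laden illustration: the table `FORMAT_TEMPLATE`, the style names, the element names, and the function `style_for` are all invented here and say nothing about how the actual S4/TEXT format template is stored.

```python
# Hypothetical format template: keys are element paths (outermost first),
# values are Word style names. Every name here is made up for illustration.
FORMAT_TEMPLATE = {
    ("section", "title"): "Heading 1",
    ("subsection", "title"): "Heading 2",
    ("title",): "Title",
}


def style_for(path):
    """Return the style whose key matches the longest suffix of the path.

    path is the list of element names from the document root down to the
    element being rendered. Falls back to "Normal" when nothing matches.
    """
    for length in range(len(path), 0, -1):
        key = tuple(path[-length:])
        if key in FORMAT_TEMPLATE:
            return FORMAT_TEMPLATE[key]
    return "Normal"
```

With this table, a title directly inside a section and a title inside a subsection resolve to different styles even though both elements carry the same name, and the user never has to choose between them.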
S4/TEXT also manages custom items on the menu and tool bars, including buttons for applying markup to selected text; a tag display toggle button, enabling the power user to display the XML tags in the Word view if desired; and a document map toggle button.
The application layer of the Tagless Editor is written in VBA. Its primary function is to serve as the root add-in to Word to invoke the lower layers and to interpret the behavioral instance (BI).
The BI is itself an XML document, with its own DTD, and specifies four important types of behavior for the Tagless Editor:
- Context-sensitive suggestions and options available to the user
- Actions to be performed automatically on the insertion of any element in context into the document
- Actions to be taken on specific key events (e.g., Enter) in context during creation or editing
- The required and optional elements that form the framework of the document on creation
When the user selects any option involving the insertion of a new element, the corresponding actions specified in the BI are invoked. These may perform one or more actions, such as:
- Positioning the element to be inserted in the next relative valid position
- Inserting multiple enclosing tags in the markup
- Invoking dialog boxes or forms for the user to fill in
- Invoking some external process or data source
When the user selects "New Instance" from the Word file menu, he or she is presented with a scrollable dialog box listing all of the document types registered on the user's machine. When the user selects a document type, a check-box list of required and allowed elements named in a user-friendly way appears in an adjacent panel. The required elements are shown in bold type with their check boxes already marked and they cannot be changed. The optional items have empty check boxes that the user can click to select the element. This display is derived from the BI instance for each document type. Once the user has selected the elements he or she wants, the Tagless Editor creates a framework of predefined headings in the Word display as defined in the BI. These headings may include noneditable captions (specified in the format template) or replaceable content that indicates what the user should enter in their place (specified in the BI). This framework means the user doesn't have to create the document in a strict top-down fashion, but can move about filling in sections at will.
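The selection step can be sketched in a few lines of Python. This is a hypothetical model, not the BI format or the Tagless Editor's code: the tuple layout (element name, user-friendly label, required flag), the sample element names, and the function `build_framework` are assumptions chosen only to show how required items are always kept and optional items depend on the user's check boxes.

```python
# Hypothetical element list for one document type:
# (element name, user-friendly label, required?).
# In the article this information is derived from the BI; the layout and
# the names below are invented for illustration.
ELEMENTS = [
    ("title", "Title", True),
    ("abstract", "Abstract", False),
    ("body", "Body", True),
    ("appendix", "Appendix", False),
]


def build_framework(elements, checked):
    """Return heading labels, in document order, for the new instance.

    Required elements are always included; optional elements are included
    only if the user ticked their check box (names listed in `checked`).
    """
    return [
        label
        for name, label, required in elements
        if required or name in checked
    ]


headings = build_framework(ELEMENTS, {"appendix"})
```

Because the headings are produced in document order regardless of the order in which boxes were ticked, the resulting framework can be filled in section by section in any sequence.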
A portion of the PASAT BI appears in Listing 1, showing the definition of some of the required and optional elements on document creation.
While the user can never insert an unknown element or entity, nor insert an element in an improper place, he or she can have a document that is technically invalid in that required elements have not yet been added. The "Validate" option on the Tools menu allows the user to check the document. If there are missing required elements, the user is notified of the first such occurrence; otherwise, the user is told the document is valid.
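The reporting behavior of such a check, stopping at the first missing required element, can be sketched as follows. This Python example is deliberately simplified and is not DTD validation: it only tests for the presence of named child elements under the root, and the function names and message strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def first_missing(root, required):
    """Return the first required child element name absent from root, or None.

    `required` is an ordered list of element names. This checks presence
    only; it does not verify order, nesting, or content models the way a
    validating parser would.
    """
    present = {child.tag for child in root}
    for name in required:
        if name not in present:
            return name
    return None


def validate(root, required):
    """Return a user-facing message describing the validation result."""
    missing = first_missing(root, required)
    if missing is None:
        return "The document is valid."
    return f"Missing required element: {missing}"
```

For example, a draft containing only a heading would be reported as missing the next required element, and the message changes to the valid case once every required element has been added.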
Additional features controllable from the BI include:
- A dialog to create tables (tables conform to the CALS Exchange model)
- A dialog to insert entities from the available repertoire
- A dialog to include various types of graphics
- A function to number paragraphs or other elements within the document
With the source code of the Tagless Editor (supplied with the product), a programmer can develop additional custom features, such as dialog boxes and forms to simplify the capture of highly structured content; support for interactive connections to external processes/databases; and integration with workflow and content management systems.
Copyright © 2001 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Keith Thomas
Keith Thomas joined i4i in 1997 as director of R&D. He is currently a product strategist. Keith has 35 years of experience in information technology. He is an author and adjunct member of the graduate faculty at the University of Toronto.
More Stories By Michel Vulpe
Michel Vulpe, chief technology officer and founder of i4i, has more than 15 years of experience in the information processing industry. He holds an MA from the University of Toronto.
C. Race 07/19/01 05:09:00 PM EDT
Hi: Interesting article. There is a tool called WorX (www.hvltd.com) that provides XML within Microsoft Word. We have used it in several projects and believe it is a better solution than i4i--thought you might want to take a look.