Trends in High Volume XML Publishing


Integrating efficient XML publishing into high-volume content environments remains a significant challenge. Among the many real-world barriers are the need to convert large quantities of paper and other legacy documents, the need to integrate easy-to-use XML publishing tools into the content-creation process, and the lack of workflow management tools necessary for mass conversion environments.

In many environments content creators resist using XML authoring tools, preferring traditional word processing or desktop publishing applications, and simplified "template-style" DTDs are used to accommodate productivity requirements. Consequently, high-volume XML conversions are typically accomplished through "brute force" solutions, where mass OCR (optical character recognition) scanning and tagging are done through expensive outsourcing, often to developing countries where labor for repetitive high-volume publishing tasks is plentiful and inexpensive.

To realize the full benefits of XML for both highly structured and mixed-structure content in high volumes - without the cost, cycle time requirements, and other outcomes inherent in outsourcing - an XML publishing system that minimizes the ongoing intervention of XML programmers is essential. To efficiently convert documents in Word, HTML, PDF, RTF, or other formats into XML, intelligent, rule-based automated markup solutions are required.

For large-volume projects, efficient XML publishing also requires batch processing and workflow management solutions that optimize productivity. In this article I'll discuss the process requirements for high-volume XML creation and introduce new tools and technologies expressly developed for these mass-conversion environments.

The High-Volume XML Publishing Challenge
Any document can be represented in XML, but document types vary widely, creating diverse challenges for high-volume publishing requirements. Documents can possess data that ranges from highly consistent and repetitive to extremely diverse content structures that defy accurate digital representation. For example, accident reports, product catalogs, employment applications, financial forms, and other types of documents are very amenable to automated XML conversion solutions. On the other hand, dissertations, marketing reports, résumés, news articles, and other documents feature abstract intellectual information with highly diverse components and inconsistent composition.

For highly structured content, identifying and tagging variables can easily be automated through forms, scripts, and other techniques, but mixed-structure data requires an XML authoring tool or a postauthoring conversion process. The assignment of tags to elements in a document is fundamentally a separate and distinct exercise from the authoring process. Authors may intuitively recognize elements - such as a "phone number," "chapter heading," "ingredient," or "customer type" - but identifying text as an element requires a start tag and a matching end tag, each naming the element type and delimited by angle brackets (<phonenumber>...</phonenumber>). This process can be simplified and accelerated, but it can't be eliminated.

Requiring content creators - knowledge workers such as technical writers, paralegals, insurance adjusters, law enforcement personnel, research professionals, and lab technicians - to perform manual tagging on original content is an unrealistic expectation in many environments. While considerable advancements have been made in XML authoring tools, they remain unappealing to many users who prefer standard word processing or desktop publishing software. Drop-down menus for tag selection, template support, advanced scripting, macros, and other modern enhancements to XML authoring systems still require manual tagging, an activity that is wholly separate from the document creation process.

In addition to the variability of document type, the thoroughness of document representation also may vary widely, depending on the application. It's possible to categorize a document simply by title, date, subject, and author, or render a document with hundreds of variables. The wide variability in content types, content sources, and data applications requires the customization of virtually every high-volume XML publishing system in order to achieve specific business goals.

High-volume XML publishing also frequently involves legacy documents in paper or electronic form. It's not uncommon for an XML publishing project to involve warehouses of paper in bankers' boxes, thousands of pounds of microfilm and microfiche, thousands of tapes or disks in obsolete, proprietary formats, and/or literally terabytes of PDF files. In many industries the events that initiate an XML project - such as mergers/acquisitions, new document management procedures, government regulations, and/or new business initiatives - are also the events that involve the greatest volume of archival information requiring XML conversion.

Key Requirements for a Mass Conversion Platform
For the full benefits of XML to be realized for high-volume conversion of unstructured content, a production platform will increasingly use intelligent automated solutions (autotagging) to achieve productivity and quality control objectives. In some applications these automated tagging solutions can coexist with scripts and templates, but the ultimate solution should be driven by content goals. Content authors will resist using XML tools or a template in many environments; moreover, content is often generated outside an organization, where conformance to standard authoring procedures is impossible.

Another requirement for high-volume XML publishing is that the DTD/Schema structure should be determined by the user or application, not by the XML editing tool, production platform, or skills of the operators. In addition to high cost and long cycle time, outsourcing XML conversion also frequently compels the use of standardized or simplified DTDs for productivity reasons, rather than the richly structured DTDs demanded by the application. Sacrificing powerful DTDs to reach cost or productivity goals is a short-term strategy that may negatively impact the overriding knowledge-management goals of the organization.

XML publishing platforms also require an integrated system that addresses all steps in the conversion process, from input through tagging, proofing, validation, and quality control. To accomplish this in an efficient manner, high-volume XML publishing requires batch processing and workflow management solutions that optimize productivity.

Automated Markup
New technologies now emerging replace manually intensive markup processes with automated tagging to enable efficient in-house conversion projects as well as support outsourcing solutions. Autotagging technology will assign data into appropriate DTD tags based on comprehensive rules or identified strings set up by content administrators. The amount of markup that can be accomplished with these new technologies will depend on the type of document converted and the resolution of the DTD structure assigned. With many types of content, over 95% of the markup can be accomplished with autotagging, enabling mass conversion projects to be accomplished efficiently without outsourcing to a service bureau.

These new automated markup tools allow content administrators to develop comprehensive rules that define, identify, and assign element tags based on user-specified DTD/Schemas. These rules can be extremely sophisticated patterns set up with strings based on key words, phrases, document location, data type (numeric, alphanumeric), or other identifiable patterns.
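The idea of rules that map identifiable strings to element tags can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Rule` class, the `autotag` function, the element names, and the two patterns are hypothetical and are not taken from any product described in this article. Regular expressions stand in for the richer rule components (document location, data type, and so on) that a real tool would offer.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Rule:
    """One hypothetical autotagging rule: a pattern mapped to an element name."""
    element: str  # element name from the user-specified DTD/Schema (assumed)
    pattern: str  # regular expression that identifies the element's text


# Illustrative rules only; real rules would be defined by a content administrator.
RULES = [
    Rule("phonenumber", r"\b\d{3}-\d{3}-\d{4}\b"),
    Rule("invoiceid", r"\bINV-\d{6}\b"),
]


def autotag(text: str, rules: list) -> str:
    """Wrap each rule match in its element tags and escape all other text."""
    # One named group per rule; the patterns must not contain capturing groups.
    combined = re.compile(
        "|".join(f"(?P<{r.element}>{r.pattern})" for r in rules)
    )
    out, pos = [], 0
    for m in combined.finditer(text):
        out.append(escape(text[pos:m.start()]))  # untagged text before the match
        name = m.lastgroup                       # name of the rule that matched
        out.append(f"<{name}>{escape(m.group())}</{name}>")
        pos = m.end()
    out.append(escape(text[pos:]))               # trailing untagged text
    return "".join(out)


print(autotag("Call 555-123-4567 re INV-000042 & more", RULES))
```

Text that no rule matches passes through untagged, which is where manual tagging for exception handling would take over.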

Representative sample documents that are fully marked up are used to identify the patterns, signifiers, and rules that indicate element tags. Drop-down menus list the possible rule components, such as key words, digits, spacing, and formats that may be grouped to form a rule.

Automated markup can also be extremely effective for converting simple DTD/Schemas into complex DTD/Schemas and in replacing costly, cumbersome, scripted approaches to XML conversions. For highly variable content with a thorough DTD, 60-70% of the markup can be accomplished - enough to make an enormous impact on the cycle time and cost of XML publishing.

Workflow Design and Production Control
High-volume XML publishing requires workflow management solutions to achieve appropriate productivity and production control goals. For enterprise-scale processing of XML data, content administrators want to assign and distribute separate production tasks to various operators to maximize output, balance workloads, and ensure the highest quality. In addition, the workflow design must provide effective management oversight, enabling appropriate monitoring, notification, approvals, and audit trails. In most applications the production system will also require policies and procedures to ensure appropriate information security and workstation statistics for production control. To accomplish these objectives, workflow management tools need to be highly integrated into the mass conversion process from authoring and OCR scanning, through a multistep markup process and on to proofing and validation.

To achieve this high level of integration between workflow management and production tasks, XML publishing platforms require a project dispatching system that spans every step, from initial input (electronic document or scan) through autotagging, manual tagging, proofing, validation, and posting. The system should manage this process by automatically dispatching work-in-process files to workstations based on a predetermined design established by administrators. The workflow dispatching system is integrated into the scanning, autotagging, manual tagging, and proofing/validation tools throughout, and work-in-process status can be determined at any time by authorized content managers.
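A dispatching system of this kind can be pictured as a small state machine. The sketch below is a minimal illustration, not a description of any actual product: the stage names follow the list in the paragraph above, while the `Dispatcher` class, its methods, the round-robin assignment policy, and the workstation names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Stage order follows the article; everything else here is hypothetical.
STAGES = ["input", "autotagging", "manual tagging",
          "proofing", "validation", "posting"]


@dataclass
class Dispatcher:
    # Workstations an administrator has assigned to each stage.
    assignments: Dict[str, List[str]]
    # Work-in-process status: document id -> current stage.
    status: Dict[str, str] = field(default_factory=dict)
    # Per-stage counter used for round-robin assignment (an assumed policy).
    counters: Dict[str, int] = field(default_factory=dict)

    def submit(self, doc_id: str) -> str:
        """Register a new document at the first stage; return its workstation."""
        return self._dispatch(doc_id, STAGES[0])

    def advance(self, doc_id: str) -> Optional[str]:
        """Move a document to the next stage; return None once it is complete."""
        i = STAGES.index(self.status[doc_id])
        if i + 1 == len(STAGES):
            self.status[doc_id] = "complete"
            return None
        return self._dispatch(doc_id, STAGES[i + 1])

    def _dispatch(self, doc_id: str, stage: str) -> str:
        stations = self.assignments[stage]
        n = self.counters.get(stage, 0)
        self.counters[stage] = n + 1
        self.status[doc_id] = stage
        return stations[n % len(stations)]  # balance load across workstations


dispatcher = Dispatcher({s: [f"{s}-ws1", f"{s}-ws2"] for s in STAGES})
dispatcher.submit("doc-001")
dispatcher.advance("doc-001")
print(dispatcher.status)  # status can be queried at any time
```

Keeping the status map in one place is what lets an authorized content manager check work-in-process status at any point; a production system would add persistence, access control, and audit trails.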

Legacy document conversion often must be accommodated in mass conversion operations. Capturing accurate data from scanned documents is conducted as a preprocess module prior to XML conversion and may require varying degrees of proofing/quality control. Converting legacy digital file formats such as MS Office Suite files, PageMaker files, RTF files, or PDF files into XML requires less proofing/quality control prior to XML markup and can also be seamlessly integrated into the mass conversion process.

Workflow design becomes an important factor in optimizing XML conversion. The number of workstations for any particular production process will be determined by the specific nature of the project, such as expected accuracy of the OCR process or quantity of document types (i.e., scanned or electronic documents). Workflow designs that use automated markup processes will differ significantly from designs that depend on manual markup. In manual markup processes workflow design is typically based on DTD complexity, where specialized tagging tasks are distributed to operators specifically trained in a subset of the content. Automated markup processes using autotagging are designed to eliminate these specialized tagging processes, enabling highly sophisticated markups to be accomplished through only two stages: autotagging and manual tagging for exception handling. In this way autotagging allows the workflow design to be simplified and fixed across a variety of conversion projects; the number of workstations will be dictated primarily by volume.

The combination of autotagging solutions with effective manual markups for exception handling and quality control allows content administrators to specify the depth and detail of the DTD/Schema based on the application without a loss in productivity or cycle time. To allow for breaking up the tagging process for complex documents in a production setup, advanced DTD editors provide multiple "views" of a richly structured DTD tree, further simplifying and accelerating the conversion of specialized documents.

Designing a High-Volume Conversion Process
The real-world variability in content types, content sources, volume, and in-house resources requires customized process solutions to realize efficient high-volume conversion. Only through a rigorous analysis phase and thorough process design can an XML conversion system be optimized for any particular environment. The use of consultants without commercial ties to specific vendors is often extremely useful in fully evaluating options and understanding the critical links between technology tools and human-factor organizational issues.

The process for mass conversion typically begins with a planning phase that addresses variables such as source material evaluation, target evaluation (number of targets and DTDs), volume estimates, and time frame. These issues will define the project scope, budget, and quality levels. The design phase will determine process flows, organizational requirements, infrastructure needs, workload balancing tools, and other implementation requirements.

Because the accuracy of OCR engines, organizational issues, and XML conversion quality must be thoroughly validated before full production, a proof of concept is generally a part of any comprehensive implementation. Output quality of every document source must be scrutinized in detail to eliminate errors and anomalies. In addition, productivity goals, cost estimates, and other objectives require validation.

Even with a rigorous testing phase, ongoing monitoring of quality will be required due to the variability of labor-intensive proofing and tagging steps. Conversion schedules, material trafficking, exception reporting, and delivery mechanisms need to be optimized after system rollout to achieve full benefits and efficiencies.

As XML publishing proliferates in corporations, governments, educational institutions, service bureaus, and other organizations, seamlessly integrating the markup process into content generation and authoring will be a primary objective for tool and platform developers. For cycle time and information security reasons, organizations will increasingly look toward cost-effective methods to accomplish the markup process through in-house means.

Trends in both workflow management and automated markups anticipate an expanded role for XML content administrators. Rather than specific expertise in XML syntax and semantics, content administrators will require new skill sets in organizational leadership, process design, and process management. The widespread acceptance of XML - along with advances in tools and publishing platforms - will usher in a new, expanded role for the content administrator in today's dynamic organizations.

Pattern Recognition for Automated XML Tagging
The inherent complexities and inefficiencies of manual markup tagging for mixed-structure content are being addressed through tools that use rules driven by identified strings and patterns to automate the tagging process. These tools create unique pattern files that are associated with specific elements. Elements generally have multiple rules or patterns that can be used to identify accurate tags.

Compared with writing Perl scripts, drag-and-drop is the fastest and most accurate way to associate structure points and data with a pattern. Sample documents are fully marked up and used to identify the patterns, signifiers, and rules that indicate element tags. Common expressions that may indicate relevant text and structure associated with a rule can be quickly identified using color coding, multiple windows, and other user interface techniques. Drop-down menus list the possible rule components such as key words, digits, spacing, and formats that may be grouped to form a rule.

Testing, refining, and optimizing the rules over sample documents are integral parts of the process. Processing sample documents identifies additional rules, exceptions, anomalies, and other factors that can be accounted for in the full production run. Rule libraries can be developed to allow content administrators to quickly build autotagging processes for common elements over a wider variety of document types. In this way the automated-tagging process can be optimized over time, eliminating a vast majority of the manual markup requirements for XML publishing.
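The test-and-refine loop can be approximated as follows. This is a sketch under assumptions: the rule library layout (element name mapped to a list of alternative regular expressions), the `evaluate` function, and the sample data are all hypothetical, chosen only to show how running rules over hand-marked samples exposes the cases that still need a rule.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical rule library: element name -> alternative patterns for it.
RULE_LIBRARY = {
    "date": [
        rf"(?:{MONTHS})\s+\d{{1,2}},\s*\d{{4}}",  # e.g. January 1, 2002
        r"\d{1,2}/\d{1,2}/\d{4}",                 # e.g. 3/15/2001
    ],
    "phonenumber": [r"\d{3}-\d{3}-\d{4}"],
}


def evaluate(samples, library):
    """Run the library over hand-marked samples.

    samples: list of (text, expected_element) pairs taken from fully
    marked-up sample documents. Returns the fraction tagged correctly
    and a list of (text, expected, matched_elements) for each miss.
    """
    misses = []
    for text, expected in samples:
        matched = {
            element
            for element, patterns in library.items()
            if any(re.fullmatch(p, text) for p in patterns)
        }
        if matched != {expected}:
            misses.append((text, expected, sorted(matched)))
    return 1 - len(misses) / len(samples), misses


SAMPLES = [
    ("January 1, 2002", "date"),
    ("3/15/2001", "date"),
    ("555-123-4567", "phonenumber"),
    ("1 Jan 2002", "date"),  # no rule covers this form yet
]

accuracy, misses = evaluate(SAMPLES, RULE_LIBRARY)
print(accuracy, misses)
```

Each miss points to a candidate for a new rule or a documented exception, and the library grows with every pass over the samples.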

Practical Examples
Using the conventions for pattern components and character mnemonics in Tables 1 and 2, a sample pattern file can be created. Table 1 summarizes the major pattern components required for effective automated markup. The elements are used to initiate and execute a pattern search. Drop-down menus facilitate rule building. Table 2 includes some of the character mnemonics and their relationship to Perl scripts.

A rule for identifying an element begins with an examination of what patterns constitute the data. Consider the date January 1, 2002, as an example.

The process for creating a rule based on this character set requires defining the pattern for each component (month, day, comma, year) and associating the block to the DTD. Each pattern is created through drag-and-drop by simply highlighting the component with a mouse click. The completed pattern file, composed of multiple patterns, would then look like Listing 1.

The pattern in Listing 1 would produce the following in the XML output:

<document>
  <date>
    <month>January</month>
    <day>1</day>
    <year>2002</year>
  </date>
</document>
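For readers more familiar with regular expressions than with pattern files, the date rule can be approximated in Python. This is not the pattern file from Listing 1; it is a regex stand-in that assumes an English month name, a one- or two-digit day, a comma, and a four-digit year. The `date_to_xml` function name is hypothetical. It yields the same element structure as the output above.

```python
import re
from xml.sax.saxutils import escape

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# One named group per component: month, day, (literal comma), year.
DATE = re.compile(
    rf"\b(?P<month>{MONTHS})\s+(?P<day>\d{{1,2}}),\s*(?P<year>\d{{4}})\b"
)


def date_to_xml(text):
    """Return the first date found in text as a small XML document, or None."""
    m = DATE.search(text)
    if m is None:
        return None
    # The comma is part of the pattern but is not emitted as an element.
    parts = "\n".join(
        f"    <{name}>{escape(m.group(name))}</{name}>"
        for name in ("month", "day", "year")
    )
    return f"<document>\n  <date>\n{parts}\n  </date>\n</document>"


print(date_to_xml("Filed on January 1, 2002."))
```

The named groups play the role of the per-component patterns, and the surrounding `<date>` element corresponds to associating the block with the DTD.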

Common rules, such as the one above, can be stored in rules libraries and reused for multiple document types. Many observers believe that only through such automated markup techniques can XML publishing be efficiently incorporated in high-volume content applications consisting of mixed-structure content. Further developments in the use of intelligent rule-based algorithms to automate the markup process will emerge for applications in the legal, financial, technical, medical, regulatory, and other fields involving mixed-structure content. These developments will involve both Boolean operations and other forms of artificial intelligence, such as neural networks, to produce XML markups.

About Evan Huang
Evan Huang is cofounder and chief technology officer of XMLCities, a developer of XML content creation, conversion, and publishing tools. A frequent contributor to various technical journals, Evan previously held positions at SRI and Adobe Systems and taught at Notre Dame and Northwestern Polytechnic. He holds a PhD in electrical engineering from Notre Dame.
