Trends in High Volume XML Publishing
By: Evan Huang
Integrating efficient XML publishing into high-volume content environments remains a significant challenge. Among the many real-world barriers are the need to convert large quantities of paper and other legacy documents, the need to integrate easy-to-use XML publishing tools into the content-creation process, and the lack of workflow management tools necessary for mass conversion environments. In many environments content creators resist using XML authoring tools, preferring traditional word processing or desktop publishing applications, and simplified "template-style" DTDs are used to accommodate productivity requirements. Consequently, high-volume XML conversions are typically accomplished through "brute force" solutions, where mass OCR (optical character recognition) scanning and tagging are done through expensive outsourcing, often to developing countries where labor for repetitive high-volume publishing tasks is plentiful and inexpensive.

To realize the full benefits of XML for both highly structured and mixed-structure content in high volumes - without the cost, cycle time requirements, and other outcomes inherent in outsourcing - an XML publishing system that minimizes the ongoing intervention of XML programmers is essential. To efficiently convert documents in Word, HTML, PDF, RTF, or other formats into XML, intelligent, rule-based automated markup solutions are required. For large-volume projects, efficient XML publishing also requires batch processing and workflow management solutions that optimize productivity.

In this article I'll discuss the process requirements for high-volume XML creation and introduce new tools and technologies expressly developed for these mass-conversion environments.
The High-Volume XML Publishing Challenge
For highly structured content, identifying and tagging variables can easily be automated through forms, scripts, and other techniques, but mixed-structure data requires an XML authoring tool or a postauthoring conversion process. The assignment of tags to elements in a document is fundamentally a separate and distinct exercise from the authoring process. Authors may intuitively recognize elements such as a "phone number," "chapter heading," "ingredient," or "customer type," but identifying one as an element requires a start tag and an end tag, each naming the element type and delimited by angle brackets (for example, <phonenumber> and </phonenumber>). This process can be simplified and accelerated, but it can't be eliminated.

Requiring content creators - knowledge workers such as technical writers, paralegals, insurance adjusters, law enforcement personnel, research professionals, and lab technicians - to perform manual tagging on original content is an unrealistic expectation in many environments. While considerable advancements have been made in XML authoring tools, they remain unappealing to many users who prefer standard word processing or desktop publishing software. Drop-down menus for tag selection, template support, advanced scripting, macros, and other modern enhancements to XML authoring systems still require manual tagging, an activity that is wholly separate from the document creation process.

In addition to the variability of document type, the thoroughness of document representation may also vary widely, depending on the application. It's possible to categorize a document simply by title, date, subject, and author, or to render a document with hundreds of variables. The wide variability in content types, content sources, and data applications requires the customization of virtually every high-volume XML publishing system in order to achieve specific business goals.

High-volume XML publishing also frequently involves legacy documents in paper or electronic form. It's not uncommon for an XML publishing project to involve warehouses of paper in bankers' boxes, thousands of pounds of microfilm and microfiche, thousands of tapes or disks in obsolete, proprietary formats, and/or literally terabytes of PDF files. In many industries the events that initiate an XML project - such as mergers/acquisitions, new document management procedures, government regulations, and/or new business initiatives - are also the events that involve the greatest volume of archival information requiring XML conversion.
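As a small illustration of the simplest level of representation mentioned in this section (title, date, subject, and author), the following sketch builds such a record with Python's standard library. The function name, the element names, and the sample values are hypothetical and are not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET


def minimal_record(title, date, subject, author):
    """Wrap four descriptive fields in start and end tags under one root element."""
    doc = ET.Element("document")  # hypothetical root element name
    for name, value in (("title", title), ("date", date),
                        ("subject", subject), ("author", author)):
        ET.SubElement(doc, name).text = value
    return ET.tostring(doc, encoding="unicode")


# Hypothetical sample values, for illustration only.
record = minimal_record("Sample Report", "2002-01-01", "XML", "A. Writer")
print(record)
```

A richly structured DTD would replace these four fields with many more element types, which is where manual tagging becomes costly.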
Key Requirements for a Mass Conversion Platform
A key requirement for high-volume XML publishing is that the DTD/Schema structure should be determined by the user or application, not by the XML editing tool, production platform, or skills of the operators. In addition to high cost and long cycle time, outsourcing XML conversion frequently compels the use of standardized or simplified DTDs for productivity reasons, rather than the richly structured DTDs demanded by the application. Sacrificing powerful DTDs to reach cost or productivity goals is a short-term strategy that may negatively impact the overriding knowledge-management goals of the organization.

XML publishing platforms also require an integrated system that addresses all steps in the conversion process, from input through tagging, proofing, validation, and quality control. To accomplish this efficiently, high-volume XML publishing requires batch processing and workflow management solutions that optimize productivity.
Automated Markup
Automated markup tools allow content administrators to develop comprehensive rules that define, identify, and assign element tags based on user-specified DTD/Schemas. These rules can be extremely sophisticated patterns built from strings based on key words, phrases, document location, data type (numeric, alphanumeric), or other identifiable patterns. Representative sample documents that are fully marked up are used to identify the patterns, signifiers, and rules that indicate element tags. Drop-down menus list the possible rule components, such as key words, digits, spacing, and formats, that may be grouped to form a rule. Automated markup can also be extremely effective for converting simple DTD/Schemas into complex DTD/Schemas and for replacing costly, cumbersome, scripted approaches to XML conversions. For highly variable content with a thorough DTD, 60-70% of the markup can be accomplished automatically - enough to make an enormous impact on the cycle time and cost of XML publishing.
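The idea of rules that assign element tags from key words, data types, and formats can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any specific product: the `Rule` class, the element names, and the patterns are all hypothetical, and each rule is assumed to match a whole line of text.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Rule:
    element: str  # hypothetical element name from a user-specified DTD
    pattern: str  # regular expression that must match the whole line


# Hypothetical rules: a key word, a numeric format, and a quantity format.
RULES = [
    Rule("chapterheading", r"Chapter \d+: .+"),
    Rule("phonenumber", r"\d{3}-\d{3}-\d{4}"),
    Rule("ingredient", r"\d+ (?:cups?|tsp|tbsp) .+"),
]


def autotag(lines):
    """Return (tagged, exceptions); lines that no rule matches need manual tagging."""
    tagged, exceptions = [], []
    for line in lines:
        for rule in RULES:
            if re.fullmatch(rule.pattern, line):
                tagged.append(f"<{rule.element}>{escape(line)}</{rule.element}>")
                break
        else:
            exceptions.append(line)
    return tagged, exceptions


sample = ["Chapter 1: Soups", "2 cups flour",
          "Call 555-123-4567 today", "555-123-4567"]
tagged, exceptions = autotag(sample)
coverage = len(tagged) / len(sample)  # share of lines tagged automatically
```

In this toy sample three of the four lines are tagged and one falls through as an exception. The figure says nothing about real documents; it only shows how the split between automated and manual markup could be measured.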
Workflow Design and Production Control
To integrate workflow management tightly with production tasks, XML publishing platforms require a project dispatching system that spans every step from initial input (electronic document or scan) through autotagging, manual tagging, proofing, validation, and posting. The system should manage this process by automatically dispatching work-in-process files to workstations based on a predetermined design established by administrators. The workflow dispatching system is integrated into the scanning, autotagging, manual tagging, and proofing/validation tools throughout, and work-in-process status can be determined at any time by authorized content managers.

Legacy document conversion often must be accommodated in mass conversion operations. Capturing accurate data from scanned documents is performed as a preprocessing module prior to XML conversion and may require varying degrees of proofing/quality control. Conversion of legacy digital file formats such as MS Office Suite files, PageMaker files, RTF files, or PDF files into XML requires less proofing/quality control prior to XML markup and can also be seamlessly integrated into the mass conversion process.

Workflow design becomes an important factor in optimizing XML conversion. The number of workstations for any particular production process will be determined by the specific nature of the project, such as the expected accuracy of the OCR process or the quantity of document types (i.e., scanned or electronic documents). Workflow designs that use automated markup will differ significantly from those that depend on manual markup. In manual markup processes, workflow design is typically based on DTD complexity, with specialized tagging tasks distributed to operators specifically trained in a subset of the content. Automated markup processes using autotagging are designed to eliminate these specialized tagging processes, enabling highly sophisticated markups to be accomplished through only two stages: autotagging and manual tagging for exception handling. In this way autotagging allows the workflow design to be simplified and fixed across a variety of conversion projects; the number of workstations will be dictated primarily by volume.

The combination of autotagging solutions with effective manual markups for exception handling and quality control allows content administrators to specify the depth and detail of the DTD/Schema based on the application without a loss in productivity or cycle time. To allow the tagging process for complex documents to be broken up in a production setup, advanced DTD editors provide multiple "views" of a richly structured DTD tree, further simplifying and accelerating the conversion of specialized documents.
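The dispatching behavior described in this section can be sketched as a small Python class. The stage order follows the text; the workstation names, the round-robin assignment policy, and the `Dispatcher` class itself are assumptions made for illustration only.

```python
from itertools import cycle

# Stage order taken from the text; workstation names are hypothetical.
STAGES = ["input", "autotagging", "manual tagging",
          "proofing", "validation", "posting"]
WORKSTATIONS = {
    "input": ["scan-1"],
    "autotagging": ["auto-1"],
    "manual tagging": ["tag-1", "tag-2"],
    "proofing": ["proof-1"],
    "validation": ["valid-1"],
    "posting": ["post-1"],
}


class Dispatcher:
    def __init__(self):
        # Round-robin over the workstations of each stage (an assumed policy).
        self._next_station = {stage: cycle(names)
                              for stage, names in WORKSTATIONS.items()}
        self.status = {}  # file name -> (stage, workstation)

    def submit(self, filename):
        self._assign(filename, STAGES[0])

    def advance(self, filename):
        """Move a file to the next stage; return False if it is already posted."""
        stage, _ = self.status[filename]
        index = STAGES.index(stage)
        if index + 1 == len(STAGES):
            return False
        self._assign(filename, STAGES[index + 1])
        return True

    def _assign(self, filename, stage):
        self.status[filename] = (stage, next(self._next_station[stage]))


dispatcher = Dispatcher()
for name in ["a.doc", "b.pdf"]:
    dispatcher.submit(name)
    dispatcher.advance(name)  # input -> autotagging
    dispatcher.advance(name)  # autotagging -> manual tagging
print(dispatcher.status)
```

The `status` dictionary plays the role of the work-in-process view available to content managers. Adding capacity for a higher volume means adding names to a stage's workstation list rather than changing the design.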
Designing a High-Volume Conversion Process
The process for mass conversion typically begins with a planning phase that addresses variables such as source material evaluation, target evaluation (number of targets and DTDs), volume estimates, and time frame. These issues will define the project scope, budget, and quality levels. The design phase will determine process flows, organizational requirements, infrastructure needs, workload balancing tools, and other implementation requirements.

Because the accuracy of OCR engines, organizational issues, and XML conversion quality must be thoroughly validated before full production, a proof of concept is generally a part of any comprehensive implementation. Output quality of every document source must be scrutinized in detail to eliminate errors and anomalies. In addition, productivity goals, cost estimates, and other objectives require validation. Even with a rigorous testing phase, ongoing monitoring of quality will be required due to the variability of labor-intensive proofing and tagging steps. Conversion schedules, material trafficking, exception reporting, and delivery mechanisms need to be optimized after system rollout to achieve full benefits and efficiencies.

As XML publishing proliferates in corporations, governments, educational institutions, service bureaus, and other organizations, seamlessly integrating the markup process into content generation and authoring will be a primary objective for tool and platform developers. For cycle time and information security reasons, organizations will increasingly look toward cost-effective methods to accomplish the markup process through in-house means.

Trends in both workflow management and automated markups anticipate an expanded role for XML content administrators. Rather than specific expertise in XML syntax and semantics, content administrators will require new skill sets in organizational leadership, process design, and process management. The widespread acceptance of XML - along with advances in tools and publishing platforms - will usher in a new, expanded role for the content administrator in today's dynamic organizations.
Pattern Recognition for Automated XML Tagging
Drag-and-drop, rather than Perl scripting, is the fastest and most accurate way to associate structure points and data with a pattern. Sample documents are fully marked up and used to identify the patterns, signifiers, and rules that indicate element tags. Common expressions that may indicate relevant text and structure associated with a rule can be quickly identified using color coding, multiple windows, and other user interface techniques. Drop-down menus list the possible rule components, such as key words, digits, spacing, and formats, that may be grouped to form a rule. Testing, refining, and optimizing the rules over sample documents are integral parts of the process. Processing sample documents identifies additional rules, exceptions, anomalies, and other factors that can be accounted for in the full production run. Rule libraries can be developed to allow content administrators to quickly build autotagging processes for common elements over a wider variety of document types. In this way the automated-tagging process can be optimized over time, eliminating a vast majority of the manual markup requirements for XML publishing.
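Grouping rule components into a rule and then testing that rule against marked-up samples might look like the following Python sketch. The component names, the `invoicenumber` element, and the sample texts are hypothetical; regular expressions stand in here for the drag-and-drop tooling described above, which is a substitution made for illustration.

```python
import re

# Hypothetical component library: each named component is a regex fragment.
COMPONENTS = {
    "keyword_invoice": r"Invoice",
    "space": r"\s+",
    "hash": r"#",
    "digits": r"\d+",
}


def build_rule(*names):
    """Group components, in order, into one compiled pattern."""
    return re.compile("".join(f"(?:{COMPONENTS[n]})" for n in names))


# A reusable rule library keyed by element name.
RULE_LIBRARY = {
    "invoicenumber": build_rule("keyword_invoice", "space", "hash", "digits"),
}


def check_rule(element, samples):
    """Compare a rule with manual markup.

    samples is a list of (text, expected) pairs, where expected is the span a
    person tagged, or None. Returns the cases where the rule disagrees.
    """
    rule = RULE_LIBRARY[element]
    failures = []
    for text, expected in samples:
        found = rule.search(text)
        actual = found.group(0) if found else None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


samples = [
    ("Re: Invoice #10442 is overdue", "Invoice #10442"),
    ("Invoice  #7", "Invoice  #7"),
    ("Invoice No. 88", "Invoice No. 88"),  # a format the rule does not cover yet
    ("No invoice attached", None),
]
failures = check_rule("invoicenumber", samples)
```

Each entry in `failures` points at an exception or anomaly found in the samples, which is the signal for refining a rule or adding a new one before the full production run.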
Practical Examples
A rule for identifying an element begins with an examination of what patterns constitute the data. Refer to the date January 1, 2002, as seen below, for an example. The process for creating a rule based on this character string requires defining the pattern for each component (month, day, comma, year) and associating the block with the DTD. Each pattern is created through drag-and-drop by simply highlighting the component with a mouse click. The completed pattern file, composed of multiple patterns, would then look like Listing 1. The pattern in Listing 1 would produce the following in the XML output:
<document>

Common rules, such as the one above, can be stored in rules libraries and reused for multiple document types. Many observers believe that only through such automated markup techniques can XML publishing be efficiently incorporated in high-volume content applications consisting of mixed-structure content.

Further developments in the use of intelligent rule-based algorithms to automate the markup process will emerge for applications in the legal, financial, technical, medical, regulatory, and other fields involving mixed-structure content. These developments will involve both Boolean operations and other forms of artificial intelligence, such as neural networks, to produce XML markups.
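The date rule walked through in the Practical Examples section (month, day, comma, year) can be approximated with a regular expression that has one named group per component. This is only a sketch: it is not the pattern file format of Listing 1, and the `<date>`, `<month>`, `<day>`, and `<year>` element names are hypothetical stand-ins for whatever the target DTD defines.

```python
import re
from xml.sax.saxutils import escape

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# One named pattern per component of the date, in document order.
DATE_RULE = re.compile(
    rf"\b(?P<month>{MONTHS})\s+"
    r"(?P<day>\d{1,2})"
    r"(?P<comma>,)\s*"
    r"(?P<year>\d{4})\b"
)


def tag_dates(text):
    """Wrap every date of the form 'January 1, 2002' in hypothetical date tags."""
    def replace(match):
        # The comma is matched as a component but not emitted in the output.
        return ("<date><month>{month}</month><day>{day}</day>"
                "<year>{year}</year></date>").format(**match.groupdict())
    # Escape markup characters first so the surrounding text stays well formed.
    return DATE_RULE.sub(replace, escape(text))


tagged = tag_dates("Signed on January 1, 2002 in Boston.")
print(tagged)
```

Because the rule is built from named components, the same month and year fragments could be reused in other rules, which is the kind of reuse a rules library is meant to support.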