Content Management Part 2
By: Jim Gabriel
Jul. 10, 2001 12:00 AM
The golden rule of a content management system is this: the day you take it to production is the day you start work on the next version of the system. Content contributors will submit change requests for the document structures and input formats, publishers will ask for more metadata to enable more sophisticated delivery of content, and editors will ask for better workflow. And you may discover major design flaws in the logic governing how the documents are supposed to interact. So you break open the definitions and trash the database. Or do you?
"Schema evolution? No need for it. I believe our users get it right first time," said the CEO of a leading supplier of content management software, at the Seybold Boston conference, April 2001.
Research and practical experience show that fault-tolerant XML and XML schema evolution are of critical importance to the successful development and management of any complex XML-based application, especially a content management system. This article addresses document-centric content management systems, as used in corporate publishing, content syndication, and conventional publishing activities.
Imagine the luxurious point at the beginning of the project when everything is still a clean sheet. It's day one. You've selected and purchased a content management system. It's still in the box. Here's a list of the things that you now have to do:
- Convert existing material
- Rewrite as appropriate
- Add new content
- Go live and stay live!
These steps become less critical the further along you get. The most important ones are the information-mapping exercise and the way you build the XML environment. At the beginning of the project, you have the power to build a system that allows you to evolve schemas and move with the changing requirements of your organization. Get the early stages wrong, and you'll find yourself locked into the kind of logic trap that there's no recovering from - version 37 of your DTDs and no room for any more changes.
Carry Out the Information-Mapping Exercise
Guiding Principles
Redundancy is dangerous. Think of your system as a highly tuned, normalized database, rather than as a place where documents are kept. Your application - the reason you're building this system in the first place - is like a well-built database application. The principles are the same. Data is enriched with metadata. Relationships are defined and enforced. Referential integrity constraints are factored in. Links are not allowed to die.
Building the XML Environment
The trick, however, is to implement the system using an iterative process of development. This means building the system in a way that allows you to have round-trip access to your starting point - the map or model - as opposed to locking you into the cascading model in which you can't get back to the starting point. This is a complicated subject that should not be treated lightly, and which I discuss later.
User Interface
Another good example is to state that a fixed range of choices should always appear as a dropdown list if there are more than seven choices, and a listbox if fewer. These criteria affect the way the whole application is built (see Figures 2 and 3). In many cases you won't have the chance to influence the user interface without programming against the application programming interface (API) of the system you've purchased, which can quickly become prohibitively expensive. It's worth undertaking this analysis of your requirements before assembling the request for proposal (RFP) from the supplier of the content management system in the first place.
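A criterion like the dropdown-versus-listbox rule can be written down precisely enough to hand to a supplier. The following is a minimal sketch only: the function name `pick_widget` and the constant are hypothetical, and because the rule as stated doesn't cover exactly seven choices, the sketch assumes a listbox in that case.

```python
# Hypothetical sketch of the UI rule described above; it is not part of
# any content management system's API.
DROPDOWN_THRESHOLD = 7  # from the rule: more than seven choices -> dropdown


def pick_widget(choices):
    """Return the widget type to use for a fixed range of choices."""
    # Assumption: exactly seven choices is treated like "fewer" (listbox),
    # since the rule only distinguishes "more than seven" from "fewer".
    if len(choices) > DROPDOWN_THRESHOLD:
        return "dropdown"
    return "listbox"
```

Recording the rule in one place like this means a later change to the threshold is a one-line edit rather than a hunt through every form definition.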
Small Documents
The alternative is to keep the illustration in the place where it is first used - in the surrounding body paragraphs of a large chapter, for example - and accept that you can't index it, reuse it elsewhere, or apply simple rules-based publishing to suppress it or convert it to a popup. Admittedly, you can write as many scripts and XSLT transformations as you like to achieve this functionality, and you can even hard-code publishing programs using the API of the content management system (if there is one), assuming that you have the time and skills in-house. But it's not a cost-effective approach. Maintenance is expensive, scalability is reduced, and bugs are more likely to occur.
Remember, any content management system is only as versatile as the structures you impose on it. This means that the monolithic structures so typical of SGML environments (using such epic DTDs as DocBook) are not advisable. Your corporate database records are not efficient with hundreds of fields, and you should think of your documents in the same way. Build small building blocks and publish the larger things that you can construct with those building blocks.
However, small topics increase the administration effort required of content contributors and publishers. You should also consider locking issues and transaction control when multiple authors are sharing material. On the other hand, small topics can significantly increase the quality, flexibility, and performance of the runtime system.
It's advisable to find a sensible level of categorization when defining document types. Look for the commonality in your definitions and exploit it. Remember that every structure you define potentially requires specific naming when handling style, transformation, and other actions or properties. This increases the work required to go live, and decreases the maintainability of the system.
Imagine, for example, that you're building a system for publishing a catalog for use in an online marketplace. Your catalog describes many different things, some of which are raw materials and some processed. You could choose to define separate document types for raw materials and processed materials, or you could define one for both and include an element or attribute that lets you differentiate between the two. The latter makes it easier to maintain.
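As a rough illustration of the single-document-type option, the sketch below uses one `material` element with a `kind` attribute to tell raw from processed materials. The element names, attribute name, and sample data are all hypothetical; only the idea of one shared structure plus a differentiating attribute comes from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: element and attribute names are illustrative only.
CATALOG_XML = """
<catalog>
  <material id="m1" kind="raw"><name>Iron ore</name></material>
  <material id="m2" kind="processed"><name>Steel sheet</name></material>
  <material id="m3" kind="raw"><name>Timber</name></material>
</catalog>
"""


def names_by_kind(xml_text, kind):
    """Return the names of all materials whose 'kind' attribute matches."""
    root = ET.fromstring(xml_text)
    return [
        material.findtext("name")
        for material in root.findall("material")
        if material.get("kind") == kind
    ]


raw_names = names_by_kind(CATALOG_XML, "raw")
processed_names = names_by_kind(CATALOG_XML, "processed")
```

Because both kinds share one structure, a stylesheet or transformation has a single element name to match, and a third kind of material needs a new attribute value rather than a new document type.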
Maintainability
You could drop the element A and replace it wherever it has been used with the sequence (B, C). You could redefine A to contain the child content (B, C) and nothing else (that is, not have the content be mixed).
If your entire content consists of document instances using that DTD, your whole system will probably break. If element A occurs only in one small document type in a set of many other document types, most of the system will survive the change. Conversely, if element A is used in many other DTDs, you have a problem. How do you know where element A has been used? How do you identify in every DTD where A is declared that it's a reused object definition from another source?
Analyzing the impact of a potential change to a complex XML environment is currently not a scientific process. You can search through DTDs and schemas for names of objects that you know are going to change. You can make a change in a DTD or schema and parse all the derived document instances and see what breaks. You can look for matching patterns in stylesheets, XSLT files, XML document instances, and so on. If you have a simple system and a small deployed set of documents, you can probably afford to spend some time searching and replacing, and cranking the existing document instances back in line with the new version of reality. If you have a large, complex system, especially one that can't afford any downtime, you have a serious problem. The solution is usually to move up a version and leave some legacy alive under an older version.
Content management systems are notoriously unable to support change. Take DTDs, for example. Most content management systems associate a document with a DTD that is stored elsewhere in the system without explicitly understanding the DTD. The document is fed to the authoring software together with its DTD, and parsed when it's next checked back in. Manipulating the DTD in the content management system is not common functionality.
How can the system understand the change that you need to implement? As Figure 4 shows, maintainable content management environments should implement the cyclical, or spiral, process of development. True evolution round-trips through the modeling phase to build on the good and throw away the bad.
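The manual search described earlier in this section, looking through DTDs for the names of objects you know are going to change, can be approximated with a short script. This is a sketch under stated assumptions: the DTD file names and contents are invented, and the pattern matching is deliberately naive (it ignores parameter entities, comments, and attribute-list declarations).

```python
import re

# Hypothetical DTD sources keyed by file name; in practice these would be
# read from disk or exported from the content management system.
DTDS = {
    "chapter.dtd": """
        <!ELEMENT chapter (title, (para | A)+)>
        <!ELEMENT A (#PCDATA | B | C)*>
        <!ELEMENT para (#PCDATA | A)*>
    """,
    "glossary.dtd": """
        <!ELEMENT glossary (entry+)>
        <!ELEMENT entry (term, definition)>
    """,
}

# Matches one element declaration: <!ELEMENT name content-model>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+(.*?)>", re.DOTALL)


def find_uses(dtds, element_name):
    """List (file, parent element) pairs whose content model names element_name."""
    # Match the name as a whole token so "A" does not match inside "#PCDATA".
    token = re.compile(
        r"(?<![\w.:#-])" + re.escape(element_name) + r"(?![\w.:-])"
    )
    uses = []
    for filename, text in sorted(dtds.items()):
        for parent, model in ELEMENT_DECL.findall(text):
            if parent != element_name and token.search(model):
                uses.append((filename, parent))
    return uses


uses_of_a = find_uses(DTDS, "A")
```

A listing like this tells you which content models mention A. It doesn't tell you which stylesheets, transformations, or document instances depend on it, which is why the process stays closer to detective work than to science.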
Fault-Tolerant XML
Managing an XML environment requires an OO approach to setting it up in the first place. Using true OO techniques at the design level, the element A (that is, object A) can exist in only one place. The design level should be an abstract, conceptual space where the rules that govern the use of the content management system can be recorded. Structures modeled in the design level can reference object A, but not copy in the original object and cut the link with the starting point. Such structures should be used for generating DTDs, schemas, and associated properties. The content management system shouldn't lose the link with the conceptual design level.
Content management systems need to evolve. Unfortunately, changing the content model in any way for any schema usually means breaking all existing document instances. Fixing them can cost as much effort as the original implementation. This is because XML environments typically lock you into a linear process in which there is no intelligent modeling space in which your content models can be dealt with in a way that handles all the dependencies correctly. The moment you generate deployment files that use the definitions that you're expressing in XML (DTDs, stylesheets, Java classes, XSLT transformations, stored procedures, and so on), you're running the risk of locking yourself into just such a linear application development process. Certainly, if these deployment files can't be automatically generated, the process is already linear.
As Figure 5 shows, linear development processes are expensive when you need to make changes. Ideally, systems evolve through cyclical design. A linear development process is the equivalent of what used to be known as the "cascading" or "waterfall" method of development, in which each phase hands over irrevocably to the next phase, and no return is possible. In other words, you design, then develop, then test, then deploy. If a fix is required, you start again.
In a perfect system, a model-driven architecture allows you to use a cyclical development process to round-trip back to the design stage when change is needed, thus automating the regeneration of environments and the implementation of changes in the documents. Ideally, a designer should be able to make a change to an object in a conceptual design space, and then analyze the impact of that change to the other objects and to the deployed document instances. The designer should be able to automate the implementation of the change both in the way the environment has been built and in the deployed document instances themselves. Strictly speaking, the content management system should be able to create new DTDs on the fly according to a requirement from the outside world, and transform an entire database of content to fit the new DTD.
Barbadosoft was founded to provide the XML infrastructure software to answer these needs. Truly fault-tolerant and maintainable XML should be an essential part of any XML application that deals with sources and data or document instances. The Barbadosoft solution takes the form of an "XML virtual machine," or programmable infrastructure, that manages the sources and deployed objects of an XML application through the key phases of the application's life cycle: design, development, deployment, and maintenance. Barbadosoft provides XML object modeling, impact analysis, change management, and infinitely extensible property sets in a rich model of interdependencies and relationships.
In the absence of such infrastructure, you can help yourself by carefully considering each design decision and trying to imagine the impact of a future change before building the environment. If the repercussions of a change in a fine-grained, small-topic architecture are potentially too great, use larger document types. It's a fine balance. Monolithic DTDs generally reduce maintainability, and smaller document types increase maintainability.
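To make the idea of a design level concrete, here is a toy sketch in which each element is defined exactly once and the DTDs are generated from that single model. It doesn't describe any vendor's product. The dictionaries, function names, and file names are all hypothetical, and the name-matching pattern would wrongly treat the DTD keywords EMPTY and ANY as element names.

```python
import re

# Hypothetical design-level model: each element is defined exactly once,
# and document types reference elements by name instead of copying them.
ELEMENTS = {
    "A": "(#PCDATA | B | C)*",
    "B": "(#PCDATA)",
    "C": "(#PCDATA)",
    "title": "(#PCDATA)",
    "note": "(title, A)",
    "summary": "(A+)",
}

# Each generated DTD is named after its root element's document type.
DOCUMENT_TYPES = {"note.dtd": "note", "summary.dtd": "summary"}

# An element name inside a content model (skips the #PCDATA keyword).
NAME = re.compile(r"(?<![#\w.:-])[A-Za-z_][\w.:-]*")


def children(name):
    """Return the element names referenced by one element's content model."""
    return NAME.findall(ELEMENTS[name])


def closure(root):
    """Return every element reachable from root, in first-seen order."""
    seen, stack = [], [root]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.append(name)
            stack.extend(reversed(children(name)))
    return seen


def generate_dtd(root):
    """Generate DTD text for one document type from the shared model."""
    return "\n".join(f"<!ELEMENT {n} {ELEMENTS[n]}>" for n in closure(root))


def impact_of(element):
    """List the document types whose generated DTD would change."""
    return sorted(
        filename
        for filename, root in DOCUMENT_TYPES.items()
        if element in closure(root)
    )
```

With this arrangement, `impact_of("A")` names both document types before anything is touched. Redefining A as `(B, C)` is a single edit to `ELEMENTS`, after which both DTDs can be regenerated. Bringing the deployed document instances into line with the new definition is the harder half of the problem, and the sketch doesn't attempt it.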
Go from Where You Are Now to Where You Want to Be
A significant number of new users of XML-based content management systems come from an SGML background. Conversion from SGML to XML is relatively straightforward in terms of the documents and DTDs. (See James Clark's sx product at www.jclark.com/sp/sx.htm.) After all, most SGML without a DTD is nearly well-formed XML already, and most SGML converts to XML very well. SGML to XML is a "down translation," however (going from more complicated to less complicated), which means that you might need to make some decisions if the conversion encounters functionality in the SGML that's not supported in XML. You'll have problems with more obscure SGML constructs such as HyTime, and if this daunts you I would advise seeking professional help.
If your existing data is in any structured database format, you're lucky. Many database systems nowadays support an XML export function, which gets you most of the way there. If such a function doesn't exist, serializing the data and converting it to XML should be an easy task for even the most humble writer of Perl scripts (or the equivalent). Once you have XML, a relatively straightforward series of transformations using XSLT should convert the data into the structure you've built for the new content management system.
Converting unstructured data such as Adobe FrameMaker or Microsoft Word files is very difficult. The "Save As XML" functions of desktop publishing and word-processing software are so unreliable that they're dangerous. Unless you've been very strict in the use of templates, paragraph-naming conventions, and styles, attempting to derive structure from the data can be almost impossible. And even if you have been strict, the chances of being able to match old document styles one-on-one into new XML document types are small.
I discovered to my cost in a similar project (I had around 10,000 pages of FrameMaker files, roughly the same in WinHelp format, and a smattering of Word documents) that while it is possible to use various "Save As" formats, scripting languages, and other conversion tools, the time and rewriting required to make it work outweighed the advantages of the automation. In that example, the purpose of the project was to replace a traditional corporate technical publishing process with one driven by a content management system: do more with less, reduce costs, increase usability, and so on. The new way of authoring documents in the new system, coupled with the radically different way of accessing the information from a user's perspective, meant that almost all the older material deserved to be rewritten rather than blindly converted.
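Structured data is a much easier case. Where a database has no XML export function, the serialization step mentioned earlier can be a few lines of script. The sketch below uses Python in place of Perl. The sample rows and the tag names `records` and `record` are hypothetical, and it assumes every column name is already a valid XML element name.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, as they might come back from a database query.
ROWS = [
    {"id": 1, "name": "Iron ore", "unit": "tonne"},
    {"id": 2, "name": "Steel sheet", "unit": "sheet"},
]


def rows_to_xml(rows, root_tag="records", row_tag="record"):
    """Serialize a list of row dictionaries as a flat XML document."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Assumption: column names are valid XML element names.
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml(ROWS)
```

The result is a flat, generic structure. Reshaping it into the document types built for the new system is the job of the XSLT transformations.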
Go Live and Stay Live!
It's well worth trying to change at least one important DTD before going live, and seeing what happens. Your ability to future-proof the application will make or break the system when emergencies happen (such as being asked by Commerce One to move from xCBL 2.0 to xCBL 3.0, or 4.0). Write a set of procedures. Make sure that a delegate can understand and do the work. This brings me to my final piece of advice: document everything you do. Make sure that at the very least you have the following documents in place: the information map, the descriptions of the DTDs, and the procedures for keeping the system alive when you need to make changes. For change, you surely will.