Content Management Part 2

The golden rule of a content management system is this: the day you take it to production is the day you start work on the next version of the system.

Content contributors will submit change requests for the document structures and input formats, publishers will ask for more metadata to enable more sophisticated delivery of content, and editors will ask for better workflow.

And you may discover major design flaws in the logic governing how the documents are supposed to interact. So you break open the definitions and trash the database. Or do you?

"Schema evolution? No need for it. I believe our users get it right first time," said the CEO of a leading supplier of content management software, at the Seybold Boston conference, April 2001.

Research and practical experience show that fault-tolerant XML and XML schema evolution are of critical importance to the successful development and management of any complex XML-based application, especially a content management system.

This article addresses document-centric content management systems, as used in corporate publishing, content syndication, and conventional publishing activities. Imagine the luxurious point at the beginning of the project when everything is still a clean sheet. It's day one. You've selected and purchased a content management system. It's still in the box.

Here's a list of the things that you now have to do:

  • Carry out an information-mapping exercise
  • Build the XML environment
  • Go from where you are now to where you want to be
     - Convert existing material
     - Rewrite as appropriate
     - Add new content
     - Go live and stay live!

    These steps become less critical the further along you get. The most important ones are the information-mapping exercise and the way you build the XML environment.

    At the beginning of the project, you have the power to build a system that allows you to evolve schemas and move with the changing requirements of your organization. Get the early stages wrong, and you'll find yourself locked into the kind of logic trap that there's no recovering from - version 37 of your DTDs and no room for any more changes.

    Carry Out the Information-Mapping Exercise
    In a previous article (XML-J, Vol. 2, issue 6), we explained that mapping the information in your organization means (1) charting what kinds of information the users of the system expect to receive and (2) defining how you intend to build that information at the authoring stage. An information map formally describes the following (see Figure 1):

    • Common information types
    • Reusable objects (textual or otherwise)
    • Architecture of the information system from an end-user's perspective
    • Architecture of the information system from a contributor's perspective
    • Fine-grained information topics (concept, example, illustration, table, task, purchase order, invoice, and so on)
    • Publication structures that knit topics together in an appropriate way
    • Usage patterns for typical end users
    • Procedures for contributors
    • Workflow
    The information map provides a blueprint for the whole system. A system builder or integrator should be able to build most of the content management system from the information map.

    Guiding Principles
    Your guiding principles should be to keep your documents small, to build a modular system, to reuse content and definitions as much as possible, and to create powerful metadata. Your content management system gives you the power to use single-source publishing techniques, so do that.

    Redundancy is dangerous. Think of your system as a highly tuned, normalized database, rather than as a place where documents are kept. Your application - the reason you're building this system in the first place - is like a well-built database application. The principles are the same. Data is enriched with metadata. Relationships are defined and enforced. Referential integrity constraints are factored in. Links are not allowed to die.

    Building the XML Environment
    From an XML point of view, the information-mapping process will provide a starting point for DTDs and schemas, a definition of which objects are reused and which are not, style considerations, transformation considerations, workflow, and so on.

    The trick, however, is to implement the system using an iterative process of development. This means building the system in a way that gives you round-trip access to your starting point - the map or model - rather than locking you into a cascading model from which you can't get back to the starting point. This is a complicated subject that shouldn't be treated lightly; I discuss it in more detail later.

    User Interface
    From a content contributor's perspective, the user interface is equivalent to the visible part of any database application. It's in your power to design a very good user interface, so plan one at the information map stage. An example of this is to state that a content contributor should never have to explicitly choose a default value, or supply information that the system could obtain from profiles, or choose some other kind of variable.

    Another good example is to state that a fixed range of choices should always appear as a dropdown list if there are more than seven choices, and a listbox if there are fewer. These criteria affect the way the whole application is built (see Figures 2 and 3).
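
    If you state criteria like these in the information map, they can be encoded once and applied wherever the application renders a fixed range of choices. Here's a toy sketch in Python - the function name and signature are my own invention, not part of any CMS API:

        # Hypothetical sketch: encode the seven-choice rule once so every
        # generated form applies it consistently.
        def pick_widget(choices: list[str]) -> str:
            """More than seven fixed choices -> dropdown; otherwise a listbox."""
            return "dropdown" if len(choices) > 7 else "listbox"

        print(pick_widget(["concept", "task", "example"]))     # listbox
        print(pick_widget([f"topic-{i}" for i in range(12)]))  # dropdown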

    In many cases you won't have the chance to influence the user interface without programming against the application programming interface (API) of the system you've purchased, which can quickly become prohibitively expensive. It's worth undertaking this analysis of your requirements before assembling the request for proposal (RFP) from the supplier of the content management system in the first place.

    Small Documents
    Small documents are concise topics with a discrete semantic value. An illustration, for example, works very well as a document type in its own right and should be defined as one. Storing the illustration as a separate, small topic allows you to reuse it easily in another document, or endow it with runtime functionality, such as presenting it as a thumbnail image that expands to a popup window when the user clicks on it.
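
    To make that runtime functionality concrete, here's a minimal sketch of such a rules-based rendering step. It assumes the third-party lxml package, and the illustration vocabulary and file paths are invented for the example:

        # Sketch: an illustration stored as its own small topic, rendered as
        # a thumbnail that opens a popup. Vocabulary and paths are invented.
        from lxml import etree

        illustration = etree.XML(
            '<illustration id="fig-pump">'
            '<title>Pump assembly</title>'
            '<graphic href="pump.png"/>'
            '</illustration>'
        )

        to_thumbnail = etree.XSLT(etree.XML('''
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="xml" omit-xml-declaration="yes"/>
          <!-- One template per presentation decision: thumbnail plus popup. -->
          <xsl:template match="illustration">
            <a href="popup/{@id}.html" target="_blank">
              <img src="thumbs/{graphic/@href}" alt="{title}"/>
            </a>
          </xsl:template>
        </xsl:stylesheet>'''))

        print(str(to_thumbnail(illustration)))
        # <a href="popup/fig-pump.html" target="_blank"><img .../></a>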

    The alternative is to keep the illustration in the place where it is first used - in the surrounding body paragraphs of a large chapter, for example - and accept that you can't index it, reuse it elsewhere, or apply simple rules-based publishing to suppress it or convert it to a popup.

    Admittedly, you can write as many scripts and XSLT transformations as you like to achieve this functionality, and you can even hard-code publishing programs using the API of the content management system (if there is one), assuming that you have the time and skills in-house. But it's not a cost-effective approach. Maintenance is expensive, scalability is reduced, and bugs are more likely to occur.

    Remember, any content management system is only as versatile as the structures you impose on it. This means that the monolithic structures so typical of SGML environments (using such epic DTDs as DocBook) are not advisable. Your corporate database records wouldn't be efficient with hundreds of fields apiece, and you should think of your documents in the same way. Define small building blocks and publish the larger structures that you construct from them.

    However, small topics increase the administration effort required of content contributors and publishers. You should also consider locking issues and transaction control when multiple authors are sharing material. On the other hand, small topics can significantly increase the quality, flexibility, and performance of the runtime system.

    It's advisable to find a sensible level of categorization when defining document types. Look for the commonality in your definitions and exploit it. Remember that every structure you define potentially requires specific naming when handling style, transformation, and other actions or properties.

    This increases the work required to go live, and decreases the maintainability of the system. Imagine, for example, that you're building a system for publishing a catalog for use in an online marketplace. Your catalog describes many different things, some of which are raw materials and some processed. You could choose to define separate document types for raw materials and processed materials, or you could define one for both and include an element or attribute that lets you differentiate between the two. The latter makes it easier to maintain.
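
    Here's a minimal sketch of that single-document-type approach, again assuming lxml; the element and attribute names are invented:

        # One "material" document type; a kind attribute distinguishes raw
        # from processed stock. Names are hypothetical, not a real vocabulary.
        from io import StringIO
        from lxml import etree

        dtd = etree.DTD(StringIO('''
        <!ELEMENT material (name, description)>
        <!ATTLIST material kind (raw | processed) #REQUIRED>
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT description (#PCDATA)>
        '''))

        doc = etree.XML(
            '<material kind="raw">'
            '<name>Iron ore</name>'
            '<description>Unrefined, graded by purity.</description>'
            '</material>'
        )
        print(dtd.validate(doc))  # True - one DTD covers both kinds of material

    Adding a new kind of material is then an attribute-list change, not a new document type with its own styles and transformations.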

    Maintainability
    Imagine that you need to make a change to a DTD. An element (A) no longer provides the scope you need, and should be replaced by a sequence of two other elements (B, C). There are various ways of implementing this.

    You could drop element A and replace it, wherever it has been used, with the sequence (B, C). Or you could redefine A to contain the child sequence (B, C) and nothing else (that is, not allow mixed content). If your entire content consists of document instances using that DTD, your whole system will probably break. If element A occurs only in one small document type among many, most of the system will survive the change.
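
    When the change does go ahead, existing instances can often be migrated mechanically. The sketch below assumes lxml and one reasonable - but invented - migration policy: an XSLT identity transform plus a single override that rewrites A into the new sequence:

        # Migrate instances when A is redefined to contain (B, C).
        from lxml import etree

        migrate = etree.XSLT(etree.XML('''
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="xml" omit-xml-declaration="yes"/>
          <!-- Identity: copy nodes and attributes as-is. -->
          <xsl:template match="@*|node()">
            <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
          </xsl:template>
          <!-- Migration: A's old text becomes B; C starts empty for authors. -->
          <xsl:template match="A">
            <A><B><xsl:value-of select="."/></B><C/></A>
          </xsl:template>
        </xsl:stylesheet>'''))

        old = etree.XML('<doc><A>legacy content</A><other/></doc>')
        print(str(migrate(old)))
        # <doc><A><B>legacy content</B><C/></A><other/></doc>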

    Conversely, if element A is used in many other DTDs, you have a problem. How do you know where element A has been used? How do you establish, in every DTD where A is declared, that it's a reused object definition from another source?

    Analyzing the impact of a potential change to a complex XML environment is currently not a scientific process. You can search through DTDs and schemas for names of objects that you know are going to change. You can make a change in a DTD or schema and parse all the derived document instances and see what breaks. You can look for matching patterns in stylesheets, XSLT files, XML document instances, and so on.
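
    The second of those checks - change the definition, reparse everything, see what breaks - is easy to automate. A minimal sketch, assuming lxml, well-formed instances, and a hypothetical repository layout:

        # Revalidate every stored instance against the candidate DTD and
        # report what breaks. File names and layout are hypothetical.
        from pathlib import Path
        from lxml import etree

        with open("catalog-v2.dtd") as f:      # the proposed new version
            new_dtd = etree.DTD(f)

        broken = []
        for path in Path("repository").rglob("*.xml"):
            doc = etree.parse(str(path))
            if not new_dtd.validate(doc):
                broken.append(path)
                print(path, new_dtd.error_log.last_error)

        print(f"{len(broken)} instance(s) would break under the new DTD")

    Run before and after a schema change, the report gives a crude but honest measure of the blast radius.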

    If you have a simple system and a small deployed set of documents, you can probably afford to spend some time searching and replacing, and cranking the existing document instances back in line with the new version of reality. If you have a large, complex system, especially one that can't afford any downtime, you have a serious problem. The solution is usually to move up a version and leave some legacy alive under an older version.

    Content management systems are notoriously unable to support change. Take DTDs, for example. Most content management systems associate a document with a DTD that is stored elsewhere in the system without explicitly understanding the DTD. The document is fed to the authoring software together with its DTD, and parsed when it's next checked back in.

    Manipulating the DTD in the content management system is not common functionality. How can the system understand the change that you need to implement?

    As Figure 4 shows, maintainable content management environments should implement the cyclical, or spiral, process of development. True evolution round-trips through the modeling phase to build on the good and throw away the bad.

    Fault-Tolerant XML
    Consider the primary benefits (in an XML sense) of object-oriented (OO) programming: polymorphism, encapsulation, and inheritance. Change element A at the source, and all references to it or extensions of it should automatically inherit that change, allowing a designer to alter a source definition and regenerate a complete environment at the touch of a magic button. This, alas, is not the case in current content management product offerings.

    Managing an XML environment requires an OO approach to setting it up in the first place. Using true OO techniques at the design level, element A (that is, object A) exists in only one place. The design level should be an abstract, conceptual space where the rules that govern the use of the content management system are recorded. Structures modeled at the design level can reference object A, but must not copy the original object and cut the link with its source.

    Such structures should be used for generating DTDs, schemas, and associated properties. The content management system shouldn't lose the link with the conceptual design level.

    Content management systems need to evolve. Unfortunately, changing the content model in any way for any schema usually means breaking all existing document instances. Fixing them can cost as much effort as the original implementation. This is because XML environments typically lock you into a linear process in which there is no intelligent modeling space in which your content models can be dealt with in a way that handles all the dependencies correctly.

    The moment you generate deployment files that use the definitions that you're expressing in XML (DTDs, stylesheets, Java classes, XSLT transformations, stored procedures, and so on), you're running the risk of locking yourself into just such a linear application development process. Certainly, if these deployment files can't be automatically generated, the process is already linear.

    As Figure 5 shows, linear development processes are expensive when you need to make changes. Ideally, systems evolve through cyclical design.

    A linear development process is the equivalent of what used to be known as the "cascading" or "waterfall" method of development, in which each phase hands over irrevocably to the next phase, and no return is possible. In other words, you design, then develop, then test, then deploy. If a fix is required, you start again.

    In a perfect system, a model-driven architecture allows you to use a cyclical development process to round-trip back to the design stage when change is needed, thus automating the regeneration of environments and the implementation of changes in the documents.

    Ideally, a designer should be able to make a change to an object in a conceptual design space, and then analyze the impact of that change on the other objects and on the deployed document instances. The designer should be able to automate the implementation of the change both in the way the environment is built and in the deployed document instances themselves.

    Strictly speaking, the content management system should be able to create new DTDs on the fly according to a requirement from the outside world, and transform an entire database of content to fit the new DTD.

    Barbadosoft was founded to provide the XML infrastructure software to answer these needs. Truly fault-tolerant and maintainable XML should be an essential part of any XML application that deals with sources and data or document instances.

    The Barbadosoft solution takes the form of an "XML virtual machine," or programmable infrastructure, that manages the sources and deployed objects of an XML application through the key phases of the application's life cycle: design, development, deployment, and maintenance. Barbadosoft provides XML object modeling, impact analysis, change management, and infinitely extensible property sets in a rich model of interdependencies and relationships.

    In the absence of such infrastructure, you can help yourself by carefully considering each design decision and trying to imagine the impact of a future change before building the environment. If the repercussions of a change in a fine-grained, small-topic architecture are potentially too great, use larger document types. It's a fine balance: monolithic DTDs generally reduce maintainability, and smaller document types increase it.

    Go from Where You Are Now to Where You Want to Be
    Most people have to integrate new content management systems with large sets of existing data. Often, the new system is designed to eventually replace the old way of doing things, which means the existing content needs to be migrated into, or rewritten for, the new system. The cost of getting your existing data into the new content management system depends on the format it's in. Of course, if you have no existing data or other legacy material to convert, your system can go live very quickly.

    A significant number of new users of XML-based content management systems come from an SGML background. Conversion from SGML to XML is relatively straightforward in terms of the documents and DTDs (see James Clark's sx product at www.jclark.com/sp/sx.htm). After all, most SGML without its DTD is nearly well-formed XML already.

    Most SGML converts to XML very well. SGML-to-XML is a "down translation," however (going from more complicated to less complicated), which means you might need to make decisions wherever the conversion encounters functionality in the SGML that's not supported in XML. You'll have problems with the more obscure SGML constructs, such as HyTime; if this daunts you, I'd advise seeking professional help.

    If your existing data is in any structured database format, you're lucky. Many database systems nowadays support an XML export function, which gets you most of the way there. If such a function doesn't exist, serializing the data and converting it to XML should be an easy task for even the most humble writer of Perl scripts (or the equivalent). Once you have XML, a relatively straightforward series of transformations using XSLT should convert the data into the structure you've built for the new content management system.
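
    Here's a sketch of that serialization step, using only the Python standard library; the database schema and element names are invented, and a real export would follow the structures in your information map:

        # Serialize relational rows to XML with the standard library. The
        # database, table, and element names are invented for illustration.
        import sqlite3
        import xml.etree.ElementTree as ET

        conn = sqlite3.connect("legacy.db")
        root = ET.Element("materials")
        for name, kind in conn.execute("SELECT name, kind FROM material"):
            item = ET.SubElement(root, "material", kind=kind)
            ET.SubElement(item, "name").text = name

        ET.ElementTree(root).write("materials.xml",
                                   encoding="utf-8", xml_declaration=True)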

    Converting unstructured data such as Adobe FrameMaker or Microsoft Word files is very difficult. The "Save As XML" functions of desktop publishing and word-processing software are so unreliable they're dangerous. Unless you've been very strict in the use of templates, paragraph-naming conventions, and styles, attempting to derive structure from the data can be almost impossible. And even if you have been strict, the chances of being able to map old document styles one-to-one onto new XML document types are small.

    I discovered to my cost in one such project (around 10,000 pages of FrameMaker files, roughly the same in WinHelp format, and a smattering of Word documents) that while it's possible to use various "Save As" formats, scripting languages, and other conversion tools, the time and rewriting required to make them work outweighed the advantages of the automation.

    In that example, the purpose of the project was to replace a traditional corporate technical publishing process with one driven by a content management system: do more with less, reduce costs, increase usability, and so on. The new way of authoring documents in the new system, coupled with the radically different way of accessing the information from a user's perspective, meant that almost all the older material deserved to be rewritten rather than blindly converted.

    Go Live and Stay Live!
    Whatever you do, don't try to do everything at once. Always keep the previous system running in parallel until everything has been tried and tested. If your content management system is complex, especially if it's playing a mission-critical role in a core aspect of your business, consider phasing in the introduction in a modular way.

    It's well worth trying to change at least one important DTD before going live, and seeing what happens. Your ability to future-proof the application will make or break the system when emergencies happen (such as being asked by Commerce One to move from xCBL 2.0 to xCBL 3.0, or 4.0). Write a set of procedures. Make sure that a delegate can understand and do the work.

    This brings me to my final piece of advice: document everything you do. Make sure that at the very least you have the following documents in place: the information map, the descriptions of the DTDs, and the procedures for keeping the system alive when you need to make changes. For change, you surely will.

    About the Author

    Jim Gabriel has authored tens of thousands of pages of technical documentation, ranging from entry-level tutorial material to programmers' reference manuals. He is literate in XML, SGML, and XSL, among others.
