XML Tips
Declaring Attributes And Entities In DTDs
By: Steve Hoenisch
Introductions to XML all too often ignore the power of the attribute. It gets neglected in favor of the element's ability to capture the structure of a document or the meaning of content. But in developing flexible, reusable document models and in capturing metainformation about structure or content, the attribute's overlooked utility quickly comes into focus. Overlooked, too, have been entities, with few introductions to XML freeing them from their shroud of mystery. They are, however, a powerful method for reusing content or code, both in documents and, as we'll see, in DTDs. To help you tap the potential of attributes and entities and use them in your XML documents, I'll explain how to specify them in a document type definition. In discussing attributes and entities, this article forms the second installment in a two-column primer on creating DTDs. After a quick review of what my last column (XML-J, Volume 2, issue 8) covered - the element-specific aspects of a DTD - the tutorial will guide you through the process of defining attributes and entities, culminating in a sample DTD for résumés.
Document Type Definitions
A document type definition, then, lays out the underlying grammar that describes the ways data may be structured in an XML document. Or, to put it more practically, a DTD defines how tags may be applied in a given document. To do so, DTDs combine syntax and operators to form explicit rules that mainly do the following:
Review
In a DTD the declarations for elements, attributes, and entities are prefaced in the same way, with an opening angle bracket followed by an exclamation mark: <!. After the exclamation mark comes one of three keywords written in all capitals - ELEMENT, ATTLIST, or ENTITY - that specifies the type of declaration. The declaration keyword is followed by a rule.
Let's examine the rule syntax for elements. Say you want all the XML documents in a set of résumés to contain a root element named resume. To declare an element, you use the <!ELEMENT> declaration:
<!ELEMENT resume ANY>
This statement declares that, as the DTD keyword ANY indicates, the resume element may appear with any combination of text and child nodes. The syntax of the <!ELEMENT> declaration is this:
<!ELEMENT elementName {rule}>
where {rule} may be replaced either by a DTD keyword in all capitals, such as EMPTY or ANY, as I did above, or by a parentheses-enclosed rule with one or more child elements separated either by commas, indicating that their appearance is required in the order specified, or by vertical bars, indicating choice.
In our example with the resume element above, the syntactic slot for the rule, filled by the keyword ANY, describes the legal content - the content model - of the resume element. In this case, because I used the keyword ANY, it can be anything from child elements to text. But using ANY does little to constrain our data, defeating the purpose of a DTD, which you may recall is to make explicit the permissible relationships and associations among the data in a set of XML documents. Thus, in the rule slot where I've used the keyword ANY, better constraints can and should be put in place. Instead of simply saying that the resume element can take any combination of child elements and text, the rule should state exactly what children the element may contain and in what order they may appear.
For instance, if I determine that every résumé in a set must have a root element whose children appear in a particular order, I can specify the child elements and their ordering by placing the child elements in parentheses and separating them by commas:
<!ELEMENT resume (name, contactInfo, experience, education)>
This DTD rule says that the resume element contains the children enclosed in parentheses. The commas in the rule stipulate that the elements must appear in the sequence in which they are listed. No other child elements or text are permitted directly under the resume node.
Such a rule, though, says little about the frequency with which each child element may appear. A set of DTD operators, called occurrence operators, lets developers generalize about the optionality and frequency of elements. If no occurrence operator is used, as in the rule above, the default is exactly one occurrence. The other occurrence operators are as follows:
? - zero or one occurrence (the element is optional but may not repeat)
* - zero or more occurrences (the element is optional and may repeat)
+ - one or more occurrences (the element is required and may repeat)
These operators can be combined in a single rule, as in this example:
<!ELEMENT experience (position, company, location?, task+, note*)>
You can also use parentheses to nest elements within a rule and then apply any of the occurrence operators to all the elements within the nested set. Example:
<!ELEMENT location ((street, suite)?, city, (state, zip)?)>
This rule says that the (street, suite) sequence is optional but may not be repeated (note, though, that if the street element is used, the suite element must be used too). Ditto with the (state, zip) sequence. The city element, however, must appear exactly once, and if the other elements are used, they must be positioned according to the order dictated by the comma-separated list.
You may also nest a choice of elements by placing them in parentheses and separating them by vertical bars instead of commas. For instance, the following rule says that the name element must contain a first name, optionally followed by a choice of a middle name or a middle initial, or by a nickname, followed by a required last name.
<!ELEMENT name (firstName, ( (middleName | middleInitial)? | (nickName)? )*, lastName)>
This can get complex quickly. For details on how to use parentheses to form complex rules, see Chapter 3, "Document Type Definitions," in XML in a Nutshell, written by Elliotte Rusty Harold and W. Scott Means (O'Reilly).
Many narrative-oriented XML documents like books and news stories allow mixed content - a choice of either text or child elements or a combination of both. For instance, our sample résumé has an as-yet undefined element called <task>. The purpose of the task element is to describe what the résumé's author did in a given job. This job description contains paragraphs, and the paragraphs may contain either text or child elements that themselves contain text. For instance, within a paragraph you may want to include tags that mark text for emphasis, set off citations, and allow you to insert a line break.
Here's an example of such markup (but without the line break element):
<paragraph>Served as consultant for editing, electronic production, and desktop publishing of one financial newsletter, called <cite>Securities Today</cite>, and the organization, design, and launch of two others.</paragraph>
<paragraph>All three became <emphasis>highly successful</emphasis> publications.</paragraph>
To declare a rule that allows content to mix text and child elements as in the example above, you must first declare the text and then list the other elements. Each entry in the rule must be separated by vertical bars to indicate choice, and the rule itself must be marked as optional and repeatable with the asterisk occurrence operator, like this:
<!ELEMENT paragraph (#PCDATA | cite | emphasis | br)*>
PCDATA, you may recall, is XML's name for standard text. PCDATA stands for parsed character data, which includes regular text characters except <, &, or the sequence ]]>. PCDATA also includes general entities, discussed below.
A DTD must declare the content model for each element used in the XML document. Since I've included the cite and emphasis elements in my content model for the paragraph element, I must also declare them. They contain only PCDATA, which, used alone, excludes other element tags:
<!ELEMENT cite (#PCDATA)>
<!ELEMENT emphasis (#PCDATA)>
Finally, empty elements, like my HTML-type <br> element in the rule for the paragraph element, may be declared using the EMPTY keyword:
<!ELEMENT br EMPTY>
This ensures that no content, whether other elements or parsed character data, may be placed within it.
Declaring mixed content can get tricky. The key is to remember these three points:
#PCDATA must be listed first in the rule.
The entries in the rule must be separated by vertical bars, not commas.
The rule as a whole must end with the asterisk occurrence operator.
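To see the mixed-content declarations working together, here is a minimal sketch of a small document with an internal DTD. The rule for task is an assumption made only for illustration (the column leaves task undefined), and the paragraph text is abbreviated from the example above.

```xml
<?xml version="1.0"?>
<!DOCTYPE task [
  <!-- Hypothetical wrapper rule: the column does not define task -->
  <!ELEMENT task (paragraph+)>
  <!-- Mixed content: #PCDATA first, vertical bars, trailing asterisk -->
  <!ELEMENT paragraph (#PCDATA | cite | emphasis | br)*>
  <!ELEMENT cite (#PCDATA)>
  <!ELEMENT emphasis (#PCDATA)>
  <!ELEMENT br EMPTY>
]>
<task>
  <paragraph>Edited <cite>Securities Today</cite>,<br/>a
    <emphasis>highly successful</emphasis> newsletter.</paragraph>
</task>
```

Because br is declared EMPTY, it is written here as the empty-element tag <br/>.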
Attributes
<!ATTLIST resume version
In this declaration resume is the name of the target element (the element we're declaring the attributes for) and version is the name of the attribute. This declaration is incomplete, however. It needs two other components before it can be completed and closed with an angle bracket. First, it needs a datatype, the simplest and most common of which is CDATA:
<!ATTLIST resume version CDATA
The CDATA datatype, which is written in all capital letters in DTDs, specifies that the quoted string that makes up the attribute's value in an XML document may contain any regular text characters except < and &, which must be written as entity references. Second, the declaration needs either a default value or an attribute modifier. For simplicity, I'll complete the attribute declaration with an attribute modifier, which must be prefixed with a number sign (#) and written in capital letters:
<!ATTLIST resume version CDATA #REQUIRED>
This declaration says that the resume element has an attribute named version whose value is some combination of regular text characters. A value for the attribute must be present, as stipulated by the #REQUIRED attribute modifier. Thus the syntax of an attribute declaration goes like this:
<!ATTLIST targetElement attributeName attributeDatatype attributeModifierOrDefaultValue>
As the ATTLIST keyword suggests, you can define multiple attributes within an attribute declaration. The following declaration, for example, declares all three of the resume element's attributes:
<!ATTLIST resume version CDATA #REQUIRED
  lastUpdated CDATA #IMPLIED
  field (IT | training | academia) #IMPLIED >
In declarations you can use white space to make your rules easier to read. In the example above I've broken up the three attributes that I'm defining by placing the second two on new lines for legibility.
The second line defines an attribute named lastUpdated. Its #IMPLIED attribute modifier leaves the attribute's value unspecified, with the practical implication that the attribute and its value may be either present or absent; attributes with the #IMPLIED modifier are ignored by the XML processor unless they're used as part of an element. The third line defines yet another attribute, this one named field, and uses an enumerated datatype instead of the now-familiar CDATA. An enumerated datatype contains a series of values, listed inside parentheses and separated by vertical bars indicating choice, from which only one may be used as the attribute's value in an XML document.
A number of other datatypes may be used in DTDs, including the ID datatype, which is used to provide a unique element identifier. Although discussing all the attribute datatypes is beyond the scope of this tutorial, you can find out more about them in O'Reilly's XML Pocket Reference, second edition, and in the chapter on DTDs in XML in a Nutshell. To reinforce this column's introduction to DTDs, I suggest you read Chapter 5, "Document Models: A Higher Level of Control," in Learning XML (O'Reilly). For advanced material on how to build industrial-strength DTDs, see David Megginson's book, Structuring XML Documents (Prentice Hall).
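As a rough illustration of how these attribute declarations play out in a document, the sketch below repeats the three attributes from the column and adds a fourth, status, which is purely hypothetical and is there only to show the default-value form mentioned above. The attribute values in the instance are placeholders.

```xml
<?xml version="1.0"?>
<!DOCTYPE resume [
  <!ELEMENT resume ANY>
  <!-- status is an invented attribute showing a default value ("draft") -->
  <!ATTLIST resume
    version     CDATA                      #REQUIRED
    lastUpdated CDATA                      #IMPLIED
    field       (IT | training | academia) #IMPLIED
    status      (draft | final)            "draft">
]>
<resume version="1.0" field="IT">Content goes here.</resume>
```

In this instance version is present as required, lastUpdated is simply omitted, field takes one of its enumerated values, and a parser that reads the DTD should report status as "draft" even though the document never sets it.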
Entities
Entities come in a number of flavors and have a variety of uses. I'll explain how to work with three of them:
general entities
external parsed general entities
parameter entities
For instance, if you've been using &nbsp; to insert a nonbreaking space into your HTML documents, you can do the same in your XML documents by declaring it as an entity in your DTD, with its replacement text being the decimal character reference for a nonbreaking space - the number 160 preceded by an ampersand and a number sign and followed by a semicolon:
<!ENTITY nbsp "&#160;">
This example reveals the syntax for general entity declarations:
<!ENTITY entityName "Replacement text">
When you use an entity in your XML document, remember to prefix it with an ampersand and to suffix it with a semicolon:
&entityName;
The power of the general entity makes itself apparent any time you need to reuse the same bit of content, whether text or code - especially if it's likely to change. By using a general entity for the content, you can change it in one place, the entity declaration in the DTD, and have that change take effect every place you've used the entity in your XML documents.
Say you're writing a software manual for a new product, and the marketing department, fickle folks that they are, has repeatedly changed the name of the product and, you believe, will probably continue to do so right up to the day the product is launched. To avoid changing the product name by hand wherever it appears in your documentation every time the marketing department changes its mind, simply declare an entity called productName (or whatever) and use it in your documentation each time you need to write out the product's name.
The same principle applies to code: general entities may contain XML markup. The markup, however, must be well formed. The ability to include markup in general entities brings us to another type of entity: external parsed general entities. They allow you to separate a chunk of content, whether code or text, into a file that is external to both the XML document where the entity is referenced and the DTD where the entity is declared.
External parsed general entities can be particularly useful in working with XHTML to create XML files that will be published on the Internet. For example, if I have some standard XHTML code that I want to use as the header on every résumé that I plan to publish on my XML-driven Web site, I can create the XHTML code for it in a separate file, header.xml. In the DTD I declare the external parsed general entity as follows, depending on where the resource is located:
<!ENTITY header SYSTEM "header.xml">
or
<!ENTITY header SYSTEM "http://www.criticism.com/code/header.xml">
Instead of the replacement text found in a general entity declaration, the external parsed entity declaration uses the SYSTEM keyword followed by the path to and name of the file, which allows the XML parser to locate the resource. It can be either a file on the local system, as in the first example, or a resource on the Internet, as in the second example.
There are a few nuances to using external parsed general entities that I haven't discussed; for more information I suggest you read the chapter on DTDs in XML in a Nutshell, in which the authors discuss a similar example in greater detail.
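The entity declarations discussed so far can be sketched together in one small document. The replacement text "WidgetWriter" and the sentence in the body are made-up placeholders, and header.xml is assumed to exist next to the document and to contain well-formed markup.

```xml
<?xml version="1.0"?>
<!DOCTYPE resume [
  <!ELEMENT resume ANY>
  <!-- Placeholder product name: change it here and every reference follows -->
  <!ENTITY productName "WidgetWriter">
  <!-- Nonbreaking space, declared through a decimal character reference -->
  <!ENTITY nbsp "&#160;">
  <!-- External parsed general entity; header.xml is assumed to exist -->
  <!ENTITY header SYSTEM "header.xml">
]>
<resume>
  &header;
  Technical writer for &productName;, 1999&nbsp;to&nbsp;2001.
</resume>
```

This sketch is only concerned with how the references expand. For the document to validate as well, any elements that appear inside header.xml would also need declarations in the DTD, and a nonvalidating parser is not obliged to fetch external entities at all.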
Parameter Entities
Let's return for a moment to the subject of element declarations in DTDs. Recall that in the DTD I'm constructing for résumés I have a paragraph declaration that includes some low-level mixed content:
<!ELEMENT paragraph (#PCDATA | cite | emphasis | br)*>
Now suppose I want to reuse these child elements for a number of other elements, such as <note>. Instead of writing out the rule each time for the other elements, I can declare a parameter entity containing the elements I want to group together for reuse:
<!ENTITY % inline "cite | emphasis | br">
Parameter entity declarations contain a percent sign (%) before the name of the entity. The entity's content is then listed inside quotation marks. To reuse the cite, emphasis, and br elements within the note element, I insert a parameter entity reference in the rule portion of the declaration for the note element:
<!ELEMENT note (#PCDATA | %inline; )*>
The rule contains the parameter entity reference, which is prefixed with a percent sign instead of the ampersand used for general entities. Because the rule includes PCDATA, it must conform to the rules for mixed content discussed above.
Parameter entities, like general entities, provide a way to create a single source of content that can be reused repeatedly, reducing errors, saving keystrokes, and easing maintenance. If, for instance, I decide after creating my DTD that I also need an inline horizontal rule element called hr to be available at the same level as the line break element (br), I can simply insert it once - in the parameter entity - and instantly make it available everywhere in the DTD that I've used the %inline; content model.
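The hr scenario just described might look something like the following sketch. It assumes the declarations live in an external DTD file (the name resume.dtd is hypothetical), since parameter entity references inside declarations are not allowed in an internal DTD subset.

```xml
<!-- Assumed contents of an external DTD file, e.g. a hypothetical resume.dtd -->

<!-- hr is added once, here, and reaches every rule that uses %inline; -->
<!ENTITY % inline "cite | emphasis | br | hr">

<!ELEMENT paragraph (#PCDATA | %inline;)*>
<!ELEMENT note      (#PCDATA | %inline;)*>

<!ELEMENT cite     (#PCDATA)>
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT br EMPTY>
<!ELEMENT hr EMPTY>
```

Note that the parameter entity is declared before the rules that reference it, and that the new hr element still needs its own declaration.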
Completing the DTD
A DTD placed inside an XML document is called an internal DTD subset. Placing the DTD in an XML document is especially useful during the early stages of modeling the structure of your documents and making your first pass at creating a DTD for them. Note, however, that in internal DTD subsets, parameter entity references can only be used outside declarations; as such, I haven't included any parameter entity references in this DTD. Finally, notice that the internal DTD subset is placed in the <!DOCTYPE declaration after the name of the root element (resume) and enclosed with square brackets ([ and ]) (see Listing 1).
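The following is not the column's Listing 1 but a minimal sketch of how the declarations discussed in this column could be assembled into an internal DTD subset. Several rules are assumptions made so the example is complete: the content models for contactInfo, education, position, company, and task are invented, the name rule is a simplified variant of the one shown earlier, and all data in the instance is placeholder text.

```xml
<?xml version="1.0"?>
<!DOCTYPE resume [
  <!ELEMENT resume (name, contactInfo, experience, education)>
  <!ATTLIST resume
    version     CDATA                      #REQUIRED
    lastUpdated CDATA                      #IMPLIED
    field       (IT | training | academia) #IMPLIED>

  <!-- Simplified variant of the column's name rule -->
  <!ELEMENT name (firstName, (middleName | middleInitial | nickName)?, lastName)>
  <!ELEMENT firstName (#PCDATA)>
  <!ELEMENT middleName (#PCDATA)>
  <!ELEMENT middleInitial (#PCDATA)>
  <!ELEMENT nickName (#PCDATA)>
  <!ELEMENT lastName (#PCDATA)>

  <!-- Assumed content models: the column does not define these elements -->
  <!ELEMENT contactInfo (#PCDATA)>
  <!ELEMENT education (#PCDATA)>
  <!ELEMENT position (#PCDATA)>
  <!ELEMENT company (#PCDATA)>
  <!ELEMENT task (paragraph+)>

  <!ELEMENT experience (position, company, location?, task+, note*)>
  <!ELEMENT location ((street, suite)?, city, (state, zip)?)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT suite (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>

  <!-- Mixed content written out in full: no parameter entity references -->
  <!ELEMENT paragraph (#PCDATA | cite | emphasis | br)*>
  <!ELEMENT note (#PCDATA | cite | emphasis | br)*>
  <!ELEMENT cite (#PCDATA)>
  <!ELEMENT emphasis (#PCDATA)>
  <!ELEMENT br EMPTY>

  <!-- Placeholder general entity -->
  <!ENTITY productName "WidgetWriter">
]>
<resume version="1.0" field="IT">
  <name><firstName>Pat</firstName><lastName>Example</lastName></name>
  <contactInfo>pat@example.com</contactInfo>
  <experience>
    <position>Technical Writer</position>
    <company>Example Corp.</company>
    <location><city>Springfield</city></location>
    <task><paragraph>Documented &productName; for release.</paragraph></task>
  </experience>
  <education>B.A., English</education>
</resume>
```

Because the mixed-content rule is repeated for paragraph and note rather than shared through a parameter entity, moving this DTD to an external file later would be a natural point to introduce %inline;.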