Friday, February 29, 2008

Good advice for creating XML

The use of XML has become widespread, but much of it is not well formed. When it is well formed, it's often of poor design, which makes processing and maintenance very difficult. And much of the infrastructure for serving XML can compound these problems. In response, there has been some public discussion of XML best practices, such as Henri Sivonen's document, "HOWTO Avoid Being Called a Bozo When Producing XML." Uche Ogbuji frequently discusses XML best practices on IBM developerWorks, and in this column, he gives you his opinion about the main points discussed in such articles.

I have been discussing XML best practices in this column and in other series for years. Others, such as fellow columnist Elliotte Rusty Harold, have covered it as well. The more XML experts that join the discussion of XML design principles, the better, so the community can converge on solid advice for developers at all levels of XML adoption. In this article, using a recent document and a classic one, you learn more details about XML best practices.

Enter the no bozo zone

Henri Sivonen wrote a useful article, "HOWTO Avoid Being Called a Bozo When Producing XML" (see Resources). Adopting the perspective of XML-based Web feed formats, such as RSS and Atom, he goes over his Dos and Don'ts for producing well-formed XML with namespaces. As he says in his introduction:

There seem to be developers who think that well-formedness is awfully hard -- if not impossible -- to get right when producing XML programmatically and developers who can get it right and wonder why the others are so incompetent. I assume no one wants to appear incompetent or to be called names. Therefore, I hope the following list of Dos and Don'ts helps developers to move from the first group to the latter.

The first bit of advice Henri gives is, "Don't think of XML as a text format." I think this is dangerous advice. Certainly his main point is valid -- you cannot be as careless in producing or editing XML as you would a simple text document, but this applies to all text formats with any structure. However, saying that XML is not text is denying one of the most important characteristics of XML, one that is enshrined in the very definition of XML in the specification. ("A textual object is a well-formed XML document [if it conforms to this specification.]") Henri's statement is also confusing because there is a technical definition of text in XML that is essentially the sequence of characters interpreted as XML. Text is not merely what goes within leaf elements or within attributes -- technically called character data. Text is the fundamental fabric of all XML entities, so to say that XML is not text is a contradiction. I think it's more useful to highlight the specific ways in which XML differs from text formats with which developers might already be familiar.

This comment is an example of how Henri's advice is colored by his interest in the problem of generating well-formed Web feeds. He is right to warn people that carelessly slapping strings together and hoping they are well formed is a dangerous course. I too have written articles advising people to use mature XML toolkits rather than simple text tools when generating XML (see Resources). My concern is that the way in which Henri couches this advice is a bit confusing and could be misconstrued in the broader context of XML processing. He reiterates his advice in the sections, "Don't use text-based templates" and "Don't print". I think this should be summarized as: "Do not use mechanisms that you're not sure will result in well-formed XML." That's very important advice indeed. One approach to safe XML generation is sending SAX events, as Henri suggests in, "Use a tree or a stack (or an XML parser)." If you do so, however, do not assume you are home free. The SAX tools you use might not do all the necessary well-formedness checking. For example, some Unicode characters are not allowed in XML. You may need an additional level of checking to account for such issues.

Henri rightly suggests that users not try to manage namespaces by hand. As I've discussed on developerWorks, XML namespaces require a great deal of care. His suggestion that developers only think in terms of universal name [namespace Uniform Resource Identifier (URI) plus local name] is generally sound, but sometimes a developer cannot avoid dealing with prefixes or XML declarations. In specifications, such as XSLT, a QName (prefix/local name combination) can be used within attribute values, and the prefix is supposed to be interpreted according to in-scope namespace declarations. This kind of pattern is called a QName in context. In this case, the developer must have control over the declared prefix or the resulting XML processing will fail. When developers do manage their own namespace declarations, the result is often messy because of the complexities of XML namespaces.

One way to clean up namespace syntax that might become messy while passing through a pipeline of XML processing is to insert a canonicalization step to the end of the pipeline. XML canonicalization eliminates the syntactic variations permitted by XML 1.0 and XML namespaces, including different namespace declaration patterns. Canonicalization will not eliminate all the issues that make namespace declarations treacherous to developers. Canonicalization does not help with QNames in context problems since it does not change the prefixes used in a document, but it does reduce the mess of namespace declarations to the point where you can easily spot problems or even write code to automatically fix them. The GenX library, which is one of the XML generation options Henri suggests, automatically generates canonical XML, and many other toolkits provide canonicalization as an option.

Henri's advice about Unicode and character handling is almost completely sound. However, in "Avoid adding pretty-printing white space in character data," I think the case is a bit overstated. Pretty-printing XML is safe in most cases between elements, rather than within elements with character data. As Henri says, if you have the XML in Listing 1, it is usually not safe to render it as in Listing 2.

Listing 1. XML sample

bar



Listing 2. XML sample with white space added to character data


bar



But it is usually safe to pretty-print the XML in Listing 3, so that the output is as in Listing 4.

Listing 3. Another XML sample

bar



Listing 4. XML sample in Listing 3 with white space added to character data


bar



Many XML serializer tools understand this distinction between relatively safe and relatively unsafe pretty-printing. It is important to understand that the form of pretty-printing shown in Listings 3 and 4 can cause distortion if white space is added to mixed content. Such problems can be avoided if the serialization is guided by a schema. In practice, though, most vocabularies that use mixed content are not so sensitive to white space normalization, so don't worry too much about pretty-printing. You should be knowledgeable of the issues, and be sure there is an option to turn pretty-printing off (preferably the default should be to not pretty-print). Henri recommends a pretty-printing practice as in Listing 5, but I disagree because I think it makes for ugly markup that's not friendly to manipulation by people.

Listing 5. Pretty-printing convention suggested by Henri Sivonen but not recommended by this author

>bar>


From the monastery

Switching to a very different speed, the second resource I shall explore in this article is Simon St. Laurent's "Monastic XML" (see Resources). This is a collection of brief essays with advice on how to process and even think about XML for maximum effect. Simon uses the metaphors of monasticism and asceticism to suggest that it is dangerous to load XML too heavily with baggage that does not suit its simple, textual roots. In "Marking-up at the foundation," he discusses the fundamental roles of character data and markup (elements and attributes). In "Naming things and reading names," he explains why the generic identifier (also called the element type name) is an important concept and how it should be the sole primary key to the structure of the marked-up information. Realistically, if you're using XML namespaces, the primary key is the universal name (namespace URI plus local name), and this complication is one of the reasons Simon urges caution in "Namespaces as opportunity." "Accepting the discipline of trees" calls out one of XML's dirty secrets: Even though it seems that XML's hierarchical structure could be easily extended to graph structure, in practice, the modeling of graphs in XML has proven a bit difficult. But by far the most important lesson on the "Monastic XML" site is found in "Optimizing markup for processing is always premature." XML is a declarative technology, and therein lies its strengths, as well as its frustrations, for many developers. Developers who try to pull XML design too close to the details of processing generally end up making that processing more difficult in the long term. The key to success with XML is to focus on the nature of the information that needs to be represented in the abstract separately from the technical design of the systems that need to process that information.

No comments: