harder to catch, but they are also harder to remedy because people who have caught these advanced strains tend to congregate with others with the same diseases and they are continually reinfecting each other.
One of our favorite teaching moments is to start an introductory XML lecture with the statement, “XML is a syntax for trees,” and to add that this is all there is to it, so no further explanation is required. Of course, there is more to it, and we manage to fill a complete course with it, but the essence of XML really is simple and small. This is elegant to us but a disappointment to many XML beginners who expect something bigger and more complicated to match all the hype they have heard. In fact, XML’s character-based format lures many XML beginners into assuming they can simply use their trusted text-processing tools, which is the inevitable path to the first XML fever:
Parsing pain. At first sight, XML’s syntax looks simple enough that plain text-processing tools should suffice for accessing XML data, so that a “desperate Perl hacker” could implement XML in a weekend. Unfortunately, not all XML documents use the same character encoding; character references must be interpreted; entities must be resolved; and so on. As soon as the output from a wider array of XML producers is considered, it becomes apparent that to parse XML robustly with text-processing tools, the tools must implement a complete XML parser. This becomes most painfully evident when XML processing needs to take XML Namespaces into account (often leading to an infection with the intermediate namespace nausea fever).
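To make the symptom concrete, here is a minimal sketch in Python that compares a regular expression with the standard library's `xml.etree.ElementTree` parser. The sample document and the variable names are our own illustration, not taken from any real producer.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: a character reference, a predefined entity,
# and a CDATA section, all of which are ordinary XML.
doc = (
    "<note>"
    "<title>Caf&#233; &amp; bar</title>"
    "<title><![CDATA[1 < 2]]></title>"
    "</note>"
)

# Text-processing approach: grab whatever sits between the tags.
naive = re.findall(r"<title>(.*?)</title>", doc)

# Parser approach: references and CDATA are resolved before we see the text.
parsed = [t.text for t in ET.fromstring(doc).iter("title")]

print(naive)   # raw markup, still containing &#233;, &amp; and the CDATA wrapper
print(parsed)  # the character data the document actually represents
```

The regular expression returns markup rather than data, and fixing each case by hand amounts to rebuilding pieces of a parser.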
After overcoming parsing pain and starting to use an XML parser, beginners usually understand what we mean when we say that XML is a syntax for trees, but they do not as quickly grasp that XML uses multiple tree models, and depending on which XML technology one is using, the “XML tree” looks slightly different. Thus, the second basic strain of XML fever is:
Tree trauma. This is caused by exposure to XML’s various tree models, such as XML itself, DOM, the Infoset, XPath 1.0, PSVI, and XDM. All of these tree models share XML’s basic idea of trees of elements, attributes, and text, but have different ways of exposing that model and handling some of the details. In fact, while XML itself explicitly states that XML processors must implement all of XML (apart from validation, the standard has no optional parts, which is a smart thing for a standard to do), some of the more recent tree models exhibit the “extended subset” nature of technologies, which can often lead to incompatibilities among implementations. For example, PSVI, the data model of an XML document validated by an XML Schema (for the rest of the article, we refer to W3C’s language as XSDL), is based on the Infoset, which is a subset of the full information of an XML document, and extends that subset with information made available by the schema and the validation process.
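As a rough illustration of how the same document can yield differently shaped trees, the following sketch parses one string with two Python standard-library APIs: `xml.dom.minidom`, which follows DOM, and `xml.etree.ElementTree`, which uses its own model and is not one of the models named above. The sample document is hypothetical.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

doc = "<p>Hello <b>big</b> world</p>"

# DOM: text is a node in its own right, a sibling of the <b> element.
dom_root = minidom.parseString(doc).documentElement
dom_children = [n.nodeName for n in dom_root.childNodes]

# ElementTree: only elements are children; text hangs off elements
# as the .text and .tail properties.
et_root = ET.fromstring(doc)
et_children = [child.tag for child in et_root]
et_text = (et_root.text, et_root[0].text, et_root[0].tail)

print(dom_children)
print(et_children, et_text)
```

Code written against one of these shapes does not carry over to the other without changes, even though both APIs parsed the same characters.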
While XML is available in a number of “tree flavors,” the W3C has settled (after a very long process) on the Infoset model as the core of many XML technologies. This means it would be technically more accurate to say that most XML technologies available today are actually Infoset technologies. XML has become one way of representing Infosets (so far the only standardized one, although the upcoming binary Infoset format EXI will offer a more compact alternative). Of course, the W3C does not want to give up the brand name of XML and still calls everything “XML-based.” As a result, XML users can easily be afflicted by a peculiar ailment:
Infoset ignorance. Instead of XPath, XSLT, and XQuery, these technologies’ proper names would be IPath, ISLT, and IQuery, because they are Infoset-based. Victims of Infoset ignorance take the W3C’s branding of everything as XML at face value and sometimes invest a lot of energy trying to build XML processing pipelines that preserve character references and other markup details. Infoset ignorance prevents its victims from seeing that this approach cannot succeed as long as they are using standards-based tools.
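A small sketch of why preserving character references is a lost cause: once a document has been parsed into a tree, the distinction between a reference and the literal character is gone. The example uses Python's `xml.etree.ElementTree`, whose data model is not formally the Infoset but discards the same markup details; the sample strings are our own.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in markup details.
with_refs = ET.fromstring("<p>caf&#233; &#x41;</p>")
literal = ET.fromstring("<p>caf\u00e9 A</p>")

# After parsing, the trees carry identical character data.
same_text = with_refs.text == literal.text

# Writing the tree back out cannot restore the original references,
# because the tree never recorded them.
round_trip = ET.tostring(with_refs, encoding="unicode")

print(same_text)
print(round_trip)
```

Any pipeline stage that works on the parsed tree sees the two inputs as the same document, so the choice of serialization is made afresh by whichever tool writes the output.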
The remedy for Infoset ignorance is to select a set of XML technologies with compatible tree models. This usually also cures tree trauma, because now XML users can focus on a specific variety of XML tree. Depending on the specific technologies chosen, though, tree trauma can metastasize into a more severe disease caused by failure to appreciate the somewhat obscure ways in which some XML technologies process trees:
Default derangement. Tree trauma can develop into default derangement if XML users are exposed to and experiment with schema languages such as DTDs and XSDL that allow default values. These languages cause XML trees to change based on validation, which means that XML processing is critically based on validation. Because it is often not feasible to quarantine XML users to keep them from these schema languages, a better prescription is to put them on a strict diet of design guidelines to avoid these potentially dangerous features.
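The effect can be observed even without full validation. In the sketch below, Python's `xml.etree.ElementTree` (built on the non-validating expat parser) reads an attribute default declared in a DTD internal subset, so the same element yields a different tree depending on whether the declaration is present. The element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Identical instance markup; only the first document declares a default.
with_dtd = (
    '<!DOCTYPE item [<!ATTLIST item currency CDATA "USD">]>'
    '<item price="5"/>'
)
without_dtd = '<item price="5"/>'

attrs_with = ET.fromstring(with_dtd).attrib
attrs_without = ET.fromstring(without_dtd).attrib

# The default shows up as if the author had written it in the element.
print(attrs_with.get("currency"))
print(attrs_without.get("currency"))
```

An application that relies on `currency` being present works only as long as every processor along the way reads the declarations, which is the dependency the design guidelines are meant to avoid.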
Among the core components of virtually all XML scenarios today are XML Namespaces. They are essential for turning XML’s local names into globally unique identifiers, but the specifics of how namespaces can be declared in documents, and the fact that namespace names are URIs that do not need to be dereferenceable, have never yet failed to confound everybody trying to start using them. A very popular XML fever thus is:
Namespace nausea. No matter how often we try to explain that XML Namespaces have no functionality beyond the simple association of local names and namespace names, many myths and assumptions surround them. For example, many students assume that namespaces must refer to existing resources and ask us how to “call the namespace in a program.” And even though namespaces should be simple, XML is often serialized by tools that do not allow much control over how namespaces are treated, creating XML documents that exhibit various kinds of correct but very confusing ways of using namespaces. A particularly nasty secondary infection caused by namespace nausea can be contracted when using a specific kind of XML vocabulary:
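The following sketch shows three such correct but confusing serializations. Parsed with Python's `xml.etree.ElementTree`, which reports names in `{namespace-name}local-name` form, all three yield the same names, and nothing is ever fetched from the namespace URI. The URI `urn:example:books` and the element names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

docs = [
    # A prefix declared once on the root.
    '<b:book xmlns:b="urn:example:books"><b:title>XML</b:title></b:book>',
    # A default namespace, no prefixes at all.
    '<book xmlns="urn:example:books"><title>XML</title></book>',
    # Different prefixes, redeclared on each element.
    '<x:book xmlns:x="urn:example:books">'
    '<y:title xmlns:y="urn:example:books">XML</y:title></x:book>',
]

# Collect the expanded names of all elements in each document.
names = [[el.tag for el in ET.fromstring(d).iter()] for d in docs]

for n in names:
    print(n)
```

Prefixes are only a serialization shorthand; the pair of namespace name and local name is what identifies an element.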
Context cataracts. If QNames (the colon-separated names combining namespace prefixes and local names) are allowed to appear as content of XML documents (such as in attribute values or element content), they make the content context dependent. This means that such XML content can be