Ampersand Attrition in XML and HTML
By Sean McGrath
Have you ever played the "spot the development platform" game? In my version of it, points are awarded to players who correctly guess what programming language an application is written in, simply by looking at the user interface of the application.
Many telltale signs can be spotted, ranging from the shape of the hover-text that appears on buttons through to the general pattern of URLs generated in HTTP GET requests. Visual C++ thick client binaries, Vignette, and JBoss all have pretty distinctive attributes that can be perceived on close inspection of a running application's GUI.
With XML and HTML, a more challenging game is possible, namely, "diagnose the problems with ampersand characters". Note that this game is about diagnosis, not detection. Detecting problems with ampersand characters in XML/HTML applications yields no prizes because ampersands in XML/HTML applications *always* cause problems.
Just now, I searched for the string "amp;amp" with Google and received about 22,000 hits. If you are interested in this phenomenon, then I'd suggest following some of the links, viewing the source, and marveling at the number of "amps" on show. Sometimes you'll find a single amp, and other times as many as twenty!
The Root Cause of the Problem
The ampersand character has special meaning in the SGML, HTML, and XML markup languages. If you wish to use it literally, you must "escape" it. The escaped form consists of an ampersand sign (! -- more on this later), the string "amp", and a semicolon. However, literal ampersand signs can occur within an XML document without causing parsing problems in certain cases. For example, they can occur un-escaped inside CDATA sections and inside comments; they are used to introduce "entity references" for special characters such as "lt" for less than and "quot" for the double quote; they are also used to introduce so-called "character references" such as "#x0041", which is the Unicode code point for a capital A character.
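The different roles listed above can be seen side by side in a small sketch. It is only an illustration, written in Python with the standard library; the sample document is invented, and the ampersand is spelled chr(38) so that no literal one appears in this article.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

AMP = chr(38)  # the ampersand, spelled by code point rather than typed

# Escaping a literal ampersand in ordinary text.
escaped = escape("Fish " + AMP + " chips")

# One invented document that uses the ampersand in several roles.
doc = (
    "<t>"
    + AMP + "#x0041; "                     # character reference for capital A
    + AMP + "lt; "                          # entity reference for less-than
    + AMP + "amp; "                         # escaped literal ampersand
    + "<![CDATA[" + AMP + "]]>"             # un-escaped inside a CDATA section
    + "<!-- " + AMP + " in a comment -->"   # un-escaped inside a comment
    + "</t>"
)

# The parser resolves the references, unwraps the CDATA section,
# and (by default) discards the comment.
text = ET.fromstring(doc).text
print(escaped)
print(text)
```

Five ampersands go in, each meaning something different, and only a parser can tell them apart. A lexical search/replace sees five identical characters.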
The multiple uses of ampersand characters -- some special, some not -- are the cause of the trouble. Let us say you are in the process of adding markup to a document. It does not parse yet, so you are doing all your text processing lexically (i.e. by editing with a text editor or performing string processing using some sort of search/replace or regular expression capability). You know that some literal ampersands are scattered throughout the document's text, so you fire off a search/replace to escape them all.
Trouble is, if there are any ampersands in CDATA sections, in comments, or introducing entities, they are also escaped -- causing "amp;" to appear in your final output. Furthermore, any ampersands in the true text of the document that had already been escaped would then be double escaped -- thus again causing "amp;" to appear in your final output. This process may be repeated, depending on the number of steps involved in the document production workflow. Like wood-rings in a felled tree, you can get a feel for the number of seasons in a document workflow by seeing how many times erroneously escaped ampersands are escaped!
What is the essence of the problem here? Why is it that, after all these years of SGML experience, ampersand attrition rates are still so dreadful? I suspect the problem is a parallel of the problem in the Unix world known as the "two to the n minus 1 backslash problem".
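The repeated escaping described above is easy to reproduce. The following is a minimal sketch in Python, assuming a workflow in which every step blindly escapes every ampersand it sees; the function names naive_escape and escape_depth are made up for this illustration, and the ampersand is again spelled chr(38).

```python
import re

AMP = chr(38)  # the ampersand, spelled by code point rather than typed


def naive_escape(text):
    """Blindly escape every ampersand, as a lexical search/replace would."""
    return text.replace(AMP, AMP + "amp;")


def escape_depth(text):
    """Count the tree rings: the longest run of 'amp;' after one ampersand."""
    runs = re.findall(AMP + "((?:amp;)+)", text)
    return max((len(run) // len("amp;") for run in runs), default=0)


text = "Fish " + AMP + " chips"
for _ in range(3):  # three workflow steps, each escaping blindly
    text = naive_escape(text)

print(text)
print(escape_depth(text))
```

A correctly escaped ampersand has a depth of one, so anything deeper suggests that the text has been through at least that many blind escaping passes.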
In Unix, a backslash has special meaning in numerous contexts. To escape it, you add another backslash. However, if you are creating syntax for a command that will pass through a couple of backslash-sensitive layers before hitting its final target, you need to escape the backslash once per layer, and each layer doubles what came before. If there are 2 intermediate layers, you need to add 3 backslashes. For 3 layers you need 7, and so on.
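The arithmetic can be checked with a short sketch. It assumes the simplest possible model, in which every layer halves the backslashes it receives, so the sender must double them once per layer; real shells and tools have more intricate quoting rules, and the function names are made up for this illustration.

```python
BACKSLASH = chr(92)  # the backslash, spelled by code point for readability


def escape_once(text):
    """Protect every backslash from one backslash-sensitive layer."""
    return text.replace(BACKSLASH, BACKSLASH * 2)


def added_backslashes(layers):
    """Count the extra backslashes needed for one literal backslash."""
    text = BACKSLASH
    for _ in range(layers):
        text = escape_once(text)
    return len(text) - 1  # everything beyond the original backslash


for n in (1, 2, 3):
    print(n, added_backslashes(n))
```

Under this assumption the count for n layers is two to the n, minus 1, which is where the problem gets its name.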
In both ampersand escaping and backslash escaping, we see the same phenomenon. Namely, the character to be escaped is, itself, used in the escaping mechanism. In the case of ampersands, the first character in the escape sequence is *another* ampersand. In the case of backslashes, the escape sequence is *another* backslash.
The last thing you want to do if you find yourself in a hole is to keep digging. It seems to me that this is exactly what the ampersand escape mechanism does by adding more ampersands.
Now We Know the Problem...What's the Cure?
So how should this be fixed? Can it be fixed? I'm not sure, but one idea that I believe bears investigation is the use of a pre-defined empty element type in XML, called amp, in the XML namespace bound to the reserved prefix "xml:". That way, I can represent a literal ampersand in text as <xml:amp/>. In so doing, I would be able to cleanly separate literal ampersands in the text of a document from ampersands that are part of the surface syntax of the markup.
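To see why such an element might help, compare the two schemes under repeated lexical processing. This is a sketch only: the xml:amp element is the proposal made above and is not part of XML, so no parser knows about it, and the function names are made up for this illustration.

```python
AMP = chr(38)               # the ampersand, spelled by code point
AMP_ELEMENT = "<xml:amp/>"  # the proposed (hypothetical) element


def escape_with_entity(text):
    """Today's scheme: the escaped form itself starts with an ampersand."""
    return text.replace(AMP, AMP + "amp;")


def escape_with_element(text):
    """Proposed scheme: the escaped form contains no ampersand at all."""
    return text.replace(AMP, AMP_ELEMENT)


original = "Fish " + AMP + " chips"

entity_once = escape_with_entity(original)
entity_twice = escape_with_entity(entity_once)

element_once = escape_with_element(original)
element_twice = escape_with_element(element_once)

print(entity_once == entity_twice)    # a second pass changes the text again
print(element_once == element_twice)  # a second pass leaves the text alone
```

Because the proposed form contains no ampersand, a second blind pass has nothing left to damage. The sketch ignores ampersands that legitimately introduce entity or character references, which would still need to be told apart from literal ones.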
It will not have escaped your attention that this article does not contain a single literal ampersand. To do so would be to invite Murphy to mess one up. That would make this article self-referential in a way I would rather avoid. There is enough ampersand attrition in the world without this article adding to it!
Sean is co-founder and Chief Technology Officer of Propylon and is an industry-recognised XML expert.