IT World
XML IN PRACTICE --- 12/20/2001

XML and the Humble Paragraph Tag

By Sean McGrath

Nothing in the world of XML seems as harmless as the <p> tag: a universally accepted way of saying "here is a paragraph". A concept familiar to anyone with even a passing familiarity with the Web. It's as if the <p> tag has always been with us, a fundamental truth, part of the fabric of the universe. Discovered rather than invented. Simple, elegant, perfect....

Of course, the <p> tag is too simple for some pedants who insist on using <para>, or even <paragraph>, tags. Such pretensions! Plain country folk like me like our beer cold, our apple pie warm, and our paragraphs surrounded by good ol' <p> tags just like grandma used to make. However, peel off a layer or two and those little old <p> tags show their teeth, revealing a vista of complexity that is at the heart of the "XML for data" versus "XML for documents" debate.

There are two features that differentiate data-oriented XML from document-oriented XML. Firstly, the depth of tagging is irregular and unbounded in "documents". In data-oriented XML, the tagging is regular and bounded. All tags occur in the same order, record after record, and the same depth of tagging is used throughout. Secondly, in "documents", plain text can be intermixed at the same level as tags to create what is called "mixed content". In data-oriented XML, everything is tagged; there is no free-standing text and thus no mixed content.
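The two distinguishing features can be checked mechanically. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the sample documents and the helper names max_depth and has_mixed_content are invented for illustration and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: one record-like, one document-like.
DATA_XML = """<orders>
  <order><id>1</id><item>beer</item></order>
  <order><id>2</id><item>apple pie</item></order>
</orders>"""

DOC_XML = "<doc><p>Serve the beer <em>cold</em> and the pie <em>warm</em>.</p></doc>"


def max_depth(elem, depth=1):
    """Return the deepest nesting level at or below elem."""
    return max([depth] + [max_depth(child, depth + 1) for child in elem])


def has_mixed_content(elem):
    """True if any element holds both child elements and non-blank text."""
    for node in elem.iter():
        children = list(node)
        if not children:
            continue
        # Text before the first child lives in node.text; text after each
        # child lives in that child's .tail attribute.
        pieces = [node.text] + [child.tail for child in children]
        if any(piece and piece.strip() for piece in pieces):
            return True
    return False


data_root = ET.fromstring(DATA_XML)
doc_root = ET.fromstring(DOC_XML)

data_is_mixed = has_mixed_content(data_root)  # only white space between tags
doc_is_mixed = has_mixed_content(doc_root)    # real text sits beside <em>
```

In the record-like sample, the only text found between sibling tags is indentation, so it is not counted as mixed content. In the document-like sample, the words inside <p> sit alongside the <em> elements, which is exactly the mixed content described above.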

In both cases, the <p> tag is center stage. In fact, if you see a <p> tag in an XML document, then you can infer a lot about the type of issues you are likely to face. If you find <para> or <paragraph> tags, then you know that the issues are the same but you are also dealing with a pedant.

  • If <p> tags occur in an XML document, you can be pretty sure that it will not be possible to treat the data as a collection of records. In particular, it may prove impossible to get the contents of any particular tag easily. Upside down, event-driven programming typically results.
  • If <p> tags occur in an XML document, you can be pretty sure that white space is significant in some places. In other words, it will not be possible to simply strip any white space surrounding tags without potentially damaging the content.
  • If <p> tags occur in an XML document, you can be pretty sure that typography issues will be troublesome. The "paragraph" is the fundamental block of text that rendering engines flow and present. However, the print world has long worked on the basis of "margins", often measured in tiny fractions of an inch, to specify the location of paragraphs.
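The first two points can be seen in a small sketch. It assumes Python's standard xml.sax module; the sample document, the ParagraphCollector class, and the paragraph_text helper are made up for illustration. The handler has to reassemble each paragraph from a stream of events, and a record-style cleanup that strips white space between tags quietly changes the text.

```python
import re
import xml.sax

# Hypothetical sample: the single space between </b> and <i> is content.
SOURCE = "<doc><p>Use <b>bold</b> <i>italic</i> text sparingly.</p></doc>"


class ParagraphCollector(xml.sax.ContentHandler):
    """Accumulate the full text of each <p>, including text in child tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0      # greater than zero while inside a <p>
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "p":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def characters(self, content):
        # Text arrives in pieces, interleaved with start/end tag events.
        if self._depth:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "p":
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._buffer))


def paragraph_text(xml_text):
    """Parse xml_text and return the text of each outermost <p>."""
    handler = ParagraphCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paragraphs


# A "data-style" cleanup that removes white space between adjacent tags.
stripped = re.sub(r">\s+<", "><", SOURCE)

original_text = paragraph_text(SOURCE)
damaged_text = paragraph_text(stripped)
```

Here original_text holds "Use bold italic text sparingly." while damaged_text holds "Use bolditalic text sparingly.": the stripped space was part of the sentence, not formatting.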

On the Web, where sub-millimeter control over paragraph layout is neither practical nor desirable, an alternative paragraph-positioning model is needed. The answer, to date, has involved the aid of the single most abused element in the HTML tag bag -- the table. Much to the chagrin of typographers and XML data modelers alike, the border-less table has replaced pretty much every other geometry model for laying out paragraphs of text.

CSS2 has made it possible to exert fine control over paragraph positioning using pre-Web methods such as left indent, negative first line indent, and so on. However, until the likes of CSS2 become standard in all browsers, we are likely to see table trickery remain.

So, in summary, the <p> tag is not so simple. Its presence or absence tells you a lot about the type of XML you are dealing with, not to mention the world-view of whoever created it. If you work purely with data-oriented XML, you may never come across one. If you work with document-oriented XML, <p> tags will be a source of constant trouble and complexity, but also endless fascination for us easily amused doc-heads!


Sean is co-founder and Chief Technology Officer of Propylon and is an industry-recognised XML expert.