CTO Articles

Published in IT World
August 22, 2006

Converting from square pegs to round holes

High on my list of worrying one-liners I hear in my day job is the phrase "let's just convert the data from format A to format B".

I think it is really important to distinguish between the different outcomes that a conversion from A to B can have. Those outcomes depend on the nature of A, the nature of B and your overall goals.

In the SGML days that predate the XML days, we had a classification scheme for data conversions that split the overall problem into three types: up-translation, down-translation and cross-translation. This classification is also occasionally used today in the XML world but it deserves to be better known. Why? Because people can find themselves agreeing that format A should be converted to format B yet have very different expectations about the outcome of such a conversion. This can result in nasty surprises.

Some examples will illustrate the three types of conversion. If I take a spreadsheet table in Microsoft Excel format, consisting solely of plain numbers, and save it in CSV format, this is known as a cross-translate. The reason for this is that I have not lost any important information. I can re-import the CSV file and end up back where I started.

On the other hand, if I take a spreadsheet table in Microsoft Excel format, consisting of numbers and formulas, and save it in CSV format, this is a down-translate. The reason for this is that I have lost information - namely the formulas - in the conversion. If I re-import the CSV file I will get the numbers back but not the formulas.

If I take a CSV file and attempt to produce a Microsoft Excel spreadsheet with both numbers and formulas, this is an up-translate. In order for it to work, I must somehow add the formula information even though it is not present in the source file.
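One way to make the distinction concrete is a round-trip check: export, re-import, and compare with the original. The Python below is only a rough sketch of that idea. It uses a toy in-memory stand-in for a spreadsheet rather than a real Excel file, and the names `Cell`, `to_csv`, `from_csv` and `classify_export` are hypothetical, invented for this illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    """Toy spreadsheet cell (hypothetical model, not a real Excel API)."""
    value: float
    formula: Optional[str] = None  # e.g. "=A1+B1"; None for a plain number


def to_csv(rows: List[List[Cell]]) -> str:
    """Export values only; CSV has nowhere to put a formula."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([cell.value for cell in row])
    return buf.getvalue()


def from_csv(text: str) -> List[List[Cell]]:
    """Re-import: every field comes back as a plain number."""
    reader = csv.reader(io.StringIO(text))
    return [[Cell(float(field)) for field in row] for row in reader]


def classify_export(rows: List[List[Cell]]) -> str:
    """Call the export a cross-translate if the round trip is lossless."""
    if from_csv(to_csv(rows)) == rows:
        return "cross-translate"
    return "down-translate"


# Plain numbers survive the round trip; the formula does not.
plain = [[Cell(1.0), Cell(2.0)], [Cell(3.0), Cell(4.0)]]
with_formula = [[Cell(1.0), Cell(2.0), Cell(3.0, "=A1+B1")]]

print(classify_export(plain))
print(classify_export(with_formula))
```

Note that there is no sensible way to write the reverse check for an up-translate: `from_csv` cannot produce a formula, because the information is simply not in the file. Somebody, or something, has to supply it.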

I think you will agree that this classification is straightforward and obvious. Problems arise when two parties to a conversation about conversion have differing opinions as to the type of conversion under discussion. Typically, non-technical folk think that conversions are at least cross-translates, potentially up-translates but never down-translates. Technical folk often see it differently: conversions are at best cross-translates, generally down-translates and very rarely up-translates.

This problem bites hardest in the XML world where the phrase "XML to XML conversion" abounds and sounds so straightforward. After all, if I have XML and you have XML then surely a cross-translation is the least we can aim for? Sadly, this is rarely true. Most real world conversions are from square pegs to round holes. In order to do the conversion at all, some loss of the original is inevitable.

The next time somebody suggests a simple conversion to you, I would suggest you ask whether the translation is an up-, down- or cross-translate. If, as is most likely, the conversion has aspects of down-translation, then you need to know where the loss of information will occur. If it has aspects of up-translation, then you need to know how the missing information will be added into the output files. During conversations about the latter, bear in mind that "magic" is not a software engineering technique.