Binary Data Formats? Just Say NO!
By Sean McGrath
What does the phrase "binary data" or "binary format" mean? Call me a pedant, but I was under the impression that all digital data is binary data. Binary means "expressed in ones and zeros", right? Given that, does it not follow that all digital data, without exception, is binary data?
The truth is, the phrase "binary data" is a misnomer. It is used as the opposite of the phrase "plain ASCII". In other words, data that cannot be grokked with a simple text editor. As XML takes hold to become, to use Tim Bray's characterization, "the new ASCII", binary data will be re-interpreted to mean the opposite of "plain XML". It is still a misnomer of course because all XML is made up of binary data but we are clearly stuck with the phrase at this stage.
Why would an application designer use a binary format to store data? The single most common rationale for not using plain old XML goes something like this:
"The application needs to be fast and efficient, therefore it stores data in a binary format rather than XML."
This is often both a non sequitur and a ruse. Let's start with the non sequitur.
The PC I am writing this article on is really fast. It has a 2000MHz processor that spends most of its time doing precisely nothing. Raw CPU power is simply not a scarce commodity. I'm not saying that I have enough power and that my applications go fast enough. That is never the case. However, the bottlenecks that slow my applications down these days are not related to lack of CPU power.
As for efficiency, yes, we all care about how much disk/bandwidth our data uses. However, we live in a world in which compression algorithms such as the ubiquitous ZIP format have been commoditized. Given that, the efficiency arguments against native XML storage can be dealt with by simply zipping XML to/from disk. In other words, we can get the plain-text benefits of native XML data formats and still have disk/bandwidth efficiency at the same time.
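To make the "zip it on the way to disk" idea concrete, here is a minimal sketch using Python's standard zipfile module. The function names, the member name data.xml, and the sample log markup are all made up for illustration, and the archive is built in memory rather than on disk to keep the example self-contained.

```python
import io
import zipfile

def save_xml_zipped(xml_text, member_name="data.xml"):
    """Return the bytes of a ZIP archive holding one XML member (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, xml_text.encode("utf-8"))
    return buffer.getvalue()

def load_xml_zipped(blob, member_name="data.xml"):
    """Read the XML member back out of the archive as text (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return archive.read(member_name).decode("utf-8")

# Invented sample data: repetitive markup, which is typical of XML.
xml_text = "<log>" + "<entry level='info'>started</entry>" * 1000 + "</log>"

blob = save_xml_zipped(xml_text)
restored = load_xml_zipped(blob)

print(len(xml_text.encode("utf-8")), "bytes as plain XML")
print(len(blob), "bytes zipped")
```

How much space is saved depends entirely on the data. Markup this repetitive compresses very well; real documents will vary.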
Case in point: OpenOffice. Create a document with its word processor and save it to disk. The file saved to disk is actually a simple zip file. Open it with any zip reader. You will see some XML files. There is an XML file for the text of the document; an XML file for the style information; an XML file for the metadata and so on. The result? Efficient storage with plain text XML encoding of the information. Beautiful!
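"Any zip reader" includes a few lines of Python. The sketch below lists the XML members of such an archive. So that it runs without an OpenOffice file to hand, it builds a tiny stand-in archive in memory; the member names content.xml, styles.xml and meta.xml and the toy markup inside them are assumptions for illustration, not a faithful copy of what OpenOffice writes.

```python
import io
import zipfile

def build_sample_document():
    """Build a stand-in archive in memory (member names and markup are assumed)."""
    parts = {
        "content.xml": "<document><p>Hello, world.</p></document>",
        "styles.xml": "<styles><style name='Standard'/></styles>",
        "meta.xml": "<meta><title>Sample</title></meta>",
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, xml in parts.items():
            archive.writestr(name, xml)
    return buffer.getvalue()

def list_xml_parts(document):
    """Return the names of the XML members stored in the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(document)) as archive:
        return [name for name in archive.namelist() if name.endswith(".xml")]

parts = list_xml_parts(build_sample_document())
print(parts)
```

To try this on a real saved document, read the file's bytes and pass them to list_xml_parts in place of the stand-in.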
Now let's turn to the ruse that often underlies the rationale for binary data formats. Here is my definition of "binary data": data owned by somebody else. Namely, the entity that created the program that reads/writes the data in a form I don't understand and cannot read.
It's a sobering thought. Who really owns the data on your disk that is stored in application-specific binary format? You or the application that created it? With OpenOffice, I can dip into my data with commodity software. I can write single page Python applications that do useful things to my OpenOffice XML data without having to learn any application APIs or buy any software. I feel a higher degree of ownership and control over the data I wield with OpenOffice. I like ownership and control. Do I have ownership and control of the native binary formats on my machine? In a word, no.
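As a rough illustration of the kind of small Python program meant here, the sketch below pulls the running text out of a zipped XML document and counts its words, using nothing beyond the standard library. The stand-in document, its toy markup, the member name content.xml and the helper names are assumptions made so the example runs on its own; a real OpenOffice file has richer, namespaced markup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def make_stand_in_document():
    """Create a tiny zipped XML document in memory (toy markup, assumed names)."""
    content = "<document><p>Binary data?</p><p>Just say NO!</p></document>"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", content)
    return buffer.getvalue()

def extract_text(document, member="content.xml"):
    """Collect all text in the XML member, ignoring the markup around it."""
    with zipfile.ZipFile(io.BytesIO(document)) as archive:
        root = ET.fromstring(archive.read(member))
    # Join the text fragments with spaces, then normalise runs of whitespace.
    # Simplification: inline markup inside a paragraph would add stray spaces.
    return " ".join(" ".join(root.itertext()).split())

text = extract_text(make_stand_in_document())
print(text)
print(len(text.split()), "words")
```

Because the function ignores element names entirely, it needs no knowledge of the application's schema or API, which is the point being made above.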
The OpenOffice approach combines openness with efficiency in a beautiful way. I hope that it starts a trend. A trend that will hopefully start the process of exposing the binary data ruse for what it is: a way of ensuring that your data is locked into the application that created it.
In a world with commoditized XML and compression mechanisms, is there any justification for binary, proprietary data formats? Can we look forward to a day in which the only binary files on our disks are compressed archives of XML or throwaway compiled files created as required from the native XML? I don't see why not.
Binary data? Just say NO!
Sean is co-founder and Chief Technology Officer of Propylon and is an industry-recognised XML expert.