Duck modelling in commercial IT systems
By Sean Mc Grath
Some time ago, I wrote an article entitled 'How to model a bishop'. The thrust of the article is the obvious-yet-profound fact that people model information from their own point of view. That 'point of view' is a smorgasbord of influences that include training, culture, language, hubris, sloth and so on. This article approaches the same problem of information modelling from another perspective and ends, I hope, with an equally obvious-yet-profound fact.
Let's get started.
One of the great mysteries of human intelligence is the fact that we can somehow synthesize wildly different models of information into crisp sounding concepts such as an 'invoice', or 'the letter A'.
Flick through a good book of fonts and wildly different variations on a theme known as 'the letter A' will hit your retinas. Yet somehow, our brains have no difficulty coalescing these diverse images into a single concept - the letter A - that we recognize with astonishing speed and reliability.
Faced with one hundred different invoice formats, from one hundred different suppliers, our brains (nobody knows how this works) know they are all basically the same thing. Variations on a theme we call 'invoice'.
Douglas Hofstadter has written an engaging analysis of 'the letter A' problem which I would exhort anybody interested in the problem to read. Here, we will concentrate on the more commercially interesting problem of recognizing an invoice when we see one.
What is an invoice? Don't all shout together now! Write down, on a mental notepad, the concepts you believe need to be present for something to qualify as an invoice. Let's go through it. Does your model have a sender? It does? Great. What did you call the invoice sender in your model? What is it comprised of? Is it the name of a person, a process or a thing? All of the above? None of the above? Some of the above? Let's look at 'person' for a moment. What is a person from the point of view of an invoice? A real person or a business role? Does it have a name? How is the name structured? Does it have an address? A title? What is a 'given name' anyway? How is the address structured? Is it a place, a geo-code or a business process name? Is a zip code mandatory or optional? What about city? What is a city anyway?
And so it goes. Down and down we go into the subatomic world of invoice particle physics. The closer we look, the more we find that the model refuses to bottom out, refuses to condense into something solid that we can model rigorously. And yet, paradoxically, our brains can recognize an invoice as an invoice in a split second.
All around the world, as I write this, developers are struggling to build models of invoices and other "simple" business documents into IT systems. All around the world, multiple efforts new and old continue to attempt to zoom in on a definitive model of what it is to be an invoice. Also all around the world, retired developers in Zimmer frames and comfy shoes remember the good old days when they too chased such modelling rainbows.
The classical approach to data modelling - as enshrined in techniques such as data dictionaries, object models and XML schemas - is to model the data rigorously from the top down. Every thing in the model has a name. Each thing is either a simple lump of data or a complex thing. Complex things themselves have names and models. And so it goes.
The latest silver bullet of the classical approach - XML schemas - illustrates the genre very well. You start at the top concept "invoice". You break it down into its component parts known as "elements", say, "header" and "body". You break these down further into more elements. For example, a "header" has a sender element, a receiver element, a date element. A sender element is comprised of...and so it goes.
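To make the top-down habit concrete, here is a minimal sketch of the same breakdown written as Python dataclasses rather than as an XML schema. Every class and field name (Party, Header, Line, Body, Invoice and so on) is my own illustrative assumption, not part of any real invoice standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Party:
    # A "complex thing" made of simple lumps of data.
    name: str
    address: str


@dataclass
class Header:
    sender: Party
    receiver: Party
    issued: date


@dataclass
class Line:
    description: str
    amount: float


@dataclass
class Body:
    lines: List[Line]


@dataclass
class Invoice:
    # The top concept, broken down into its named parts.
    header: Header
    body: Body


# Hypothetical sample data, for illustration only.
invoice = Invoice(
    header=Header(
        sender=Party("Acme Ltd", "1 Main Street"),
        receiver=Party("Widget Co", "2 High Street"),
        issued=date(2004, 1, 15),
    ),
    body=Body(lines=[Line("Widgets", 100.0)]),
)
```

Notice that every part must be present, and in its assigned place, before the thing can be built at all. Leave out the address and Party refuses to exist.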
The trouble is, this modelling exercise never ends. The essence of an invoice refuses to be modelled. Every model, to paraphrase Oscar Wilde, becomes a work of art that is never finished, simply abandoned. If this were not the case then surely we would have a definitive invoice model by now? How come our planet is so chockablock with mutually incompatible, application-specific models of invoices? How come new ones appear every second day?
I have a suggestion that explains the situation. It is radical sounding at first but please bear with me.
There is *no such thing as an invoice* in the classical data modelling sense.
What does your brain do when it sees an invoice? It scans the page picking up clues as to what is on the page. Something that looks like a sender's details in one place, something that looks like a receiver's details in another. A bunch of what look like products/services with amounts in another. Some sort of total near the bottom. Perhaps some terms and conditions on the back. The more of these subsidiary structures you recognize, the higher your confidence level that you are dealing with an invoice.
The important thing is that you never actually decide that it is definitely an invoice. Rather, you decide that it is more like an invoice than any other document type you know. Invoice-ness is statistical, fuzzy, fluid. It is not solid, not deterministic.
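A rough sketch of that recognition process in Python might look like the following. The clue names, the weights and the two document profiles are all assumptions I have made up for illustration. Nobody knows what weights the brain really uses.

```python
# Hypothetical clue weights per document type, chosen only for illustration.
PROFILES = {
    "invoice": {
        "sender": 0.25,
        "receiver": 0.25,
        "line_items": 0.25,
        "total": 0.15,
        "terms": 0.10,
    },
    "letter": {
        "sender": 0.4,
        "receiver": 0.4,
        "salutation": 0.2,
    },
}


def score(found, clues):
    """Sum the weights of the expected clues that were actually found."""
    return sum(weight for clue, weight in clues.items() if clue in found)


def classify(found, profiles=PROFILES):
    """Return the best-matching document type and its confidence score."""
    scores = {doc_type: score(found, clues) for doc_type, clues in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Clues spotted on a page: no terms and conditions, but plenty else.
found = {"sender", "receiver", "line_items", "total"}
best, confidence = classify(found)
```

Here the page comes out as more invoice-like than letter-like, even though one expected clue is missing. At no point does the code decide that the page definitely is an invoice. It only ranks the alternatives.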
There seems to me to be a fundamental mismatch here between the classical software model of an invoice and the reality of real world invoices. A mismatch that goes to the heart of our attempts to model data in machine readable form. A mismatch that is reminiscent of the classical 'billiard ball' model of particle physics versus the quantum model.
This mismatch is not of mere intellectual interest. It runs to the heart of the difference between a truly flexible computer system and one that just pretends to be flexible. True flexibility comes from the ability to adapt to changing business needs. Invariably, this involves the ability to adapt to changing data models. Systems designed with the classical data model approach are highly resistant to change. These systems forge rock hard data models based on rigid top down analysis of the elements present in, say, invoices. Then they bury these rigid models deep into the core of the system by compiling programs in Java or C# or whatever that are intimately tied to these models.
Worrying, isn't it? The standard approach to data modelling would seem to be antithetical to flexibility. The answer to this conundrum is not yet on the horizon. Indeed, recognition of the problem is not yet widespread in the industry in my experience. We hear lots of nice words like 'flexibility' and 'loose coupling' but they tend to have vague, abstract definitions.
Amongst those who do recognize the problem, an amusing and very useful phrase can often be heard - 'duck typing'. So called after the way we humans recognize ducks. Namely, if it walks like a duck and quacks like a duck, it *is* a duck.
Think back to how you recognized that invoice as an invoice. You found a bunch of attributes which you associate with invoices. You found enough of them in your analysis of the piece of paper to conclude that it was, statistically speaking, more likely than not to be an invoice. If it walks like a duck...
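For readers who have not met duck typing in code, here is a minimal Python sketch. The two classes and the attribute names sender and total are hypothetical, invented for this example. The point is that the functions never ask what type of thing they have been handed, only whether it has the attributes they need.

```python
class PaperInvoice:
    def __init__(self, sender, total):
        self.sender = sender
        self.total = total


class EmailedBill:
    # Unrelated to PaperInvoice: no shared base class, one extra attribute.
    def __init__(self, sender, total, subject):
        self.sender = sender
        self.total = total
        self.subject = subject


def looks_like_invoice(doc):
    """True if the object carries the attributes we associate with invoices."""
    return all(hasattr(doc, name) for name in ("sender", "total"))


def summarize(doc):
    # No isinstance() check: anything with .sender and .total will do.
    return f"{doc.sender} is owed {doc.total}"


documents = [PaperInvoice("Acme Ltd", 100), EmailedBill("Widget Co", 250, "Your bill")]
summaries = [summarize(doc) for doc in documents if looks_like_invoice(doc)]
```

Both objects pass through the same code although neither was declared to be an invoice anywhere.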
This obvious-yet-profound idea is the route to our salvation. In the XML world, which seems set to be a hotbed of work in this area, there is an increasing recognition that so-called 'grammar-based' approaches to data modelling have weaknesses as well as strengths. Alternative approaches that are closer to duck typing in spirit, such as Schematron, are growing in popularity because of the business benefits that accrue from the extra flexibility they provide. At the same time, so-called 'dynamically typed' programming languages such as Python/Jython are becoming increasingly popular for XML processing. Again because of the flexibility that their duck typing provides. I hope that long before I take delivery of my own Zimmer frame, dynamic typing will be the standard way to model data. I can see no other way to keep up with the incessant demands for flexibility required of commercial IT systems.
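To give a flavour of the rule-based alternative, here is a small Python sketch using the standard library's xml.etree.ElementTree. It is not Schematron, merely something in the same spirit: a list of assertions about what should be findable somewhere in the document, with no grammar dictating where. The rules and the sample documents are assumptions of mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: (path to look for anywhere below the root, complaint if absent).
RULES = [
    (".//sender", "no sender found"),
    (".//receiver", "no receiver found"),
    (".//total", "no total found"),
]


def check(xml_text, rules=RULES):
    """Return the complaints for every rule the document fails."""
    root = ET.fromstring(xml_text)
    return [message for path, message in rules if root.find(path) is None]


# Two differently shaped documents that both quack like an invoice.
doc_a = (
    "<invoice><header><sender>Acme</sender><receiver>Widget Co</receiver>"
    "</header><total>100</total></invoice>"
)
doc_b = (
    "<bill><from><sender>Acme</sender></from><receiver>Widget Co</receiver>"
    "<amount><total>100</total></amount></bill>"
)
# One that does not.
doc_c = "<memo><sender>Acme</sender></memo>"

complaints = [check(doc) for doc in (doc_a, doc_b, doc_c)]
```

The first two documents share no grammar, not even a root element name, yet both satisfy every rule. Only the memo draws complaints.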