IT World

Open source is great but we really need open data

By Sean Mc Grath

Some time ago I heard a radio interview with John Searle[1], the philosopher from the University of California at Berkeley. After a perfunctory gambit of pleasantries from the interviewer, the conversation jumped straight into a request for an explanation of what is known as the Chinese Room argument[2]. The Chinese Room argument is a thought experiment invented by Mr. Searle as a means of debating issues in Artificial Intelligence (more on this in a moment).

With as much grace and patience as he could muster (I'm paraphrasing here), Mr. Searle pointed out to the interviewer that the Chinese Room argument is but one very small part of his oeuvre, that it was a long time ago, and that there is so much else in his work . . . and then humored the interviewer with a short explanation of the Chinese Room argument.

Note to Searle fans - I know that there is much more to Searle's thought and work than the Chinese Room argument. But, alas and alack, like the Mr. Spock dualism that plagues Leonard Nimoy, I suspect that Mr. Searle will always be associated with the Chinese Room argument in the minds of the masses. (To my mind the most interesting aspect of Searle's work - and the part that will be most relevant to enterprise IT - is Speech Act Theory[3]. More on that subject in a future column.)

Now fear not, gentle reader! Fear not the whooshing noise of dense philosophical argument. This column has essentially *nothing* to do with Artificial Intelligence. Relax. There is no need to let your eyes glaze over just yet. Today, I want to talk about something much more mainstream and relevant to the daily grind of commercial IT, and I propose to use a variant on Searle's Chinese Room argument along the way.

First, a ruthlessly short and necessarily selective explication of the Chinese Room argument. Picture the scene. You are in a closed box shut off from the outside world.
The box has two holes, large enough for cards with symbols written on them to pass in and out of the box. You are a monolingual English speaker. You have a set of rules in your head that tell you what card(s) to send out of the box based on the cards that are sent into the box. The symbols on the cards are in Chinese script. The "algorithm" you are executing - matching symbols and applying rules - results in the person outside the box - a Chinese speaker - concluding that the entity inside the box can read Chinese script and understand Chinese. You can't. Does it mean that the box 'understands' language in any meaningful sense?

Let us skip the avalanche of philosophy about AI that rumbles around this analogy and ask a completely different question of this scenario. What if the symbols going in and out of the box represent arbitrary pieces of data in your organization? Imagine if all the data going in and out of the box were represented semantically on the cards. Perhaps in XML conforming to a clearly documented schema. That is to say, any IT person from your organization can read and understand the meaning of the data by simply looking at the cards.

In your capacity as someone charged with performing data processing for your organization, your concern is what data goes in and what data comes out of the box. Your primary focus is that data. Do you care what goes on inside the box? I contend that although you may care what is inside the box, your primary concern is the data that goes in and out of the box. If you do not control that aspect of the data processing function you have a real problem. The problem is that you don't understand (and therefore don't really 'own') your own data. You can only interpret it with recourse to some black box that does the interpreting for you. Not good. Shouldn't that issue take precedence in your mind over any desire to know what is inside the box?
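To make the idea of a card that any IT person can read a little more concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: the element names (`order`, `customer`, `item`, `quantity`) and the `read_card` helper are hypothetical and do not come from any real schema. The only point is that a self-describing card can be understood with a general-purpose XML parser, without asking the box that produced it.

```python
import xml.etree.ElementTree as ET

# A hypothetical "card" leaving the box. The element names are invented
# for illustration; they stand in for whatever a documented schema defines.
CARD = """
<order id="1001">
  <customer>Acme Ltd</customer>
  <item sku="W-42">
    <description>Widget</description>
    <quantity>3</quantity>
  </item>
</order>
"""


def read_card(xml_text):
    """Turn a card into a plain dictionary using only a generic XML parser."""
    root = ET.fromstring(xml_text.strip())
    item = root.find("item")
    return {
        "order_id": root.get("id"),
        "customer": root.findtext("customer"),
        "sku": item.get("sku"),
        "description": item.findtext("description"),
        "quantity": int(item.findtext("quantity")),
    }


card = read_card(CARD)
print(card)
```

Nothing in `read_card` depends on the program that wrote the card, which is the sense of 'open' at issue here. Had the card been an undocumented binary blob, the same job would require the original application, or a good deal of guesswork.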
Which is more important to your business - that your data be 'open' or that your application programs be 'open'? I would suggest that the former is the case in the majority of businesses. And yet, we live in an age where open source - not open data - is the hot topic.

Yes, I know that there are numerous benefits to open source from quality, reliability, longevity, risk control, and umpteen other perspectives. Believe me, I'm a believer in open source. I use lots of it and I contribute some as best I can, when I can. However, with my business hat on, I am primarily focused on open data and only secondarily focused on open source.

My first question with any piece of software, open source or closed source, free or expensive, is this - "What data goes in and what data comes out? Do I understand everything about the data on both sides without having to ask the box again?" If I fully understand the data on both sides of the box, I'm happy. After that, access to the source is sure nice but it is not a must-have.

I have seen open source software where the data may as well have been completely proprietary, binary goo, for all the meaning I could attribute to it outside of the application that created it. Equally, I have seen completely closed-box, proprietary systems with very open data interfaces that made it easy to integrate the proprietary system with other open/closed systems. The power to do integration comes primarily from full disclosure of the data - not from full disclosure of the algorithms.

Open source is great and may well take over the world, but what we really need is open data. It is worth remembering that open data is not an automatic byproduct of open source and that closed data is not an automatic byproduct of closed source.

[1] http://ist-socrates.berkeley.edu/~jsearle/