A couple of weeks ago I discovered Clojure. Clojure is a Lisp dialect implemented on top of the Java Virtual Machine with some really interesting persistent data structures. I have to admit that I started looking at it because its creator is named Rich Hickey (no relation that I am aware of).
Rich is a smart guy and I can recommend his talk Simplicity Matters, which he gave at RailsConf 2012. He is not a particular fan of object-oriented programming, primarily because it gets in the way of more direct manipulation of your data (see note below).
So, that sounded good. The things we do are all about data, so why not use the data structures directly rather than through a set of objects? There is certainly no doubt the objects can get in the way. Objects in Python are data, but the typical way of interacting with them is through a little custom API consisting of the object's methods.
I wasn't quite ready to dive into Clojure (not only is it a Lisp, but it appears that the deeper your knowledge of Java, the better), but it seemed as though you could take the same approach in Python, so I tried recoding the first stage of a reimplementation of WorldCat Identities we are working on. This stage collects heading-specific information such as workIDs, holdings, OCLC numbers, associated names and subject headings into an object called NameInfo.
The NameInfo object is fairly complicated with nested objects that have nested objects with Python sets and lists and dictionaries and Counters inside them. Since we do this processing in Map-Reduce on Hadoop, the objects need to be serialized. We often do this in XML, which means a custom-built writer/reader which can also get in the way of being flexible with the data.
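To make the "use the data structures directly" idea concrete, here is a minimal sketch of what such a record might look like as plain Python data. The field names, the helper functions and the sample values are all made up for illustration; the real NameInfo is considerably more deeply nested.

```python
from collections import Counter

def new_name_info(heading):
    """Return a plain dict for one heading (hypothetical field names)."""
    return {
        "heading": heading,
        "work_ids": set(),       # distinct works seen so far
        "oclc_numbers": [],      # one entry per record
        "holdings": 0,           # running total
        "subjects": Counter(),   # subject heading -> number of records
    }

def add_record(info, work_id, oclc_number, holdings, subjects):
    """Fold one record into the dict in place; a module-level function, not a method."""
    info["work_ids"].add(work_id)
    info["oclc_numbers"].append(oclc_number)
    info["holdings"] += holdings
    info["subjects"].update(subjects)
    return info

# Illustrative values only.
info = new_name_info("Twain, Mark, 1835-1910")
add_record(info, "w1", 1001, 250, ["Humor", "Fiction"])
add_record(info, "w1", 1002, 40, ["Fiction"])
```

Nothing stops other code from reading or reshaping `info` directly, which is both the convenience and the source of the confusion mentioned in the list below.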
Here's a summary of my reactions after a couple of days coding:
- The resulting code was substantially smaller (factor of 2)
- It was often convenient to act directly on the data rather than going through custom-built object methods
- Python modules are just as convenient for collecting routines to manipulate the data as they are for collecting related objects
- Working with the raw data structure tended to be a little confusing (some of this seems to be inherent in the relationships, so this may just be learning curve)
- Serialization is still a problem
On balance, even with decades of writing object-oriented code, I'd be tempted to switch to a more direct approach except for the serialization issues. Serialization would seem to be orthogonal to working with the data, but the objects are a very convenient place to park the serialization and parsing code. More to the point, when thinking about using this in production code, the XML serialization turned out to actually be faster and more compact than the standard Python ways of serializing data (repr and pickle). Plus, there are tools for looking at XML which are often very helpful.
Maybe my reactions would be different if I gave up on syntax and did it all in Clojure! Or maybe the trick would be to use Python objects that can act the way more fundamental data structures do and get the best of both worlds. One nice thing about Python is its flexibility, and over the last few years it has added features, such as making sure the base classes can be extended, that should make that possible. This may still be an approach worth trying.
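One way that "best of both worlds" idea might look is a subclass of the built-in dict that still behaves as plain data but gives the serialization code somewhere to live. This is only a sketch: the class name `Record` and its two methods are hypothetical, and JSON stands in here for whatever format is actually used.

```python
import json

class Record(dict):
    """A dict subclass: plain data to its callers, with serialization parked on it."""

    def dumps(self):
        # A dict subclass serializes exactly like an ordinary dict.
        return json.dumps(self, sort_keys=True)

    @classmethod
    def loads(cls, text):
        # Rebuild the top level as a Record; nested values stay plain dicts/lists.
        return cls(json.loads(text))

# Illustrative values only.
rec = Record(heading="Twain, Mark", holdings=290)
rec["work_ids"] = ["w1", "w2"]      # ordinary dict access still works

text = rec.dumps()
copy = Record.loads(text)
```

Code that only knows about dictionaries can use `rec` unchanged, while code that needs to write it out can call `rec.dumps()`.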
--Th
Note:
Implementations of Clojure are also available on the CLR and JavaScript. My very limited trials in Clojure made it seem slower than I hoped. According to Clojure experts it is possible to approach the speed of Java if you are willing to put in hints and use the Java libraries more directly. Another language I looked at recently is Julia. Their approach is to make everything fast, even though it is a dynamic language. Supposedly, straightforward Julia code approaches the speed of C.
JSON is another serialization possibility, and as of 2.6 is included in Python, but it doesn't support sets.
Update 2013.04.17: It turns out that it is fairly easy to extend the Python JSON encoder/decoder to handle sets. Unfortunately, it is harder to handle Counters, which look like dictionaries to the encoder and serialize fine, but with no indication that they are Counters.
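A minimal sketch of that extension, using the standard `default` and `object_hook` parameters of the json module. The `"__set__"` tag and the two function names are my own convention rather than anything built in, and the example assumes the set's elements can be sorted.

```python
import json
from collections import Counter

def encode_extra(obj):
    """Called by json.dumps only for objects it cannot serialize on its own."""
    if isinstance(obj, (set, frozenset)):
        # Tag the list so the decoder can tell it used to be a set.
        return {"__set__": sorted(obj)}
    raise TypeError("Cannot serialize " + type(obj).__name__)

def decode_extra(dct):
    """Called by json.loads for every JSON object it decodes."""
    if set(dct) == {"__set__"}:
        return set(dct["__set__"])
    return dct

# Illustrative values only.
data = {
    "work_ids": {"w2", "w1"},
    "subjects": Counter(["Fiction", "Fiction", "Humor"]),
}

text = json.dumps(data, default=encode_extra, sort_keys=True)
restored = json.loads(text, object_hook=decode_extra)
```

The set survives the round trip, but the Counter does not: because it is a dict subclass, the encoder never hands it to `encode_extra`, so `restored["subjects"]` comes back as an ordinary dict with the right counts and no trace of its original type.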
Update 2013.04.19: Rich Hickey's problems with typical objects aren't just hiding the data, but also the mutability of the objects without any concept of 'time'. See his talk: http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey.
I have been using YAML for serialization, in Python, Perl, and Java, for several years now. The recent issues with Ruby's YAML implementation have not changed my mind about it, but they have made me use the "safe" load and dump routines, which greatly reduce the risks.
Posted by: Devon Smith | April 17, 2013 at 06:30