For our processing of bibliographic records with Python, one of the more useful classes we've developed is something we call flat files. A flat file has a very simple structure. It is composed of lines separated by line feeds, and each line holds a key and a value separated by a tab. The keys are required to be unique, and the lines are sorted by key. Like Python dictionaries, the format is simple but very useful.
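To make the format concrete, here is a minimal sketch of writing and checking such a file. The function names `build_flat_file` and `check_flat_file` are made up for illustration and are not the names used in our class; the sketch also assumes keys and values are plain strings compared with Python's default string ordering.

```python
def build_flat_file(records):
    """Serialize a dict of str -> str into flat-file text.

    Hypothetical helper: one 'key<TAB>value' line per record,
    lines sorted by key, each terminated by a line feed.
    """
    lines = []
    for key in sorted(records):  # dict keys are already unique
        value = records[key]
        if "\t" in key or "\n" in key or "\n" in value:
            raise ValueError(f"record {key!r} contains a delimiter")
        lines.append(f"{key}\t{value}\n")
    return "".join(lines)


def check_flat_file(text):
    """Return True if every line has a tab and keys are strictly increasing."""
    previous = None
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line feed
    for line in lines:
        key, tab, _value = line.partition("\t")
        if not tab:
            return False  # no key/value separator on this line
        if previous is not None and key <= previous:
            return False  # duplicate or out-of-order key
        previous = key
    return True
```

Strictly increasing keys cover both requirements at once: a duplicate key would compare equal to its predecessor, and an unsorted key would compare lower.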
We use this structure for loading our authority files into memory for processing. The whole file is read in as a single string, and then a binary chop (binary search) over that string is used to find the data associated with a particular key (well, actually we do a little more than that now).
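The lookup can be sketched roughly as follows. This is not our production code, just one way to binary-search a single string without splitting it: pick a midpoint offset, back up to the start of the line containing it, and compare that line's key. The name `flat_lookup` is hypothetical, and the sketch assumes every line contains a tab.

```python
def flat_lookup(data, key):
    """Return the value stored for key in flat-file text, or None.

    data is the whole file as one string; lo and hi are character
    offsets that always sit on line boundaries.
    """
    lo, hi = 0, len(data)
    while lo < hi:
        mid = (lo + hi) // 2
        # Snap the midpoint back to the start of the line it falls in.
        start = data.rfind("\n", 0, mid) + 1
        end = data.find("\n", start)
        if end == -1:
            end = len(data)  # last line has no trailing line feed
        tab = data.find("\t", start, end)
        if tab == -1:
            raise ValueError(f"line at offset {start} has no tab")
        line_key = data[start:tab]
        if line_key == key:
            return data[tab + 1:end]
        if line_key < key:
            lo = end + 1   # continue in the lines after this one
        else:
            hi = start     # continue in the lines before this one
    return None


sample = "a\t1\nb\t2\nc\t3\n"
print(flat_lookup(sample, "b"))
```

The only strings created during a search are the short key slices, so the memory cost stays at the size of the file itself.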
The main advantage these files have is that they are very compact. In Python it would be easy to split the file into a list of lines, but every line would then become a separate string object with its own overhead, and for our typically short lines this would probably double the memory needed. Since our Python processes are limited to 2 gigabytes of memory (and seem to get sluggish when they get much beyond a single gigabyte), this is a real consideration.
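A rough way to see the overhead for yourself is to compare the size of the single string with the size of the list plus all of its line strings. This is only a sketch: `memory_footprint` is a made-up name, it relies on `sys.getsizeof` as implemented in CPython, and the ratio you get depends on the Python version and on how long your lines are.

```python
import sys


def memory_footprint(data):
    """Return (bytes for one big string, bytes for a list of line strings)."""
    single = sys.getsizeof(data)
    lines = data.split("\n")
    # The list holds one pointer per line, and each line is its own object.
    split_total = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)
    return single, split_total


# Synthetic short records, similar in spirit to short authority-file lines.
sample = "".join(f"k{i:05d}\tv{i}\n" for i in range(1000))
single, split_total = memory_footprint(sample)
print(single, split_total, round(split_total / single, 2))
```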
I coded up the first version of this, but Jenny Toves has done a lot of work on it since. We doubled the speed (from 12,000 to 25,000 retrievals per second) by processing the file on input and creating an index over the first few levels of the binary search. More interestingly, for our FRBR work, Jenny has a distributed version of FlatFile which runs on our Beowulf cluster. This is coded with MPI and supports about 600 retrievals per second across the cluster. We haven't released this code as open source yet, but if anyone's interested I expect we could. We've also done some experiments in doing this over HTTP, which I'll report on later.
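The indexing idea can be sketched like this. It is an approximation rather than the actual FlatFile code: instead of caching the exact midpoints the first few levels of the search would visit, it samples evenly spaced lines when the file is loaded and uses `bisect` on those sampled keys to narrow the range before searching the string. The class name `IndexedFlatFile` and the `levels` parameter are hypothetical, and every line is assumed to contain a tab.

```python
import bisect


class IndexedFlatFile:
    """Flat-file text plus a small in-memory index of sampled keys."""

    def __init__(self, data, levels=4):
        self.data = data
        self.keys = []      # sampled keys, in sorted order
        self.offsets = []   # line-start offset for each sampled key
        slots = 2 ** levels
        for i in range(1, slots):
            pos = len(data) * i // slots
            start = data.rfind("\n", 0, pos) + 1  # snap to line start
            if self.offsets and start <= self.offsets[-1]:
                continue  # small file: this line was already sampled
            tab = data.find("\t", start)
            if tab == -1:
                continue  # empty file or nothing left to sample
            self.offsets.append(start)
            self.keys.append(data[start:tab])

    def get(self, key):
        """Return the value for key, or None if it is absent."""
        # The index replaces the first few comparisons of the search.
        i = bisect.bisect_right(self.keys, key)
        lo = self.offsets[i - 1] if i > 0 else 0
        hi = self.offsets[i] if i < len(self.offsets) else len(self.data)
        data = self.data
        while lo < hi:
            mid = (lo + hi) // 2
            start = data.rfind("\n", 0, mid) + 1
            end = data.find("\n", start)
            if end == -1:
                end = len(data)
            tab = data.find("\t", start, end)
            line_key = data[start:tab]
            if line_key == key:
                return data[tab + 1:end]
            if line_key < key:
                lo = end + 1
            else:
                hi = start
        return None
```

With `levels=4` the index holds at most 15 keys, so it costs almost nothing in memory, while each lookup starts from a range roughly one sixteenth the size of the file.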
If you are interested in the 'Birch flat file' pictured above, contact Wilson Woodworking.
--Th