In a previous post on FlatFiles, I mentioned that we have developed a distributed version using MPI to work with mapreduce. In this model an application first checks to see if the key of the record it wants is in its own address space. If not, it broadcasts a request for the record. Listeners on all the nodes check whether they have it, and the one that does returns it.
I wondered if HTTP could be used instead of MPI. We always like to use the standard Python libraries when possible for portability and ease of maintenance. In the application I modeled for this, the key is the OCLC number and the data is the associated bibliographic record (currently we have about 60 million of these).
Although we started out thinking of a FlatFile implementation, it became fairly obvious that some optimizations were possible. Since the OCLC numbers are 'dense', a simple Python array with an offset to the first OCLC number in a group gives a fast and easy lookup. The records are long enough (~1,000 bytes) that the overhead of storing them in an array is low. A Python dictionary probably would have worked as well, since dictionary lookups seem nearly as fast as array accesses in Python.
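The dense-key lookup can be sketched in a few lines: subtracting the offset of the first key in a group indexes directly into a list, with no hashing or searching. The class and sample data here are illustrative assumptions, not our actual code.

```python
# Dense-key lookup: key minus the group's first key is a direct index.
class DenseRecordStore:
    def __init__(self, first_key, records):
        self.first_key = first_key   # smallest OCLC number in this group
        self.records = records       # records in key order

    def lookup(self, key):
        i = key - self.first_key     # O(1) index, no hashing needed
        if 0 <= i < len(self.records):
            return self.records[i]
        return None                  # key belongs to some other node

store = DenseRecordStore(1000, ["rec-1000", "rec-1001", "rec-1002"])
print(store.lookup(1001))   # -> rec-1001
print(store.lookup(5))      # -> None (not in this group)
```

A dictionary would replace the subtraction with a hash lookup; for dense keys the array wins on memory and is at least as fast.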
A straightforward implementation using Python's built-in HTTP server resulted in a server capable of responding to about 300 record requests per second on my 3 gigahertz Windows machine. That's pretty fast, but about half the speed of a remote-access MPI request. Urlgrabber offers a pure-Python extension that enables keep-alive for HTTP connections; that doubles the speed on my Windows machine to around 600 requests/second.
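A record server on the standard-library HTTP server might look like the following sketch. The handler, the GET-by-key URL scheme, and the toy record table are assumptions for illustration; declaring HTTP/1.1 (with Content-Length) is what lets clients hold the connection open between requests.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

RECORDS = {1000: b"rec-1000", 1001: b"rec-1001"}  # toy stand-in for 60M records

class RecordHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"          # allows keep-alive connections

    def do_GET(self):
        # URL scheme assumed here: GET /<oclc number>
        key = int(self.path.lstrip("/"))
        body = RECORDS.get(key, b"")
        self.send_response(200 if body else 404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):          # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RecordHandler)  # 0 = ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/1001" % server.server_address[1]
data = urllib.request.urlopen(url).read()
print(data)   # -> b'rec-1001'
server.shutdown()
```

Each request still pays one round trip, which is why keep-alive (avoiding a new TCP connection per request) roughly doubles the rate.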
The limiting factor seems to be more the turn-around time, rather than the data transfer. If an application could ask for 10 records at a time, the throughput increases substantially. Here are some rough figures:
| Request Size in Records | Requests/Second | Records/Second |
|---|---|---|
Urlgrabber's keep-alive didn't seem to work on our Linux cluster, although the response rate there is still about 600 requests/second. It doesn't seem to matter much whether the client and server reside on one machine or two.
If we can get something like 500 lookups/second on each of 24 nodes, that's more than 10,000/second across our cluster, or about 2 minutes per million records. At that rate, 60 million records would take around 2 hours. Not super impressive compared to a 3-second grep of the file, but this is random access, and it is competitive with the MPI implementation. If the applications are doing sequential reads, they might be able to ask for many records at a time, reducing the total to just a couple of minutes or less (my best time to read a million records on my workstation is 12 seconds).
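The back-of-the-envelope figures above work out as follows (24 nodes at 500 lookups/second actually gives 12,000/second, which the text rounds down to 10,000):

```python
# Rough throughput arithmetic for the cluster estimate above.
per_node = 500                        # lookups/second on one node
nodes = 24
cluster = per_node * nodes            # 12,000 lookups/second
sec_per_million = 1_000_000 / cluster # ~83 s, i.e. under 2 minutes
total_hours = 60_000_000 / cluster / 3600
print(cluster, round(sec_per_million), round(total_hours, 1))
```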