In a previous post on FlatFiles, I mentioned that we have developed a distributed version using MPI to work with MapReduce. In this model an application first checks to see if the key of the record it wants is in its own address space. If not, it broadcasts a request for the record. Listeners on all the nodes check to see if they have it, and the one that does returns it.
I wondered if HTTP could be used instead of MPI. We always like to use the standard Python libraries when possible for portability and ease of maintenance. In the application I modeled for this, the key is the OCLC number and the data is the associated bibliographic record (currently we have [about] 60 million of these).
Although we started thinking of a FlatFile implementation, it became fairly obvious that some optimizations were possible. Since the OCLC numbers are 'dense', a simple Python array with an offset to the first OCLC number in a group gives a fast and easy lookup. The records are long enough (~1,000 bytes) that the overhead of storing them in an array is low. Using a Python dictionary probably would have worked as well, since dictionary lookups seem nearly as fast as array accesses in Python.
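As a rough illustration of the dense-key idea, here is a minimal sketch; the names (BASE_OCLC, records, load, lookup) and the one-record-per-line file layout are assumptions for illustration, not the actual implementation.

```python
# Sketch of the dense-key lookup. BASE_OCLC, records, and load() are
# illustrative names, not taken from the actual implementation.
BASE_OCLC = 1000000            # hypothetical first OCLC number in this group
records = []                   # records[i] holds the record for BASE_OCLC + i

def load(path):
    # Assume one bibliographic record per line, stored in OCLC-number order.
    with open(path, "rb") as f:
        for line in f:
            records.append(line.rstrip(b"\n"))

def lookup(oclc_number):
    # Subtracting the offset turns a dense OCLC number into an array index.
    i = oclc_number - BASE_OCLC
    if 0 <= i < len(records):
        return records[i]
    return None
```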
A straightforward implementation using Python's built-in HTTP server resulted in a server capable of responding to about 300 requests for records per second on my 3 GHz Windows machine. That's pretty fast, but about half the speed of a remote-access MPI request. Urlgrabber offers a pure-Python extension that enables keep-alive for HTTP connections. That doubles the speed on my Windows machine to around 600 requests/second.
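For concreteness, here is a minimal sketch of what the server side might look like. It is written against Python 3's http.server (the original would have used the older BaseHTTPServer module); the /record/&lt;oclc-number&gt; URL scheme and the lookup() helper from the sketch above are illustrative assumptions, not the actual interface.

```python
# Minimal record-server sketch using Python's built-in HTTP server.
# The /record/<oclc-number> URL scheme and the lookup() helper are
# illustrative assumptions, not the interface described in the post.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class RecordHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # required for keep-alive connections

    def do_GET(self):
        # Expect paths of the form /record/12345
        parts = self.path.strip("/").split("/")
        record = None
        if len(parts) == 2 and parts[0] == "record" and parts[1].isdigit():
            record = lookup(int(parts[1]))   # lookup() from the sketch above
        if record is None:
            self.send_error(404, "record not found")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(record)))
        self.end_headers()
        self.wfile.write(record)

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8000), RecordHandler).serve_forever()
```

Declaring HTTP/1.1 and sending an explicit Content-Length is what lets a client hold the connection open between requests, which is the same effect urlgrabber's keep-alive provides on the client side.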
The limiting factor seems to be the turn-around time rather than the data transfer. If an application can ask for 10 records at a time, throughput increases substantially. Here are some rough figures:
| Request Size (records) | Requests/Second | Records/Second |
|---|---|---|
| 1 | 600 | 600 |
| 10 | 560 | 5,600 |
| 100 | 300 | 30,000 |
| 1,000 | 60 | 60,000 |
| 10,000 | 8 | 80,000 |
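To give a sense of how batching amortizes the turn-around cost, here is a client sketch that asks for records in blocks over a single persistent connection. The /records?start=N&count=M query form, the one-record-per-line reply, and the number range are assumptions for illustration (the server sketch above would need a matching handler).

```python
# Client sketch: fetch records in batches over one persistent connection.
# The /records?start=N&count=M query and the one-record-per-line reply are
# illustrative assumptions about the protocol, not the real interface.
import http.client

def fetch_batch(conn, start, count):
    # One round trip fetches `count` records starting at OCLC number `start`.
    conn.request("GET", "/records?start=%d&count=%d" % (start, count))
    resp = conn.getresponse()
    body = resp.read()          # drain the body so the connection can be reused
    if resp.status != 200:
        return []
    return body.split(b"\n")

if __name__ == "__main__":
    # With HTTP/1.1 keep-alive, the connection setup and turn-around cost is
    # paid once per batch rather than once per record.
    conn = http.client.HTTPConnection("localhost", 8000)
    total = 0
    for start in range(1000000, 1010000, 1000):   # hypothetical OCLC range
        total += len(fetch_batch(conn, start, 1000))
    conn.close()
    print(total, "records fetched")
```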
Urlgrabber's keep-alive didn't seem to work on our Linux cluster, although the rate there is still about 600 requests/second. It doesn't seem to matter much whether the client and server reside on one machine or two.
If we can get something like 500 lookups/second on each of 24 nodes, that's more than 10,000/second across our cluster, or about 2 minutes per million records. 60 million records would take around 2 hours. Not super impressive compared to a 3-second grep of the file, but this is random access, and it is competitive with the MPI implementation. If the applications are doing sequential reads, they might be able to ask for many records at a time, reducing the time to read the records to just a couple of minutes or less (my best time to read a million records on my workstation is 12 seconds).
--Th
Hi there, I came across your site while looking for information about map-reduce using MPI, and I was wondering what your take is on the 'hadoop' way of doing things, moving the code to the data. Apparently they migrate the programs to be as close to the data as possible during the 'map' stage, which to me seems a good way of doing things. Unfortunately they decided to write their stuff in Java so now you need all kinds of silly trickery to access the data from non-java programs. 10 for the idea, but -3 for implementation...
best regards,
Jacques Mattheij
ww.com
Posted by: Jacques Mattheij | August 31, 2007 at 19:20
We typically spread our data out across the cluster so that the input files for the mappers are on the node the mapper runs on.
We've looked at Hadoop some. Our implementation is a lot simpler, but we think we will have scaling issues if we move into thousands of nodes (currently we have 132 CPUs on 33 nodes). One of the things that Hadoop does is implement its own file system. This has some advantages, but adds overhead and may be part of what Jacques is talking about.
--Th
Posted by: ThomasBHickey | September 04, 2007 at 10:09