

Jacques Mattheij

Hi there, I came across your site while looking for information about map-reduce using MPI, and I was wondering what your take is on the Hadoop way of doing things: moving the code to the data. Apparently they migrate the programs to be as close to the data as possible during the 'map' stage, which seems to me a good way of doing things. Unfortunately they decided to write their stuff in Java, so now you need all kinds of silly trickery to access the data from non-Java programs. 10 for the idea, but -3 for the implementation...

best regards,

Jacques Mattheij


We typically spread our data out across the cluster so that each mapper's input files are on the node that mapper runs on.
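
To make the placement idea concrete, here is a minimal Python sketch. It is only an illustration, not the implementation described in this reply: the file names, node names, and the functions `plan_local_maps`, `count_words`, and `merge_counts` are all hypothetical, and the "cluster" is simulated in one process rather than run over MPI.

```python
from collections import defaultdict


def plan_local_maps(file_locations, nodes):
    """Group input files by the node that already stores them.

    file_locations: dict of file name -> node name (hypothetical layout).
    nodes: list of node names in the cluster.
    Returns a dict of node name -> sorted list of files to map on that node.
    """
    plan = {node: [] for node in nodes}
    for path, node in sorted(file_locations.items()):
        if node not in plan:
            raise ValueError(f"{path} is stored on unknown node {node}")
        plan[node].append(path)
    return plan


def count_words(lines):
    """Map step: count words in the lines of one local input file."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return dict(counts)


def merge_counts(partials):
    """Reduce step: add up the partial counts produced by the mappers."""
    total = defaultdict(int)
    for partial in partials:
        for word, count in partial.items():
            total[word] += count
    return dict(total)


if __name__ == "__main__":
    # Hypothetical layout: which node holds which input file.
    locations = {"part-0": "node1", "part-1": "node2", "part-2": "node1"}
    contents = {
        "part-0": ["marc record", "marc field"],
        "part-1": ["record field"],
        "part-2": ["marc"],
    }
    plan = plan_local_maps(locations, ["node1", "node2"])
    # Each node maps only the files it stores; only the small
    # partial results would need to cross the network.
    partials = [count_words(contents[path])
                for node in plan for path in plan[node]]
    print(plan)
    print(merge_counts(partials))
```

The point of the sketch is that only the small per-file counts have to move between nodes; the input files themselves stay where they are.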

We've looked at Hadoop some. Our implementation is a lot simpler, but we think we will have scaling issues if we move to thousands of nodes (currently we have 132 CPUs on 33 nodes). One of the things that Hadoop does is implement its own file system. This has some advantages, but it adds overhead and may be part of what Jacques is talking about.


