Jeffrey Dean and Sanjay Ghemawat of Google have written a paper about a method of processing large data sets they call MapReduce.
Many will be familiar with the functional programming constructs of map and reduce. Map applies a function to each element of a list to get a transformed version of the list. For example, in Python, map(chr, [97,98,99]) transforms a list of three numbers into a list containing the equivalent characters:
>>> map(chr, [97,98,99])
['a', 'b', 'c']
It's as if you executed [chr(97),chr(98),chr(99)].
Reduce takes a function and applies it cumulatively to the items in a list, resulting in a single value:
>>> import operator
>>> reduce(operator.add, ['a','b','c'])
'abc'
This is the string formed by the operations ('a'+'b')+'c'. This programming style lends itself naturally to nesting:
>>> reduce(operator.add, map(chr, [97,98,99]))
'abc'
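The transcripts above come from Python 2. As a minimal sketch for Python 3, where map returns a lazy iterator and reduce has to be imported from functools, the same three steps might look like this (the variable names are ours, chosen for illustration):

```python
from functools import reduce
import operator

codes = [97, 98, 99]

# map is lazy in Python 3, so wrap it in list() to see the result
letters = list(map(chr, codes))

# reduce folds the list left to right: ('a' + 'b') + 'c'
joined = reduce(operator.add, letters)

# the two calls nest, with map feeding reduce directly
nested = reduce(operator.add, map(chr, codes))

print(letters, joined, nested)
```

Because reduce accepts any iterable, the nested form never builds the intermediate list.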
The functional aspects of these operations are similar to Unix filters, where data gets piped from one filter to another. Here's a pipeline that will take a file of MARC21 records, transform the end-of-record markers to line-feeds, select records with the word 'smollet' in them and then count them:
cat clinker.marcu | tr '\035' '\n' | grep -iw 'smollet' | wc -l
Comparing this to map/reduce, the cat, tr, and grep commands are similar to map, and the wc command is similar to reduce.
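To make the comparison concrete, here is a rough Python sketch of the same pipeline. The sample data is a hypothetical stand-in for the contents of a MARC file, and the regular expression only approximates what grep -iw does:

```python
import re

# Hypothetical stand-in for a MARC file: three records, each ended by
# the end-of-record marker (octal 035, hex 1D).
data = "The works of Smollet\x1dPride and prejudice\x1dSMOLLET and his circle\x1d"

# tr '\035' '\n': split the stream into one item per record
records = [r for r in data.split("\x1d") if r]

# grep -iw 'smollet': keep records containing the word, ignoring case
word = re.compile(r"\bsmollet\b", re.IGNORECASE)
matches = [r for r in records if word.search(r)]

# wc -l: the reduce-like step that collapses the list to one number
count = len(matches)
print(count)
```

The first two steps each turn a list into another list, and only the last step collapses the result to a single value.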
In the Google model, given a set of key/value pairs, the map function produces a new set of key/value pairs based on a function supplied by the programmer. The reduce function collapses all the values for a given key to a single value. Google has found that offering a robust implementation of this that can run in a massively parallel environment (thousands of nodes) has made it possible to routinely process huge files in many different ways. The slides offer a good overview of their work.
Here's a more involved example written in Python that closely follows the Google approach:
First we need a list to process as input:
input = ((1,'boy'),(2,'dog'),(3,'cat'),(4,'aardvark'), (5,'cat'))
This is a list of 5 key-value pairs. You might think of the key as record number and the string as the record.
Here's our map function. It takes in a list of key-value pairs, such as our input, and for each string that contains an 'a' it emits a new pair with the string as the key and the record number as the value:
def myMap(gen): return ( (v,k) for k,v in gen if v.find('a')!=-1)
For our input list, this returns:
(('cat', 3), ('aardvark', 4), ('cat', 5))
Next this list gets grouped so that all the record numbers for each word are collected together. You can find the code to do this at the end of the post. Here's the grouped list it outputs:
(('aardvark', [4]), ('cat', [3, 5]))
This shows that 'aardvark' occurred in record 4, 'cat' in records 3 and 5.
Finally, here's a reduce function that outputs each word with a count:
def myReduce(gen): return ((k, len(v)) for k,v in gen)
From the grouped results this will generate:
(('aardvark', 1), ('cat', 2))
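The three stages chain together much like the nested map/reduce call earlier. Here is a minimal sketch of the whole pipeline for Python 3. Note that group_pairs is a hypothetical stand-in built on itertools.groupby rather than the group function listed at the end of the post, and the input is named records to avoid shadowing the built-in input:

```python
from itertools import groupby
from operator import itemgetter

records = ((1, 'boy'), (2, 'dog'), (3, 'cat'), (4, 'aardvark'), (5, 'cat'))

def myMap(gen):
    # emit (word, record number) for every word containing an 'a'
    return ((v, k) for k, v in gen if v.find('a') != -1)

def group_pairs(gen):
    # hypothetical stand-in for the post's group step;
    # groupby merges only adjacent keys, so sort first
    for key, run in groupby(sorted(gen), key=itemgetter(0)):
        yield (key, [value for _, value in run])

def myReduce(gen):
    # collapse each list of record numbers to a count
    return ((k, len(v)) for k, v in gen)

# each stage is a lazy generator; tuple() drives the whole pipeline
result = tuple(myReduce(group_pairs(myMap(records))))
print(result)
```

Since every stage is a generator, nothing is computed until the final tuple() call pulls the results through.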
What Google has done is take the map/reduce paradigm and make it work in parallel in their environment of thousands of millions of records. Our work with our own (somewhat smaller scale) Beowulf cluster made us think we could put many of their concepts to use in our own processing of tens of millions of bibliographic records. (Actually, OCLC has more than a thousand million records, but we don't maintain those online yet.)
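As a toy illustration of why the map step parallelizes so easily, the sketch below splits the sample input into slices and maps each slice in its own worker. It assumes a thread pool on one machine as a stand-in for cluster nodes, which is nothing like Google's implementation or our Beowulf setup; the names map_chunk and chunks are ours:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

records = [(1, 'boy'), (2, 'dog'), (3, 'cat'), (4, 'aardvark'), (5, 'cat')]

def map_chunk(chunk):
    # the same map step as above, applied to one slice of the input
    return [(v, k) for k, v in chunk if 'a' in v]

# split the input into slices of two records each
chunks = [records[i:i + 2] for i in range(0, len(records), 2)]

# each slice is mapped independently; pool.map returns results in input order
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = [pair for part in pool.map(map_chunk, chunks) for pair in part]

# group and reduce in one place once the parallel map has finished
grouped = defaultdict(list)
for word, recno in mapped:
    grouped[word].append(recno)
counts = {word: len(recnos) for word, recnos in grouped.items()}
print(counts)
```

The map calls share no state, so the slices could just as well be handed to separate processes or machines; only the grouping step needs to see all the mapped pairs for a given key.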
More on our own work with MapReduce (which we are doing in Python) in a subsequent post.
Here is all the code in one spot, including the group function that is run between map and reduce:
input = ((1,'boy'),(2,'dog'),(3,'cat'),(4,'aardvark'), (5,'cat'))
def doMap(gen): return ( (v,k) for k,v in gen if v.find('a')!=-1)
def doReduce(gen): return ((k, len(v)) for k,v in gen)
def group(gen): # accept a list of key,value pairs
    sl = sorted(list(gen)) # sort
    if not sl: return # might be empty
    rkey, rlist = sl[0][0], [sl[0][1]] # a key and list to return
    for k,v in sl[1:]: # process rest of sorted list
        if k == rkey: # same key as the previous pair
            rlist.append(v) # extend the list for this key
        else:
            yield (rkey, rlist) # output key & list
            rkey, rlist = k, [v] # start next key & list
    yield (rkey, rlist) # output last key & list
--Th & Jenny Toves