[Tutor] MapReduce

Mon Feb 5 21:54:27 CET 2007

Hello,

I have to give a presentation this week on how the MapReduce (of
Google and Hadoop fame) algorithm works.

I understand how map() works, and how reduce() works, and having read
the google papers, I have an idea of their implementation (which I
must say takes certain liberties with FP-derived terminology).

As I understand it, we have a "map" function which is passed a set of
key/value pairs, which, by applying a user function, produces an
intermediate set of key-value pairs.

I've come up with an example... return the url and how frequently
occurs if the url contains a given word.  I'm planning to pass a "map"
function an indexed list of urls:
>>> stuff
[(1, 'http://www.beer.com'), (2, 'http://www.ban-beer.com'), (3,
'http://www.bbc.co.uk'), (4, 'http://www.beer.com'), (5,
'http://wwww.kernel.org')]

And I have a map function with reverses the keys and values if the
word is present:

def myMap(stuff):
  return [(url, key) for key, url in stuff if "beer" in url]

This does as expected:

>>> myMap(stuff)
[('http://www.beer.com', 1), ('http://www.ban-beer.com', 2),
('http://www.beer.com', 4)]

What I want to do is now "group" these urls so that repeated urls have
as their "partner" a lsit of indexes.  To take a test example of the
method I have in mind:

def testGrouper(self):
    """Group occurences of a record together"""
    test_list = [('fred', 1), ('jim', 2), ('bill', 3), ('jim', 4)]
    grouped_list = [('fred', 1), ('jim', [2, 4]), ('bill' ,3)]
    self.assertEqual(myGroup(test_list), grouped_list)

I have seen some code which purports to do this, but I thought it was
ugly, and it seemed to return a generator object rather than a list.

I refactored it for clarity thus:

def myGroup(stuff):
  sorted_list = sorted(stuff)
  if not sorted_list:
    return
  reduce_key, reduce_list = sorted_list[0][0], sorted_list[0][1]
  for key, url in sorted_list[1:]:
    if key == reduce_key:
      reduce_list.append(value)
    else:
      yield (remote_key, remote_list)
      remote_key, remote_list = k, [v]
  yield(remote_key, remote_list)

Does this make sense?  I would like a clearer, more attractive way of
making the test pass.  If this can be done in functional style, even
better.

Your help is always appreciated!

S.