sanelson at gmail.com
Mon Feb 5 21:54:27 CET 2007
I have to give a presentation this week on how the MapReduce (of
Google and Hadoop fame) algorithm works.
I understand how map() works, and how reduce() works, and having read
the google papers, I have an idea of their implementation (which I
must say takes certain liberties with FP-derived terminology).
As I understand it, we have a "map" function which is passed a set of
key/value pairs, which, by applying a user function, produces an
intermediate set of key-value pairs.
I've come up with an example... return the url and how frequently
occurs if the url contains a given word. I'm planning to pass a "map"
function an indexed list of urls:
[(1, 'http://www.beer.com'), (2, 'http://www.ban-beer.com'), (3,
'http://www.bbc.co.uk'), (4, 'http://www.beer.com'), (5,
And I have a map function with reverses the keys and values if the
word is present:
return [(url, key) for key, url in stuff if "beer" in url]
This does as expected:
[('http://www.beer.com', 1), ('http://www.ban-beer.com', 2),
What I want to do is now "group" these urls so that repeated urls have
as their "partner" a lsit of indexes. To take a test example of the
method I have in mind:
"""Group occurences of a record together"""
test_list = [('fred', 1), ('jim', 2), ('bill', 3), ('jim', 4)]
grouped_list = [('fred', 1), ('jim', [2, 4]), ('bill' ,3)]
I have seen some code which purports to do this, but I thought it was
ugly, and it seemed to return a generator object rather than a list.
I refactored it for clarity thus:
sorted_list = sorted(stuff)
if not sorted_list:
reduce_key, reduce_list = sorted_list, sorted_list
for key, url in sorted_list[1:]:
if key == reduce_key:
yield (remote_key, remote_list)
remote_key, remote_list = k, [v]
Does this make sense? I would like a clearer, more attractive way of
making the test pass. If this can be done in functional style, even
Your help is always appreciated!
More information about the Tutor