[Tutor] MapReduce
Steve Nelson
sanelson at gmail.com
Mon Feb 5 21:54:27 CET 2007
Hello,
I have to give a presentation this week on how the MapReduce (of
Google and Hadoop fame) algorithm works.
I understand how map() works, and how reduce() works, and having read
the google papers, I have an idea of their implementation (which I
must say takes certain liberties with FP-derived terminology).
As I understand it, we have a "map" function which is passed a set of
key/value pairs, which, by applying a user function, produces an
intermediate set of key-value pairs.
I've come up with an example... return the url and how frequently
occurs if the url contains a given word. I'm planning to pass a "map"
function an indexed list of urls:
>>> stuff
[(1, 'http://www.beer.com'), (2, 'http://www.ban-beer.com'), (3,
'http://www.bbc.co.uk'), (4, 'http://www.beer.com'), (5,
'http://wwww.kernel.org')]
And I have a map function with reverses the keys and values if the
word is present:
def myMap(stuff):
return [(url, key) for key, url in stuff if "beer" in url]
This does as expected:
>>> myMap(stuff)
[('http://www.beer.com', 1), ('http://www.ban-beer.com', 2),
('http://www.beer.com', 4)]
What I want to do is now "group" these urls so that repeated urls have
as their "partner" a lsit of indexes. To take a test example of the
method I have in mind:
def testGrouper(self):
"""Group occurences of a record together"""
test_list = [('fred', 1), ('jim', 2), ('bill', 3), ('jim', 4)]
grouped_list = [('fred', 1), ('jim', [2, 4]), ('bill' ,3)]
self.assertEqual(myGroup(test_list), grouped_list)
I have seen some code which purports to do this, but I thought it was
ugly, and it seemed to return a generator object rather than a list.
I refactored it for clarity thus:
def myGroup(stuff):
sorted_list = sorted(stuff)
if not sorted_list:
return
reduce_key, reduce_list = sorted_list[0][0], sorted_list[0][1]
for key, url in sorted_list[1:]:
if key == reduce_key:
reduce_list.append(value)
else:
yield (remote_key, remote_list)
remote_key, remote_list = k, [v]
yield(remote_key, remote_list)
Does this make sense? I would like a clearer, more attractive way of
making the test pass. If this can be done in functional style, even
better.
Your help is always appreciated!
S.
More information about the Tutor
mailing list