[Tutor] MapReduce

Tue Feb 6 00:58:44 CET 2007

Steve Nelson wrote:
> On 2/5/07, Steve Nelson <sanelson at gmail.com> wrote:
>> What I want to do is now "group" these urls so that repeated urls have
>> as their "partner" a list of indexes.  To take a test example of the
>> method I have in mind:
>>
>> def testGrouper(self):
>>     """Group occurrences of a record together"""
>>     test_list = [('fred', 1), ('jim', 2), ('bill', 3), ('jim', 4)]
>>     grouped_list = [('fred', 1), ('jim', [2, 4]), ('bill', 3)]
>>     self.assertEqual(myGroup(test_list), grouped_list)
> 
> <snip>
> 
>> I would like a clearer, more attractive way of
>> making the test pass.  If this can be done in functional style, even
>> better.
> 
> I now have:
> 
> from itertools import groupby
>
> def myGroup(stuff):
>   return [(key, map(lambda item: item[1], list(group)))
>           for key, group in groupby(sorted(stuff), lambda item: item[0])]
> 
> Not sure I fully understand how groupby objects work, nor what a
> sub-iterator is, though.  But I more or less understand it.

Sub-iterator is just a way to refer to a nested iterator: groupby()
yields (key, group) tuples whose second member is itself an iterator.
Since the object groupby() returns is also an iterator, they call the
nested iterator a sub-iterator.
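
A small sketch of what that looks like in practice. The sample data and the
name seen are made up for illustration, and the input is assumed to be
already sorted by key:

```python
from itertools import groupby

# Hypothetical sample data, already sorted so equal keys are adjacent.
pairs = [('bill', 3), ('fred', 1), ('jim', 2), ('jim', 4)]

seen = []
for key, group in groupby(pairs, lambda item: item[0]):
    # group is the sub-iterator: it walks the consecutive items that
    # share this key. Turn it into a list before the loop advances.
    seen.append((key, [index for _, index in group]))

print(seen)
```

Each group draws from the same underlying input as groupby() itself, so it
has to be consumed before the outer loop moves on to the next key.
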
> 
> I understand I could use itemgetter() instead of the lambda...
> 
> Can anyone clarify?

I have written an explanation of itemgetter and groupby here:
http://personalpages.tds.net/~kent37/blog/arch_m1_2005_12.html#e69
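
As a rough sketch of the itemgetter() variant mentioned above (my own
rewording of the groupby() version, not taken from the linked page;
the name by_key is just for illustration):

```python
from itertools import groupby
from operator import itemgetter

def myGroup(stuff):
    # itemgetter(0) does the same job as lambda item: item[0].
    # Sorting by key puts equal keys next to each other for groupby().
    by_key = sorted(stuff, key=itemgetter(0))
    return [(key, list(map(itemgetter(1), group)))
            for key, group in groupby(by_key, key=itemgetter(0))]
```

Note that every key ends up with a list, even when it occurs only once, and
the result comes back in sorted key order.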

You can also do this operation easily with dicts (not tested!):

def myGroup(stuff):
    groups = {}
    for url, index in stuff:
        groups.setdefault(url, []).append(index)
    return sorted(groups.items())

Or a bit less opaque in Python 2.5, avoiding setdefault():
from collections import defaultdict

def myGroup(stuff):
    groups = defaultdict(list)
    for url, index in stuff:
        groups[url].append(index)
    return sorted(groups.items())
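
Both dict versions return the keys in sorted order and wrap every value in a
list, which is not quite what the test in the original post expects. If the
exact expected value matters, something along these lines might do. This is
only a sketch: the order list and the unwrapping of single values are my
assumptions about what the test wants.

```python
def myGroup(stuff):
    groups = {}
    order = []  # keys in order of first appearance (assumed requirement)
    for key, index in stuff:
        if key not in groups:
            groups[key] = []
            order.append(key)
        groups[key].append(index)
    # Leave a lone index unwrapped, as the test's expected list does.
    return [(key, groups[key] if len(groups[key]) > 1 else groups[key][0])
            for key in order]
```

Mixing bare values and lists in the result makes it harder to consume, so the
all-lists form is probably the better design unless the test is fixed.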

Kent