"groupby" is brilliant!

James Stroud jstroud at ucla.edu
Tue Jun 13 16:27:58 EDT 2006


James Stroud wrote:
> Frank Millman wrote:
> 
>> Hi all
>>
>> This is probably old hat to most of you, but for me it was a
>> revelation, so I thought I would share it in case someone has a similar
>> requirement.
>>
>> I had to convert an old program that does a traditional pass through a
>> sorted data file, breaking on a change of certain fields, processing
>> each row, accumulating various totals, and doing additional processing
>> at each break. I am not using a database for this one, as the file
>> sizes are not large - a few thousand rows at most. I am using csv
>> files, and using the csv module so that each row is nicely formatted
>> into a list.
>>
>> The traditional approach is quite fiddly, saving the values of the
>> various break fields, comparing the values on each row with the saved
>> values, and taking action if the values differ. The more break fields
>> there are, the fiddlier it gets.
>>
>> I was going to do the same in python, but then I vaguely remembered
>> reading about 'groupby'. It took a little while to figure it out, but
>> once I had cracked it, it transformed the task into one of utter
>> simplicity.
>>
>> Here is an example. Imagine a transaction file sorted by branch,
>> account number, and date, and you want to break on all three.
>>
>> -----------------------------
>> import csv
>> from itertools import groupby
>> from operator import itemgetter
>>
>> BRN = 0
>> ACC = 1
>> DATE = 2
>>
>> reader = csv.reader(open('trans.csv', 'rb'))
>> rows = []
>> for row in reader:
>>     rows.append(row)
>>
>> for brn,brnList in groupby(rows,itemgetter(BRN)):
>>     for acc,accList in groupby(brnList,itemgetter(ACC)):
>>         for date,dateList in groupby(accList,itemgetter(DATE)):
>>             for row in dateList:
>>                 pass  # [do something with row]
>>             # [do something on change of date]
>>         # [do something on change of acc]
>>     # [do something on change of brn]
>> -----------------------------
>>
>> Hope someone finds this of interest.
>>
>> Frank Millman
>>
> 
> I'm sure I'm going to get a lot of flak on this list for proposing to 
> turn nested for-loops into a recursive function, but I couldn't help 
> myself. This seems simpler to me, but others may find it difficult 
> to read, and those people will undoubtedly complain.
> 
> 
> import csv
> from itertools import groupby
> from operator import itemgetter
> 
> reader = csv.reader(open('trans.csv', 'rb'))
> rows = []
> for row in reader:
>     rows.append(row)
> 
> def brn_doer(brn):
>   pass  # [doing something with brn here]
> 
> def acc_doer(acc):
>   pass  # [you get the idea]
> 
> # [etc.]
> 
> doers = [brn_doer, acc_doer, date_doer, row_doer]
> 
> def doit(rows, doers, i=0):
>   for r, alist in groupby(rows, itemgetter(i)):
>     doit(alist, doers[1:], i+1)
>     doers[0](r)
> 
> doit(rows, doers, 0)
> 
> Now all of those ugly for loops become one recursive function. Bear in 
> mind, it's not all that 'elegant', but it looks nicer, is more succinct, 
> abstracts the process, and scales to arbitrary depth. Tragically, 
> however, it has been generalized, which is likely to raise some hackles 
> here. And, oh yes, it didn't answer exactly your question (which you 
> didn't really have). I'm sure I will regret this because, as you will 
> find, suggesting code on this list with additional utility is somewhat 
> discouraged by the vociferous few who make a religion out of 'import this'.
> 
> Also, I still have no idea what 'groupby' does. It looks interesting 
> though, thanks for pointing it out.
> 
> James
> 
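
For what it's worth, groupby only groups runs of consecutive items that share a key, which is why the file has to be sorted on the break fields first. Here is a minimal sketch of that behaviour; the rows are made up for illustration and are not from trans.csv.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows: (branch, account, amount). The last row is out of order on purpose.
rows = [
    ("B1", "A1", 10),
    ("B1", "A2", 20),
    ("B2", "A1", 5),
    ("B1", "A3", 7),
]

# groupby yields (key, iterator) for each run of consecutive equal keys,
# so the unsorted input produces two separate "B1" groups.
groups = [(key, [r[2] for r in grp])
          for key, grp in groupby(rows, itemgetter(0))]
print(groups)

# Sorting on the same key first gives exactly one group per branch.
by_branch = sorted(rows, key=itemgetter(0))
sorted_groups = [(key, sum(r[2] for r in grp))
                 for key, grp in groupby(by_branch, itemgetter(0))]
print(sorted_groups)
```

The first print shows "B1" twice, the second shows one total per branch.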

Forgot to test for stopping condition:


def doit(rows, doers, i=0):
  for r, alist in groupby(rows, itemgetter(i)):
    if len(doers) > 1:  # recurse only while deeper levels remain
      doit(alist, doers[1:], i+1)
    doers[0](r)  # r is the group key for column i, not a whole row
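
For anyone who wants to try this without a trans.csv handy, here is a self-contained sketch of the same recursion, written for Python 3. The rows, the events list, and the make_doer helper are all made up for illustration; real doers would accumulate totals or print subtotals instead of just recording the key they were given.

```python
from itertools import groupby
from operator import itemgetter

# Made-up transactions: branch, account, date, amount.
# Already sorted on the three break fields (columns 0, 1, 2).
rows = [
    ["B1", "A1", "2006-06-01", "10"],
    ["B1", "A1", "2006-06-01", "15"],
    ["B1", "A1", "2006-06-02", "5"],
    ["B1", "A2", "2006-06-01", "20"],
    ["B2", "A1", "2006-06-01", "7"],
]

events = []  # records (level, key) in the order the doers are called

def make_doer(label):
    # Hypothetical handler factory: each doer receives the key for its level.
    def doer(key):
        events.append((label, key))
    return doer

doers = [make_doer("brn"), make_doer("acc"), make_doer("date")]

def doit(rows, doers, i=0):
    for key, group in groupby(rows, itemgetter(i)):
        if len(doers) > 1:
            # Handle the deeper break levels inside this group first.
            doit(group, doers[1:], i + 1)
        # Then act on the break at this level.
        doers[0](key)

doit(rows, doers)
for event in events:
    print(event)
```

Each doer fires after its group has been fully processed, so the date breaks come before the account break, and the account breaks before the branch break, just as in the nested for loops.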

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/


