"groupby" is brilliant!

James Stroud jstroud at ucla.edu
Tue Jun 13 16:16:04 EDT 2006

Frank Millman wrote:
> Hi all
> This is probably old hat to most of you, but for me it was a
> revelation, so I thought I would share it in case someone has a similar
> requirement.
> I had to convert an old program that does a traditional pass through a
> sorted data file, breaking on a change of certain fields, processing
> each row, accumulating various totals, and doing additional processing
> at each break. I am not using a database for this one, as the file
> sizes are not large - a few thousand rows at most. I am using csv
> files, and using the csv module so that each row is nicely formatted
> into a list.
> The traditional approach is quite fiddly, saving the values of the
> various break fields, comparing the values on each row with the saved
> values, and taking action if the values differ. The more break fields
> there are, the fiddlier it gets.
> I was going to do the same in python, but then I vaguely remembered
> reading about 'groupby'. It took a little while to figure it out, but
> once I had cracked it, it transformed the task into one of utter
> simplicity.
> Here is an example. Imagine a transaction file sorted by branch,
> account number, and date, and you want to break on all three.
> -----------------------------
> import csv
> from itertools import groupby
> from operator import itemgetter
> BRN = 0
> ACC = 1
> DATE = 2
> reader = csv.reader(open('trans.csv', 'rb'))
> rows = []
> for row in reader:
>     rows.append(row)
> for brn,brnList in groupby(rows,itemgetter(BRN)):
>     for acc,accList in groupby(brnList,itemgetter(ACC)):
>         for date,dateList in groupby(accList,itemgetter(DATE)):
>             for row in dateList:
>                 [do something with row]
>             [do something on change of date]
>         [do something on change of acc]
>     [do something on change of brn]
> -----------------------------
> Hope someone finds this of interest.
> Frank Millman
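
For anyone who wants to run the quoted pattern without a trans.csv file, here is a minimal sketch. The sample rows and the AMT column are made up for illustration, and the running totals simply stand in for the [do something] placeholders; the real processing is not described in the post.

```python
from itertools import groupby
from operator import itemgetter

BRN, ACC, DATE, AMT = 0, 1, 2, 3  # AMT is a hypothetical amount column

# Made-up sample data, already sorted by branch, account, and date.
# groupby only groups consecutive rows, so the sort order matters.
rows = [
    ["B1", "A1", "2006-06-01", "10"],
    ["B1", "A1", "2006-06-01", "5"],
    ["B1", "A1", "2006-06-02", "7"],
    ["B1", "A2", "2006-06-01", "3"],
    ["B2", "A1", "2006-06-01", "4"],
]

date_totals = {}
acc_totals = {}
brn_totals = {}

for brn, brn_rows in groupby(rows, itemgetter(BRN)):
    brn_total = 0
    for acc, acc_rows in groupby(brn_rows, itemgetter(ACC)):
        acc_total = 0
        for date, date_rows in groupby(acc_rows, itemgetter(DATE)):
            # Process each row of the innermost group.
            date_total = sum(int(row[AMT]) for row in date_rows)
            date_totals[(brn, acc, date)] = date_total  # on change of date
            acc_total += date_total
        acc_totals[(brn, acc)] = acc_total  # on change of account
        brn_total += acc_total
    brn_totals[brn] = brn_total  # on change of branch
```

Each inner group has to be consumed before the outer loop advances, because all the groups share one underlying iterator.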

I'm sure I'm going to get a lot of flak on this list for proposing to
turn nested for-loops into a recursive function, but I couldn't help
myself. This seems simpler to me, but for others it may be difficult
to look at, and these people will undoubtedly complain.

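import csv
from itertools import groupby
from operator import itemgetter

# Python 3 wants text mode with newline='' here (the original 'rb' was for Python 2)
with open('trans.csv', newline='') as f:
    rows = list(csv.reader(f))

def brn_doer(brn):
    pass  # [doing something with brn here]

def acc_doer(acc):
    pass  # [you get the idea]

def date_doer(date):
    pass  # [do something on change of date]

def row_doer(row):
    pass  # [do something with row]

doers = [brn_doer, acc_doer, date_doer, row_doer]

def doit(rows, doers, i=0):
    # The last doer handles individual rows; each earlier one handles a break level.
    if len(doers) == 1:
        for row in rows:
            doers[0](row)
        return
    for key, alist in groupby(rows, itemgetter(i)):
        doit(alist, doers[1:], i + 1)
        doers[0](key)  # runs once the group is exhausted, i.e. on change of field i

doit(rows, doers, 0)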

Now all of those ugly for loops become one recursive function. Bear in
mind, it's not all that 'elegant', but it looks nicer, is more succinct,
abstracts the process, and scales to arbitrary depth. Tragically,
however, it has been generalized, which is likely to raise some hackles
here. And, oh yes, it didn't exactly answer your question (which you
didn't really have). I'm sure I will regret this because, as you will
find, suggesting code on this list with additional utility is somewhat
discouraged by the vociferous few who make a religion out of 'import this'.
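
To make the "arbitrary depth" claim concrete, here is a rough, self-contained sketch of the recursion run on made-up in-memory rows. The names walk, handlers, and events are hypothetical, and the handlers only record when they fire; treat it as one possible shape for the idea, not a tested replacement for the nested loops.

```python
from itertools import groupby
from operator import itemgetter

def walk(rows, handlers, row_handler, level=0):
    """Group rows by column `level`, recurse, then call that level's handler."""
    if level == len(handlers):
        # Past the last break field: process the individual rows.
        for row in rows:
            row_handler(row)
        return
    for key, group in groupby(rows, itemgetter(level)):
        walk(group, handlers, row_handler, level + 1)
        handlers[level](key)  # called once this group is exhausted

events = []

# Made-up rows, sorted on the first two columns (the break fields).
rows = [
    ["B1", "A1", "x"],
    ["B1", "A2", "y"],
    ["B2", "A1", "z"],
]

# One handler per break level; adding a level means adding a handler.
handlers = [
    lambda key: events.append(("brn", key)),
    lambda key: events.append(("acc", key)),
]

walk(rows, handlers, lambda row: events.append(("row", row[2])))
```

The number of break levels is just len(handlers), so a third break field needs a third handler and no new loop.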

Also, I still have no idea what 'groupby' does. It looks interesting
though, thanks for pointing it out.


James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

