[Python-Dev] "groupby" iterator

Sat Nov 29 00:50:56 EST 2003

> I've always done it like:
> 
>     d = {}
>     for x in sequence:
>         d.setdefault(key(x), []).append(x)
>     # Now d has partitioned sequence by key.  The keys are
>     # available as d.keys(), the associated groups as d.values().
>     # So, e.g.,
>     for key, group in d.iteritems():
>         d[key] = sum(group)
> 
> There's no code duplication, or warts for an empty sequence, which are the
> ugly parts of the non-dict approach.  It doesn't matter here whether the
> elements orginally appear with equal keys all adjacent, and input often
> isn't sorted that way.  When it isn't, not needing to sort first can be a
> major time savings if the sequence is big.  Against it, a dict is a large
> data structure.  I don't think it's ever been a real problem that it
> requires keys to be hashable.

The major downside of this is that this keeps everything in memory.
When that's acceptable, it's a great approach (especially because it
doesn't require sorting).  But often you really want to be able to
handle input of arbitrary size.  For example, suppose you are given a
file with some kind of records, timestamped and maintained in
chronological order (e.g. a log file -- perfect example of data that
won't fit in memory and is already sorted).  You're supposed to output
this for printing, while inserting a header at the start of each day
and a footer at the end of each day with various counts or totals per
day.

> groupby() looks very nice when it applies.

Right. :-)

> > ...
> >   totals = {}
> >   for key, group in groupby(keyfunc, sequence):
> >       totals[key] = sum(group)
> 
> Or
> 
>     totals = dict((key, sum(group))
>                   for key, group in groupby(keyfunc, sequence))
> 
> exploiting generator expressions too.

Nice.  When can we get these? :-)

> [after Raymond wonders about cases where the consumer doesn't
>  iterate over the group generators
> ]
> 
> > I don't think those semantics should be implemented.  You should be
> > required to iterate through each group.
> 
> Brrrr.  Sounds error-prone (hard to explain, and impossible to enforce
> unless the implementation does almost all the work it would need to allow
> groups to get skipped -- if the implementation can detect that a group
> hasn't been fully iterated, then it could almost as easily go on to skip
> over remaining equal keys itself instead of whining about it; but if the
> implementation can't detect it, accidental violations of the requirement
> will be hard to track down).

I take it back after seeing Raymond's implementation -- it's simple
enough to make sure that each group is exhausted before starting the
next group, and this is clearly the "natural" semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)