[Python-Dev] "groupby" iterator
Guido van Rossum
guido at python.org
Sat Nov 29 00:50:56 EST 2003
> I've always done it like:
>
> d = {}
> for x in sequence:
> d.setdefault(key(x), []).append(x)
> # Now d has partitioned sequence by key. The keys are
> # available as d.keys(), the associated groups as d.values().
> # So, e.g.,
> for key, group in d.iteritems():
> d[key] = sum(group)
>
> There's no code duplication, or warts for an empty sequence, which are the
> ugly parts of the non-dict approach. It doesn't matter here whether the
> elements orginally appear with equal keys all adjacent, and input often
> isn't sorted that way. When it isn't, not needing to sort first can be a
> major time savings if the sequence is big. Against it, a dict is a large
> data structure. I don't think it's ever been a real problem that it
> requires keys to be hashable.
The major downside of this is that this keeps everything in memory.
When that's acceptable, it's a great approach (especially because it
doesn't require sorting). But often you really want to be able to
handle input of arbitrary size. For example, suppose you are given a
file with some kind of records, timestamped and maintained in
chronological order (e.g. a log file -- perfect example of data that
won't fit in memory and is already sorted). You're supposed to output
this for printing, while inserting a header at the start of each day
and a footer at the end of each day with various counts or totals per
day.
> groupby() looks very nice when it applies.
Right. :-)
> > ...
> > totals = {}
> > for key, group in groupby(keyfunc, sequence):
> > totals[key] = sum(group)
>
> Or
>
> totals = dict((key, sum(group))
> for key, group in groupby(keyfunc, sequence))
>
> exploiting generator expressions too.
Nice. When can we get these? :-)
> [after Raymond wonders about cases where the consumer doesn't
> iterate over the group generators
> ]
>
> > I don't think those semantics should be implemented. You should be
> > required to iterate through each group.
>
> Brrrr. Sounds error-prone (hard to explain, and impossible to enforce
> unless the implementation does almost all the work it would need to allow
> groups to get skipped -- if the implementation can detect that a group
> hasn't been fully iterated, then it could almost as easily go on to skip
> over remaining equal keys itself instead of whining about it; but if the
> implementation can't detect it, accidental violations of the requirement
> will be hard to track down).
I take it back after seeing Raymond's implementation -- it's simple
enough to make sure that each group is exhausted before starting the
next group, and this is clearly the "natural" semantics.
--Guido van Rossum (home page: http://www.python.org/~guido/)
More information about the Python-Dev
mailing list