[Python-Dev] "groupby" iterator

Guido van Rossum guido at python.org
Fri Nov 28 18:41:58 EST 2003

> Yes.  I recommend taking showers on a regular basis ;-)

Jack Jansen wants me to add: especially right after riding your
bicycle to work.  And my boss will agree.  (Enough for in-jokes that
no-one will get. :-)

> I'll experiment with groupby() for a few more days and see how it
> feels.  The first impression is that it meets all the criteria for
> becoming an itertool (iters in, iters out; no unexpected memory use;
> works well with other tools; not readily constructed from existing
> tools).


> At first, the tool seems more special purpose than general purpose.
> OTOH, it is an excellent solution to a specific class of problems and it
> makes code much cleaner by avoiding the repeated code block in the
> non-iterator version.
> > I would make one change: after looking at another use case, I'd like
> > to change the outer iterator to produce (key, grouper) tuples.  This
> > way, you can write things like
> > 
> >   totals = {}
> >   for key, group in sequence:
> >       totals[key] = sum(group)

Oops, there's a mistake.  I meant to say:

    totals = {}
    for key, group in groupby(keyfunc, sequence):
        totals[key] = sum(group)

> This is a much stronger formulation than the original.  It is clear,
> succinct, expressive, and less error prone.

I'm not sure to what extent this praise was inspired by my mistake of
leaving out the groupby() call.

> The implementation would be more complex than the original.

To the contrary.  It was a microscopic change to either of the Python
versions I posted, because the key to be returned is always available
at exactly the right time.

> If the
> group is ignored, the outer iterator needs to be smart enough to read
> through the input iterator until the next group is encountered:
> >>> names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
> >>> firstname = lambda n: n.split()[0]
> >>> names.sort()
> >>> unique_first_names = [first for first, _ in groupby(firstname,
> names)]
> ['Barry' , 'Jack', 'Tim']

I don't think those semantics should be implemented.  You should be
required to iterate through each group.  I was just thinking that
returning the key might save the caller cumbersome logic if the key is
needed but the inner iterator is also needed.  The sum-by-group
example would become much uglier:

    totals = {}
    for group in groupby(keyfunc, sequence):
        first = group.next()
        key = keyfunc(first)
        totals[key] = first + sum(group, 0)

> In experimenting with groupby(), I am starting to see a need for a high
> speed data extractor function.  This need is common to several tools
> that take function arguments (like list.sort(key=)).

Exactly: it was definitely inspired by list.sort(key=).

> While extractor
> functions can be arbitrarily complex, many only fetch a specific
> attribute or element number.  Alex's high-speed curry suggests that it
> is possible to create a function maker for fast lookups:
> students.sort(key=extract('grade'))  # key=lambda r:r.grade
> students.sort(key=extract(2))        # key=lambda r:[2]

Perhaps we could do this by changing list.sort() and groupby() to take
a string or int as first argument to mean exactly this.  For the
string case I had thought of this already (in my second shower today
:-); the int case makes sense too.  (Though it may weaken my objection
against property('foo') in a different thread. :-)

But I recommend holding off on this -- the "pure" groupby() has enough
merit without speed hacks, and I find the clarity it provides more
important than possible speed gains.  I expect that the original, ugly
code is usually faster, but in the cases where I've needed this I
don't care: either the sequence isn't all that long, or the program
doesn't run all that frequently, or it does so much other stuff that
the speed gain would be drowned in the noise.

--Guido van Rossum (home page: http://www.python.org/~guido/)

More information about the Python-Dev mailing list