Re: [Python-Dev] "groupby" iterator

Nov. 28, 2003

      ...
Yes.  I recommend taking showers on a regular basis ;-)
Jack Jansen wants me to add: especially right after riding your
bicycle to work.  And my boss will agree.  (Enough for in-jokes that
no-one will get. :-)
...
I'll experiment with groupby() for a few more days and see how it
feels.  The first impression is that it meets all the criteria for
becoming an itertool (iters in, iters out; no unexpected memory use;
works well with other tools; not readily constructed from existing
tools).
Right.
...
At first, the tool seems more special purpose than general purpose.
OTOH, it is an excellent solution to a specific class of problems and it
makes code much cleaner by avoiding the repeated code block in the
non-iterator version.
...
I would make one change: after looking at another use case, I'd like
to change the outer iterator to produce (key, grouper) tuples.  This
way, you can write things like

    totals = {}
    for key, group in sequence:
        totals[key] = sum(group)

Oops, there's a mistake.  I meant to say:

    totals = {}
    for key, group in groupby(keyfunc, sequence):
        totals[key] = sum(group)
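
For readers who want to run the corrected loop: a minimal sketch,
assuming the groupby() now found in itertools, which takes
(iterable, key) rather than the (keyfunc, sequence) order used in this
thread. The small wrapper, the sample sequence and the keyfunc are
illustrative assumptions, not code from the thread.

```python
from itertools import groupby as _groupby

def groupby(keyfunc, sequence):
    # Thin wrapper (an assumption for illustration) that restores the
    # key-function-first argument order used in this thread.
    return _groupby(sequence, keyfunc)

# Sample data: already arranged so equal keys are adjacent, because
# groups are formed from consecutive runs only.
sequence = [1, 3, 5, 2, 4]
keyfunc = lambda n: n % 2   # 1 for odd, 0 for even

totals = {}
for key, group in groupby(keyfunc, sequence):
    totals[key] = sum(group)

print(totals)
```

If equal keys were not adjacent, a later run would overwrite the
earlier total, so the input usually needs sorting by the same key
first.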
...
This is a much stronger formulation than the original.  It is clear,
succinct, expressive, and less error prone.
I'm not sure to what extent this praise was inspired by my mistake of
leaving out the groupby() call.
...
The implementation would be more complex than the original.
To the contrary.  It was a microscopic change to either of the Python
versions I posted, because the key to be returned is always available
at exactly the right time.
...
If the group is ignored, the outer iterator needs to be smart enough
to read through the input iterator until the next group is encountered:
...
...
...

    names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
    firstname = lambda n: n.split()[0]
    names.sort()
    unique_first_names = [first for first, _ in groupby(firstname, names)]
    ['Barry', 'Jack', 'Tim']

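
The example above can be tried with the groupby() that itertools
provides today; a sketch only, noting that it takes (iterable, key)
instead of the (keyfunc, sequence) order written in this thread. With
that version, a group that is never iterated is simply skipped when the
outer iterator advances.

```python
from itertools import groupby

names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
firstname = lambda n: n.split()[0]

# Sorting puts identical first names next to each other, which is what
# groupby() needs in order to emit one group per first name.
names.sort()

# Each inner group is ignored here; only the key is kept.
unique_first_names = [first for first, _ in groupby(names, key=firstname)]
print(unique_first_names)
```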
I don't think those semantics should be implemented.  You should be
required to iterate through each group.  I was just thinking that
returning the key might save the caller cumbersome logic if the key is
needed but the inner iterator is also needed.  The sum-by-group
example would become much uglier:

    totals = {}
    for group in groupby(keyfunc, sequence):
        first = group.next()
        key = keyfunc(first)
        totals[key] = first + sum(group, 0)
...
In experimenting with groupby(), I am starting to see a need for a high
speed data extractor function.  This need is common to several tools
that take function arguments (like list.sort(key=)).
Exactly: it was definitely inspired by list.sort(key=).
...
While extractor functions can be arbitrarily complex, many only fetch a
specific attribute or element number.  Alex's high-speed curry suggests
that it is possible to create a function maker for fast lookups:

    students.sort(key=extract('grade'))  # key=lambda r: r.grade
    students.sort(key=extract(2))        # key=lambda r: r[2]

Perhaps we could do this by changing list.sort() and groupby() to take
a string or int as first argument to mean exactly this.  For the
string case I had thought of this already (in my second shower today
:-); the int case makes sense too.  (Though it may weaken my objection
against property('foo') in a different thread. :-)
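
A minimal sketch of what such a function maker could look like, built
on operator.attrgetter and operator.itemgetter from the standard
library. The name extract() comes from the suggestion above; its
string-versus-other dispatch, the Student record and the sample data
are assumptions made for illustration.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

def extract(spec):
    # Hypothetical helper: a string means "fetch this attribute",
    # anything else is treated as an index or key for item lookup.
    if isinstance(spec, str):
        return attrgetter(spec)
    return itemgetter(spec)

# Illustrative record type; fields are (name, age, grade).
Student = namedtuple('Student', ['name', 'age', 'grade'])
students = [
    Student('Ann', 19, 'B'),
    Student('Bob', 21, 'A'),
    Student('Cy', 20, 'C'),
]

students.sort(key=extract('grade'))   # same as key=lambda r: r.grade
by_grade = [s.name for s in students]

students.sort(key=extract(1))         # same as key=lambda r: r[1]
by_age = [s.name for s in students]

print(by_grade, by_age)
```

The same key functions could be passed to groupby() unchanged, which
keeps sort() and groupby() themselves free of special cases.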

But I recommend holding off on this -- the "pure" groupby() has enough
merit without speed hacks, and I find the clarity it provides more
important than possible speed gains.  I expect that the original, ugly
code is usually faster, but in the cases where I've needed this I
don't care: either the sequence isn't all that long, or the program
doesn't run all that frequently, or it does so much other stuff that
the speed gain would be drowned in the noise.

--Guido van Rossum (home page: http://www.python.org/~guido/)