[Python-Dev] "groupby" iterator

Sat Nov 29 02:03:22 EST 2003

On Saturday 29 November 2003 12:41 am, Guido van Rossum wrote:
   ...
> > >   totals = {}
> > >   for key, group in sequence:
> > >       totals[key] = sum(group)
>
> Oops, there's a mistake.  I meant to say:
>
>     totals = {}
>     for key, group in groupby(keyfunc, sequence):
>         totals[key] = sum(group)
>
> > This is a much stronger formulation than the original.  It is clear,
> > succinct, expressive, and less error prone.
>
> I'm not sure to what extent this praise was inspired by my mistake of
> leaving out the groupby() call.

Can't answer for RH, but, to me, the groupby call looks just fine.

However, one cosmetic suggestion: for analogy with list.sorted, why
not let the call be spelled as
    groupby(sequence, key=keyfunc)
?

I realize most itertools take a callable _first_, while, to be able to
name the key-extractor this way, it would have to go second.  I still
think it would be nicer, partly because while sequence could not
possibly default, key _could_ -- and its one obvious default is to an
identity (lambda x: x).  This would let elimination and/or counting of
adjacent duplicates be expressed smoothly (for counting, it would
help to have an ilen that gives the length of a finite iterable argument,
but worst case one can substitute
    def ilen(it):
        i = -1    # so that an empty iterable yields 0
        for i, _ in enumerate(it): pass
        return i+1
or its inline equivalent).
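
To make the point concrete, here is a minimal sketch of both uses. It
assumes a groupby that can be called as groupby(iterable, key=None) with
the key defaulting to identity; for the sake of a runnable example it
borrows itertools.groupby from the standard library, which accepts that
spelling. The ilen helper and the sample data are illustrative only.

```python
from itertools import groupby

def ilen(it):
    """Return the length of a finite iterable (this consumes it)."""
    n = 0
    for _ in it:
        n += 1
    return n

data = "aaabccdaa"   # made-up sample input

# Elimination of adjacent duplicates: keep one key per run.
deduped = [k for k, _ in groupby(data)]

# Counting of adjacent duplicates: pair each key with its run length.
counts = [(k, ilen(g)) for k, g in groupby(data)]
```

Note that only _adjacent_ duplicates are merged, so 'a' shows up twice in
both results unless the input is sorted first.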

Naming the function 'grouped' rather than 'groupby' would probably
be better if the callable were the second arg rather than the first.

> > >>> names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
> > >>> firstname = lambda n: n.split()[0]
> > >>> names.sort()
> > >>> unique_first_names = [first for first, _ in groupby(firstname, names)]
> > ['Barry', 'Jack', 'Tim']
>
> I don't think those semantics should be implemented.  You should be
> required to iterate through each group.  I was just thinking that

Right, so basically it would have to be nested like:

ufn = [ f for g in groupby(firstname, names) for f, _ in g ]

> > In experimenting with groupby(), I am starting to see a need for a high
> > speed data extractor function.  This need is common to several tools
> > that take function arguments (like list.sort(key=)).
>
> Exactly: it was definitely inspired by list.sort(key=).

That's part of why I'd love to be able to spell key= for this iterator too.

> > While extractor
> > functions can be arbitrarily complex, many only fetch a specific
> > attribute or element number.  Alex's high-speed curry suggests that it
> > is possible to create a function maker for fast lookups:
> >
> > students.sort(key=extract('grade'))  # key=lambda r:r.grade
> > students.sort(key=extract(2))        # key=lambda r:r[2]
>
> Perhaps we could do this by changing list.sort() and groupby() to take
> a string or int as first argument to mean exactly this.  For the

It seems to me that this would be special-casing things, while an extract
function might help in other contexts as well.  E.g., itertools has several
other iterators that take a callable and might use this.
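
As a rough illustration of why a standalone function maker composes
better than special-casing, here is a pure-Python sketch. The name
extract and its str-vs-int dispatch are assumptions taken from the
quoted example, not an existing API, and a real version would presumably
be coded in C for speed (the standard library's operator.attrgetter and
operator.itemgetter cover the same ground). The Student record is made
up for the example.

```python
from collections import namedtuple

def extract(spec):
    """Hypothetical function maker: a str selects an attribute,
    anything else is used as an index/key."""
    if isinstance(spec, str):
        return lambda r: getattr(r, spec)
    return lambda r: r[spec]

# Illustrative record type; grade is both an attribute and element 2.
Student = namedtuple("Student", "name age grade")
students = [Student("Ann", 20, 88), Student("Bob", 19, 75), Student("Cy", 21, 93)]

by_grade = sorted(students, key=extract("grade"))   # attribute lookup
by_index = sorted(students, key=extract(2))         # element lookup

# The same extractor works anywhere a callable is accepted, not just sort.
grades = list(map(extract("grade"), students))
```

The same callable serves sort, map, and any other tool that takes a
function, with no change to those tools.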

> But I recommend holding off on this -- the "pure" groupby() has enough
> merit without speed hacks, and I find the clarity it provides more
> important than possible speed gains.  I expect that the original, ugly

I agree that the case for extract is separate from that for groupby (although
the latter does increase the attractiveness of the former).

Alex