RE: [Python-Dev] "groupby" iterator

Nov. 29, 2003


      [Alex]
...
However, one cosmetic suggestion: for analogy with list.sorted, why
not let the call be spelled as
    groupby(sequence, key=keyfunc)
?
I realize most itertools take a callable _first_, while, to be able to
name the key-extractor this way, it would have to go second.  I still
think it would be nicer, partly because while sequence could not
possibly default, key _could_ -- and its one obvious default is to an
identity (lambda x: x).  This would let elimination and/or counting of
adjacent duplicates be expressed smoothly (for counting, it would
help to have an ilen that gives the length of a finite iterable argument,
but worst case one can substitute
    def ilen(it):
        for i, _ in enumerate(it): pass
        return i+1
or its inline equivalent).
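One caveat on the quoted ilen: with an empty iterable the loop body never runs, so i is never bound and the return fails with an unbound-variable error. A minimal sketch of a variant that also handles the empty case, assuming only that the iterable is finite:

```python
def ilen(iterable):
    """Count the items in a finite iterable, consuming it in the process."""
    # Each item contributes 1; an empty iterable sums to 0.
    return sum(1 for _ in iterable)


print(ilen(iter('abc')))   # three items
print(ilen(iter([])))      # empty iterable, no error
```

Like the original, this consumes the iterator, so a group can be counted only once.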
Though the argument order makes my stomach churn, the identity function
default is quite nice:

>>> s = 'abracadabra'
>>> # sort s | uniq
>>> [k for k, g in groupby(list.sorted(s))]
['a', 'b', 'c', 'd', 'r']
>>> # sort s | uniq -d
>>> [k for k, g in groupby(list.sorted('abracadabra')) if ilen(g)>1]
['a', 'b', 'r']
>>> # sort s | uniq -c
>>> [(ilen(g), k) for k, g in groupby(list.sorted(s))]
[(5, 'a'), (2, 'b'), (1, 'c'), (1, 'd'), (2, 'r')]
>>> # sort s | uniq -c | sort -rn | head -3
>>> list.sorted([(ilen(g), k) for k, g in groupby(list.sorted(s))], reverse=True)[:3]
[(5, 'a'), (2, 'r'), (2, 'b')]

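For anyone wanting to run these pipelines today, here is a sketch under two assumptions: the built-in sorted() stands in for the list.sorted spelling used in this thread, and groupby is itertools.groupby with its key left at the identity default. The ilen helper is the illustrative one from the discussion, not a library function.

```python
from itertools import groupby


def ilen(iterable):
    """Illustrative helper: count the items of a finite iterable."""
    return sum(1 for _ in iterable)


s = 'abracadabra'

# sort s | uniq
uniq = [k for k, g in groupby(sorted(s))]

# sort s | uniq -d
dups = [k for k, g in groupby(sorted(s)) if ilen(g) > 1]

# sort s | uniq -c
counts = [(ilen(g), k) for k, g in groupby(sorted(s))]

# sort s | uniq -c | sort -rn | head -3
top3 = sorted(counts, reverse=True)[:3]

print(uniq, dups, counts, top3, sep='\n')
```

The sort matters in every case: groupby only merges adjacent equal keys, just as uniq does.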
While extractor functions can be arbitrarily complex, many only fetch a
specific attribute or element number.  Alex's high-speed curry suggests
that it is possible to create a function maker for fast lookups:
students.sort(key=extract('grade'))  # key=lambda r:r.grade
students.sort(key=extract(2))        # key=lambda r:r[2]
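A rough sketch of what such a function maker could look like. Note that extract is only the name floated in this thread, not an existing function; the version here is a hypothetical wrapper over operator.attrgetter and operator.itemgetter from the current standard library, and the Student record is invented for the example.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter


def extract(field):
    """Hypothetical helper: key function from an attribute name or an index."""
    if isinstance(field, str):
        return attrgetter(field)   # like lambda r: r.<field>
    return itemgetter(field)       # like lambda r: r[field]


# Invented sample data; 'grade' is both an attribute and element 2.
Student = namedtuple('Student', ['name', 'age', 'grade'])
students = [Student('ann', 20, 'B'), Student('bob', 19, 'A'), Student('cy', 21, 'C')]

by_attr = sorted(students, key=extract('grade'))
by_index = sorted(students, key=extract(2))

print([st.name for st in by_attr])
print([st.name for st in by_index])
```

Dispatching on the argument type keeps one entry point, at the cost of making string keys for mappings ambiguous; two separate makers avoid that.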
Perhaps we could do this by changing list.sort() and groupby() to take
a string or int as first argument to mean exactly this.  For the
It seems to me that this would be special-casing things, while an extract
function might help in other contexts as well.  E.g., itertools has
several other iterators that take a callable and might use this.
...
But I recommend holding off on this -- the "pure" groupby() has enough
merit without speed hacks, and I find the clarity it provides more
important than possible speed gains.  I expect that the original, ugly
I agree that the case for extract is separate from that for groupby
(although the latter does increase the attractiveness of the former).
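To illustrate why groupby makes a key extractor more attractive, here is a small sketch that groups records by one field. It assumes itertools.groupby with the key passed by keyword and uses operator.itemgetter as the extractor; the (name, grade) records are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

# Invented (name, grade) records for illustration.
records = [('ann', 'B'), ('bob', 'A'), ('cy', 'B'), ('dee', 'A')]

grade = itemgetter(1)

# groupby only merges adjacent items, so sort on the same key first.
by_grade = {
    k: [name for name, _ in g]
    for k, g in groupby(sorted(records, key=grade), key=grade)
}

print(by_grade)
```

The same extractor serves both the sort and the grouping, which is where a named function maker pays off over repeating a lambda.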
Yes, it's clearly a separate issue (and icing on the cake).  I was
thinking extract() would be a nice addition to the operator module, where
everything is basically a lambda-evading speed hack for accessing
intrinsic operations:  operator.add = lambda x,y: x+y
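As a quick illustration of that equivalence, the sketch below feeds both spellings to functools.reduce; the sample values are arbitrary.

```python
import operator
from functools import reduce

values = [1, 2, 3, 4]

# Same fold, once with a lambda and once with the operator module function.
total_with_lambda = reduce(lambda x, y: x + y, values)
total_with_operator = reduce(operator.add, values)

print(total_with_lambda, total_with_operator)
```

The results are identical; the operator version simply avoids a Python-level function call per step.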


Raymond

Raymond Hettinger