[Python-Dev] "groupby" iterator

Raymond Hettinger python at rcn.com
Sat Nov 29 03:26:38 EST 2003


[Alex]
> However, one cosmetic suggestion: for analogy with list.sorted, why
> not let the call be spelled as
>     groupby(sequence, key=keyfunc)
> ?
> 
> I realize most itertools take a callable _first_, while, to be able to
> name the key-extractor this way, it would have to go second.  I still
> think it would be nicer, partly because while sequence could not
> possibly default, key _could_ -- and its one obvious default is to an
> identity (lambda x: x).  This would let elimination and/or counting of
> adjacent duplicates be expressed smoothly (for counting, it would
> help to have an ilen that gives the length of a finite iterable
> argument, but worst case one can substitute
>     def ilen(it):
>         for i, _ in enumerate(it): pass
>         return i+1
> or its inline equivalent).


Though the argument order makes my stomach churn, the identity function
default is quite nice:


>>> s = 'abracadabra'

>>> # sort s | uniq
>>> [k for k, g in groupby(list.sorted(s))]
['a', 'b', 'c', 'd', 'r']

>>> # sort s | uniq -d
>>> [k for k, g in groupby(list.sorted('abracadabra')) if ilen(g)>1]
['a', 'b', 'r']

>>> # sort s | uniq -c
>>> [(ilen(g), k) for k, g in groupby(list.sorted(s))]
[(5, 'a'), (2, 'b'), (1, 'c'), (1, 'd'), (2, 'r')]
	
>>> # sort s | uniq -c | sort -rn | head -3
>>> list.sorted([(ilen(g), k) for k, g in groupby(list.sorted(s))],
...             reverse=True)[:3]
[(5, 'a'), (2, 'r'), (2, 'b')]
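
The same four pipelines can be sketched in a form that runs on a current
Python.  This is only an illustration under stated assumptions: it uses
itertools.groupby with its default identity key, the builtin sorted() in
place of the list.sorted spelling used above, and an ilen written with sum()
so that an empty iterable gives 0.  The variable names are made up for the
example.

```python
from itertools import groupby

def ilen(iterable):
    """Count the items in a finite iterable (0 if it is empty)."""
    return sum(1 for _ in iterable)

s = 'abracadabra'

# sort s | uniq
uniq = [k for k, g in groupby(sorted(s))]

# sort s | uniq -d
dups = [k for k, g in groupby(sorted(s)) if ilen(g) > 1]

# sort s | uniq -c
counts = [(ilen(g), k) for k, g in groupby(sorted(s))]

# sort s | uniq -c | sort -rn | head -3
top3 = sorted(counts, reverse=True)[:3]
```

Each group iterator g is consumed once by ilen, which is all these examples
need; a group that has to be traversed twice would need list(g) first.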




> > > While extractor
> > > functions can be arbitrarily complex, many only fetch a specific
> > > attribute or element number.  Alex's high-speed curry suggests
> > > that it is possible to create a function maker for fast lookups:
> > >
> > > students.sort(key=extract('grade'))  # key=lambda r:r.grade
> > > students.sort(key=extract(2))        # key=lambda r:r[2]
> >
> > Perhaps we could do this by changing list.sort() and groupby() to
> > take a string or int as first argument to mean exactly this.  For the
> 
> It seems to me that this would be special-casing things while an
> extract function might help in other contexts as well.  E.g.,
> itertools has several other iterators that take a callable and
> might use this.
> 
> > But I recommend holding off on this -- the "pure" groupby() has
> > enough merit without speed hacks, and I find the clarity it provides
> > more important than possible speed gains.  I expect that the original,
> > ugly
> 
> I agree that the case for extract is separate from that for groupby
> (although the latter does increase the attractiveness of the former).

Yes, it's clearly a separate issue (and icing on the cake).  I was
thinking extract() would be a nice addition to the operator module where
everything is basically a lambda-evading speed hack for accessing
intrinsic operations:  operator.add = lambda x,y: x+y
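
A minimal sketch of what such a function maker might look like, assuming a
string argument means attribute lookup and an int means index lookup.  The
name extract and that dispatch rule are hypothetical, taken from the quoted
proposal; the sketch leans on operator.attrgetter and operator.itemgetter,
which are available in current Python, and the Student records are invented
sample data.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

def extract(spec):
    """Hypothetical function maker: str -> attribute lookup, else index lookup."""
    if isinstance(spec, str):
        return attrgetter(spec)
    return itemgetter(spec)

# Invented sample records; a namedtuple supports both r.grade and r[2].
Student = namedtuple('Student', ['name', 'age', 'grade'])
students = [
    Student('Ann', 20, 'B'),
    Student('Bob', 19, 'A'),
    Student('Cy', 21, 'C'),
]

by_grade = sorted(students, key=extract('grade'))  # like key=lambda r: r.grade
by_index = sorted(students, key=extract(2))        # like key=lambda r: r[2]
```

Keeping extract as a separate helper means any callable-taking tool can use
it, rather than teaching each one to accept a string or int.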



Raymond



