Mailman 3 "groupby" iterator - Python-Dev

"groupby" iterator

older
RE: [Python-Dev] Tutorial: Brief...

Greg Ball

Nov. 27, 2003

3:58 p.m.

Here's a reworking which returns iterators. I had to decide what to do if the user tries to access things out of order; I raise an exception. Anything else would complicate the code quite a lot I think. def groupby(key, iterable): it = iter(iterable) value = it.next() # If there are no items, this takes an early exit oldkey = [key(value)] cache = [value] lock = [] def grouper(): yield cache.pop() for value in it: newkey = key(value) if newkey == oldkey[0]: yield value else: oldkey[0] = newkey cache.append(value) break del lock[0] while 1: if lock: raise LookupError, "groups accessed out of order" if not cache: break lock.append(1) yield grouper() --Greg Ball

Show replies by date

Guido van Rossum

November 2003

4:49 p.m.

...

Thanks! Here's a class version of the same, which strikes me as slightly easier to understand (though probably slower due to all the instance variable access). It may serve as an easier model for a C implementation. I decided not to deal explicitly with out-of-order access; if the caller doesn't play by the rules, some of their groups will be split and jumbled, but each split group will still have matching keys. class GroupBy(object): def __init__(self, key, iterable): self.key = key self.it = iter(iterable) self.todo = [] def __iter__(self): return self def next(self): if self.todo: value, oldkey = self.todo.pop() else: value = self.it.next() # Exit if this raises StopIteration oldkey = self.key(value) return self._grouper(value, oldkey) def _grouper(self, value, oldkey): yield value for value in self.it: newkey = self.key(value) if newkey != oldkey: self.todo.append((value, newkey)) break yield value This is an example of what's so cool about iterators and generators: You can code a particular idiom or mini-pattern (in this case grouping list items) once and apply it to lots of situations. That's of course what all subroutines do, but iterators and generators open up lots of places where previously it wasn't convenient to use a subroutine (you'd have to use lots of lambdas -- or you'd have to have a language supporting anonymous code blocks, which provide a lot of the same power in a different way). --Guido van Rossum (home page: http://www.python.org/~guido/)

Hye-Shik Chang

2:54 p.m.

On Thu, Nov 27, 2003 at 01:49:40PM -0800, Guido van Rossum wrote:

...

Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :) Regards, Hye-Shik

Guido van Rossum

4:46 p.m.

...

Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :)

Wow! Thanks. Let's all remember to take or showers and maybe Python will become the cleanest programming language. :) Raymond, what do you think? I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like totals = {} for key, group in sequence: totals[key] = sum(group) --Guido van Rossum (home page: http://www.python.org/~guido/)

Raymond Hettinger

6:24 p.m.

...

Yes. I recommend taking showers on a regular basis ;-) I'll experiment with groupby() for a few more days and see how it feels. The first impression is that it meets all the criteria for becoming an itertool (iters in, iters out; no unexpected memory use; works well with other tools; not readily constructed from existing tools). At first, the tool seems more special purpose than general purpose. OTOH, it is an excellent solution to a specific class of problems and it makes code much cleaner by avoiding the repeated code block in the non-iterator version.

...

This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone. The implementation would be more complex than the original. If the group is ignored, the outer iterator needs to be smart enough to read through the input iterator until the next group is encountered:

...

In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)). While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups: students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2] Raymond

Guido van Rossum

6:41 p.m.

...

Yes. I recommend taking showers on a regular basis ;-)

Jack Jansen wants me to add: especially right after riding your bicycle to work. And my boss will agree. (Enough for in-jokes that no-one will get. :-)

...

I'll experiment with groupby() for a few more days and see how it feels. The first impression is that it meets all the criteria for becoming an itertool (iters in, iters out; no unexpected memory use; works well with other tools; not readily constructed from existing tools).

Right.

...

At first, the tool seems more special purpose than general purpose. OTOH, it is an excellent solution to a specific class of problems and it makes code much cleaner by avoiding the repeated code block in the non-iterator version.

...
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like

totals = {} for key, group in sequence: totals[key] = sum(group)

Oops, there's a mistake. I meant to say: totals = {} for key, group in groupby(keyfunc, sequence): totals[key] = sum(group)

...

This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone.

I'm not sure to what extent this praise was inspired by my mistake of leaving out the groupby() call.

...

The implementation would be more complex than the original.

To the contrary. It was a microscopic change to either of the Python versions I posted, because the key to be returned is always available at exactly the right time.

...

If the group is ignored, the outer iterator needs to be smart enough to read through the input iterator until the next group is encountered:

...
...
...
names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P'] firstname = lambda n: n.split()[0] names.sort() unique_first_names = [first for first, _ in groupby(firstname, names)] ['Barry' , 'Jack', 'Tim']

I don't think those semantics should be implemented. You should be required to iterate through each group. I was just thinking that returning the key might save the caller cumbersome logic if the key is needed but the inner iterator is also needed. The sum-by-group example would become much uglier: totals = {} for group in groupby(keyfunc, sequence): first = group.next() key = keyfunc(first) totals[key] = first + sum(group, 0)

...

In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)).

Exactly: it was definitely inspired by list.sort(key=).

...

While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:

students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2]

Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the string case I had thought of this already (in my second shower today :-); the int case makes sense too. (Though it may weaken my objection against property('foo') in a different thread. :-) But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly code is usually faster, but in the cases where I've needed this I don't care: either the sequence isn't all that long, or the program doesn't run all that frequently, or it does so much other stuff that the speed gain would be drowned in the noise. --Guido van Rossum (home page: http://www.python.org/~guido/)

Alex Martelli

2:03 a.m.

On Saturday 29 November 2003 12:41 am, Guido van Rossum wrote: ...

...

Can't answer for RH, but, to me, the groupby call looks just fine. However, one cosmetic suggestion: for analogy with list.sorted, why not let the call be spelled as groupby(sequence, key=keyfunc) ? I realize most itertools take a callable _first_, while, to be able to name the key-extractor this way, it would have to go second. I still think it would be nicer, partly because while sequence could not possibly default, key _could_ -- and its one obvious default is to an identity (lambda x: x). This would let elimination and/or counting of adjacent duplicates be expressed smoothly (for counting, it would help to have an ilen that gives the length of a finite iterable argument, but worst case one can substitute def ilen(it): for i, _ in enumerate(it): pass return i+1 or its inline equivalent). Naming the function 'grouped' rather than 'groupby' would probably be better if the callable was the second arg rather than the first.

...

Right, so basically it would have to be nested like: ufn = [ f for g in groupby(firstname, names) for f, _ in g ]

...

That's part of why I'd love to be able to spell key= for this iterator too.

...

It seems to be that this would be specialcasing things while an extract function might help in other contexts as well. E.g., itertools has several other iterators that take a callable and might use this.

...

I agree that the case for extract is separate from that for groupby (although the latter does increase the attractiveness of the former). Alex

Raymond Hettinger

3:26 a.m.

[Alex]

...

However, one cosmetic suggestion: for analogy with list.sorted, why not let the call be spelled as groupby(sequence, key=keyfunc) ?

I realize most itertools take a callable _first_, while, to be able to name the key-extractor this way, it would have to go second. I still think it would be nicer, partly because while sequence could not possibly default, key _could_ -- and its one obvious default is to an identity (lambda x: x). This would let elimination and/or counting of adjacent duplicates be expressed smoothly (for counting, it would help to have an ilen that gives the length of a finite iterable argument, but worst case one can substitute def ilen(it): for i, _ in enumerate(it): pass return i+1 or its inline equivalent).

Though the argument order makes my stomach churn, the identity function default is quite nice:

...

...
...
s = 'abracadabra;

...

...
...
# sort s | uniq [k for k, g in groupby(list.sorted(s))] ['a', 'b', 'c', 'd', 'r']

...

...
...
# sort s | uniq -d [k for k, g in groupby(list.sorted('abracadabra')) if ilen(g)>1] ['a', 'b', 'r']

...

...
...
# sort s | uniq -c [(ilen(g), k) for k, g in groupby(list.sorted(s))] [(5, 'a'), (2, 'b'), (1, 'c'), (1, 'd'), (2, 'r')]

...

...
...
sort s | uniq -c | sort -rn | head -3 list.sorted([(ilen(g), k) for k, g in groupby(list.sorted(s))], reverse=True)[:3] [(5, 'a'), (2, 'r'), (2, 'b')]

...

...
...
While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:

students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2]

Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the

It seems to be that this would be specialcasing things while an extract function might help in other contexts as well. E.g., itertools has several other iterators that take a callable and might use this.

...
But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly

I agree that the case for extract is separate from that for groupby (although the latter does increase the attractiveness of the former).

Yes, it's clearly a separate issue (and icing on the cake). I was thinking extract() would be a nice addition to the operator module where everything is basically a lambda evading speed hack for accessing intrinsic operations: operator.add = lambda x,y: x+y Raymond

Guido van Rossum

1:18 p.m.

Way to go, Raymond. One suggestion: instead of ilen(), I would suggest count(). (Yes, I've been using more SQL lately. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Oren Tirosh

2:31 a.m.

On Fri, Nov 28, 2003 at 06:24:30PM -0500, Raymond Hettinger wrote: ...

...

students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2]

Why should the extract function interpret a string argument as getattr and an int argument as getitem? I find the explicit version more readable: students.sort(key=attrgetter('grade')) # key=lambda r:r.grade students.sort(key=itemgetter(2)) # key=lambda r:[2] students.sort(key=itemgetter('grade')) # key=lambda r:r['grade'] Oren

Alex Martelli

3:57 p.m.

On Sunday 30 November 2003 08:31, Oren Tirosh wrote:

...

I concur: "overloading" extract to mean (the equivalent of) either getattr or getitem depending on the argument type doesn't look good, besides making it unusable to extract some items from dicts. Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand). Alex

Guido van Rossum

4:54 p.m.

...

Agreed. I've seen too many of such "clever" overloading schemes in a past life.

...

Right. --Guido van Rossum (home page: http://www.python.org/~guido/)

Phillip J. Eby

6:48 p.m.

At 09:57 PM 11/30/03 +0100, Alex Martelli wrote:

...

How about: extract(attr='grade') extract(item=2) extract(method='foo') # returns the result of calling 'ob.foo()' And following the pattern of Zope's old "query" package: extract(extract(attr='foo'), attr='bar') # extracts ob.foo.bar extract(extract(item=10), method='spam') # extracts ob[10].spam() i.e., the first (optional) positional argument to extract is a function that's called on the outer extract's argument, and the return value is then used to perform the main extract operation on. IIRC, the Zope query package used __getitem__ instead of __call__ on its instances as a speed hack, but I don't think we should follow that example. :)

Guido van Rossum

7:18 p.m.

...

I'm not sure what the advantage of this is. It seems more typing, more explanation, probably more code to implement (to check for contradicting keyword args).

...

IIRC, the Zope query package used __getitem__ instead of __call__ on its instances as a speed hack, but I don't think we should follow that example. :)

Right. :) --Guido van Rossum (home page: http://www.python.org/~guido/)

Thomas Heller

December 2003

6:43 a.m.

Guido van Rossum <guido@python.org> writes:

...

Hm, couldn't "lambda ob: ob.foo.bar" return exactly the same thing as "extract(extract(attr='foo'), attr='bar')" ? In other words: return specialized C implemented functions for simple lambda expressions? Thomas

Paul Moore

11:02 a.m.

Alex Martelli <aleaxit@yahoo.com> writes:

...

operator.getattr and operator.getitem? Paul. -- This signature intentionally left blank

Jeremy Fincher

4:54 p.m.

On Tuesday 02 December 2003 11:02 am, Paul Moore wrote:

...

I think the functions they're talking about are actually right-curried versions of those functions, like this: def attrgetter(attr): return lambda x: getattr(x, attr) Jeremy

Paul Moore

1:33 p.m.

Jeremy Fincher <fincher.8@osu.edu> writes:

...

I know - I was hoping that the fact that these particular versions were in the operator module might justify using the same name for two mildly-related things. But it's probably more confusing than helpful. Paul -- This signature intentionally left blank

Raymond Hettinger

November 2003

1:12 a.m.

[Guido]

...

Here is an implementation that translates readily into C. It uses Guido's syntax and meets my requirement that bad things don't happen when someone runs the outer iterator independently of the inner iterator. class groupby(object): __slots__ = ('keyfunc', 'it', 'tgtkey', 'currkey', 'currvalue') def __init__(self, key, iterable): NULL = 1+909.9j # In C, use the real NULL self.keyfunc = key self.it = iter(iterable) self.tgtkey = NULL self.currkey = NULL self.currvalue = NULL def __iter__(self): return self def next(self): while self.currkey == self.tgtkey: self.currvalue = self.it.next() # Exit on StopIteration self.currkey = self.keyfunc(self.currvalue) self.tgtkey = self.currkey return (self.currkey, self._grouper(self.currkey)) def _grouper(self, tgtkey): while self.currkey == tgtkey: yield self.currvalue self.currvalue = self.it.next() # Exit on StopIteration self.currkey = self.keyfunc(self.currvalue) import unittest class TestBasicOps(unittest.TestCase): def test_groupby(self): # Check zero length input self.assertEqual([], list(groupby(lambda r:r[0], []))) # Check normal input s = [(0, 10, 20), (0, 11,21), (0,12,21), (1,13,21), (1,14,22), (2,15,22), (3,16,23), (3,17,23)] dup = [] for k, g in groupby(lambda r:r[0], s): for elem in g: self.assertEqual(k, elem[0]) dup.append(elem) self.assertEqual(s, dup) # Check nested case dup = [] for k, g in groupby(lambda r:r[0], s): for ik, ig in groupby(lambda r:r[2], g): for elem in ig: self.assertEqual(k, elem[0]) self.assertEqual(ik, elem[2]) dup.append(elem) self.assertEqual(s, dup) # Check case where inner iterator is not used keys = [] for k, g in groupby(lambda r:r[0], s): keys.append(k) expectedkeys = set([r[0] for r in s]) self.assertEqual(set(keys), expectedkeys) self.assertEqual(len(keys), len(expectedkeys)) suite = unittest.TestSuite() suite.addTest(unittest.makeSuite(TestBasicOps)) unittest.TextTestRunner(verbosity=2).run(suite) Raymond

David Eppstein

3:14 p.m.

In article <000101c3b63f$c7fc4720$e841fea9@oemcomputer>, "Raymond Hettinger" <python@rcn.com> wrote:

...

If I understand your code correctly, running the outer iterator skips over any uniterated values from the inner iterator. I'd be happier with behavior like tee: the inner groups always return the same sequences of items, whether or not the inner iteration happens before the next outer iteration, but the memory cost is only small if you iterate through them in the expected order. E.g., see the "out of order" unit test in the code below. def identity(x): return x def groupby(iterable,key=identity): it = iter(iterable) first = it.next() while 1: group = bygroup(it,first,key) yield key(first),group first = group.nextgroup() class bygroup: """Iterator of items in a single group.""" def __init__(self, iterable, first, key=identity): """Instance variables: - self.lookahead: reversed list of items still to be output - self.groupid: group identity - self.key: func to turn iterated items into group ids - self.it: iterator, or None once we reach another group - self.postfinal: None (only valid once self.it is None) """ self.key = key self.it = iter(iterable) self.lookahead = [first] self.groupid = self.key(first) def __iter__(self): return self def group(self): return self.groupid def next(self): if self.lookahead: return self.lookahead.pop() if self.it is None: raise StopIteration x = self.it.next() if self.key(x) == self.groupid: return x self.postfinal = x self.it = None raise StopIteration def nextgroup(self): """Return first item of next group. Raises StopIteration if there is no next group.""" if self.it is not None: L = list(self) L.reverse() self.lookahead = L if self.it is not None: raise StopIteration return self.postfinal import unittest from sets import Set as set class TestBasicOps(unittest.TestCase): def test_groupby(self): # Check zero length input self.assertEqual([], list(groupby([],lambda r:r[0]))) # Check normal input s = [(0, 10, 20), (0, 11,21), (0,12,21), (1,13,21), (1,14,22), (2,15,22), (3,16,23), (3,17,23)] dup = [] for k, g in groupby(s, lambda r:r[0]): for elem in g: self.assertEqual(k, elem[0]) dup.append(elem) self.assertEqual(s, dup) # Check case where groups are iterated out of order nest1 = [] for k,g in groupby(s, lambda r:r[0]): nest1.append(list(g)) nest2 = [] for k,g in groupby(s, lambda r:r[0]): nest2.append(g) nest2 = [list(g) for g in nest2] self.assertEqual(nest1,nest2) # Check nested case dup = [] for k, g in groupby(s, lambda r:r[0]): for ik, ig in groupby(g, lambda r:r[2]): for elem in ig: self.assertEqual(k, elem[0]) self.assertEqual(ik, elem[2]) dup.append(elem) self.assertEqual(s, dup) # Check case where inner iterator is not used keys = [] for k, g in groupby(s, lambda r:r[0]): keys.append(k) expectedkeys = set([r[0] for r in s]) self.assertEqual(set(keys), expectedkeys) self.assertEqual(len(keys), len(expectedkeys)) suite = unittest.TestSuite() suite.addTest(unittest.makeSuite(TestBasicOps)) unittest.TextTestRunner(verbosity=2).run(suite) -- David Eppstein http://www.ics.uci.edu/~eppstein/ Univ. of California, Irvine, School of Information & Computer Science

Hye-Shik Chang

5:32 p.m.

On Sat, Nov 29, 2003 at 01:12:34AM -0500, Raymond Hettinger wrote:

...

Heh. I love that!

...

I updated my implementation according to your guideline. Please see attachments. Docstrings are still insufficient due to my english shortage. :) Thanks! Regards, Hye-Shik

Guido van Rossum

11:50 a.m.

I lost David Eppstein's post, but I finally know what I want to say in response. David objected to the behavior of the groupby() subiterators to become invalidated when the outer iterator is moved on to the next subiterator. But I don't think there's a good use case for what he wants to do instead: save enough state so that the subiterators can be used in arbitrary order. An application that saves the subiterators for later will end up saving a copy of everything, so it might as well be written so explicitly, e.g.: store = {} for key, group in groupby(keyfunc, iterable): store[key] = list(group) # now access the groups in random order: for key in store: print store[key] I don't think the implementation should be complexified to allow leaving out the explicit list() call in the first loop. --Guido van Rossum (home page: http://www.python.org/~guido/)

David Eppstein

8:01 p.m.

In article <200311301650.hAUGofH28925@c-24-5-183-134.client.comcast.net>, Guido van Rossum <guido@python.org> wrote:

...

I don't have a good explicit use case in mind, but my objective is to be able to use itertools-like functionals without having to pay much attention to which ones iterate through their arguments immediately and which ones defer the iteration until later. -- David Eppstein http://www.ics.uci.edu/~eppstein/ Univ. of California, Irvine, School of Information & Computer Science

Raymond Hettinger

11:35 p.m.

[Guido van Rossum]

...

[David Eppstein]

...

Okay, I've decided on this one. Though David's idea is attractive in its generality, the use cases favor the previous implementation. IOW, there is a reasonable use case for skipping or partially consuming the subiterators (e.g. "sort s | uniq" and "sort s | uniq -d"). For the delinguent subiterators, the user can just convert them to a list if they are going to be needed later: groups = [] for k, g in groupby(seq, keyfunc): groups.append(list(g)) <do something with k> With respect to the principle of least surprise, it is the lesser evil between having a delinquent subiterator turn-up empty or having an itertool unexpectedly fall into a memory intensive mode. The first can be flagged so it won't pass silently. The second is more problematic because it is silent and because it is inconsistent with the memory friendly nature of itertools. Another minor argument against David's version is that the pure python version (which will be included in the docs) is longer and harder to follow. Raymond Hettinger P.S. I'm leaning toward Alex's suggested argument order. Having a default identity function is too attractive to pass up. So the choice is between a style like map(None, s) or something closer to list.sorted(s, key=). Though the latter is not consistent with other itertools, it wins in the beauty department and its similarity with the key= is a accurate, helpful analogy.

David Eppstein

11:42 p.m.

On 11/30/03 11:35 PM -0500 Raymond Hettinger <python@rcn.com> wrote:

...

My implementation will skip or partially consume the subiterators, with only a very temporary additional use of memory, if you don't keep a reference to them. But I can see your arguments about visible vs silent failure modes and code complexity. -- David Eppstein http://www.ics.uci.edu/~eppstein/ Univ. of California, Irvine, School of Information & Computer Science

Guido van Rossum

December 2003

11:18 a.m.

...

Good. Have you thought about the name yet? Will it be grouped() or grouby()? --Guido van Rossum (home page: http://www.python.org/~guido/)

Raymond Hettinger

2:22 p.m.

[David Eppstein]

...

My implementation will skip or partially consume the subiterators, with only a very temporary additional use of memory, if you don't keep a reference to them. But I can see your arguments about visible vs silent failure modes and code complexity.

I'll add note saying that the behavior may change. Then, if I was wrong, we can add support for delinquent subiterators. [Guido]

...

Have you thought about the name yet? Will it be grouped() or grouby()?

This question is still open. The grouped() name fits well with its list.sorted() and reversed(). Also, it is more natural in context of a default identity function. The groupby() name better suggests what's going on under the hood. The strong association with SQL is a plus because the analogy is accurate. I'm leaning away from grouped() because it seems vague and lifeless. Other tools like izip() and the prospective window() and roundrobin() functions could also be said to return groups. [After Guido suggests introducing an SQLlike itertools.count()] I think this belongs in a new module for accumulator functions including average, standard deviation, and such. In the meantime, len(list(grp)) will suffice. That is even more reasonable when multiple accumulation functions are used because a list has to be created anyway: for k, g in groupby(s, key=keyfunc): data = list(g) print s, len(data), sum(data), average(data) [Guido on chained attribute lookups]

...

Anyway, it's all moot -- Raymond just added operator.itemgetter and operator.attrgetter.

The implementation does not preclude someone adding support for chained attribute lookups. IMO, the non-chained version covers the most common use cases simply and directly. For complex structures, lambda already has plenty of versatility: key = lambda r: r.roster[3].personalinfo.physicaladdress.zipcode IIRC, Tim recently yearned for the days when the lack of support for nested structures pushed people towards flatter designs. With him already looking forward to death, we dare not add to his angst. Raymond Hettinger

Hye-Shik Chang

4:38 p.m.

On Fri, Nov 28, 2003 at 01:46:53PM -0800, Guido van Rossum wrote:

...

I think this would be helpful for lazy coders if some function of itertools cover the use case: (`LIMIT' keyword of SQL)

...

prototype of this case is groupby(keyfunc, iterable, limit=None). Either, groupby(5, range(20)) is okay but 5 is not a sort of `key'. :) Hye-Shik

Guido van Rossum

6:57 p.m.

...

I'd rather not weigh down groupby() with more options. If you really want only the first 5 of each group, you can use itertools.islice(): for k, g in groupby(alwaystrue, range(20)): print list(itertools.islice(g, 0, 5)) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum

November 2003

4:49 p.m.

...

Hye-Shik Chang

2:54 p.m.

On Thu, Nov 27, 2003 at 01:49:40PM -0800, Guido van Rossum wrote:

...

Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :) Regards, Hye-Shik

Guido van Rossum

4:46 p.m.

...

Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :)

Raymond Hettinger

6:24 p.m.

...

Guido van Rossum

6:41 p.m.

...

Yes. I recommend taking showers on a regular basis ;-)

Jack Jansen wants me to add: especially right after riding your bicycle to work. And my boss will agree. (Enough for in-jokes that no-one will get. :-)

...

I'll experiment with groupby() for a few more days and see how it feels. The first impression is that it meets all the criteria for becoming an itertool (iters in, iters out; no unexpected memory use; works well with other tools; not readily constructed from existing tools).

Right.

...

At first, the tool seems more special purpose than general purpose. OTOH, it is an excellent solution to a specific class of problems and it makes code much cleaner by avoiding the repeated code block in the non-iterator version.

...
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like

totals = {} for key, group in sequence: totals[key] = sum(group)

Oops, there's a mistake. I meant to say: totals = {} for key, group in groupby(keyfunc, sequence): totals[key] = sum(group)

...

This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone.

I'm not sure to what extent this praise was inspired by my mistake of leaving out the groupby() call.

...

The implementation would be more complex than the original.

To the contrary. It was a microscopic change to either of the Python versions I posted, because the key to be returned is always available at exactly the right time.

...

If the group is ignored, the outer iterator needs to be smart enough to read through the input iterator until the next group is encountered:

...
...
...
names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P'] firstname = lambda n: n.split()[0] names.sort() unique_first_names = [first for first, _ in groupby(firstname, names)] ['Barry' , 'Jack', 'Tim']

...

In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)).

Exactly: it was definitely inspired by list.sort(key=).

...

While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:

students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2]

Alex Martelli

2:03 a.m.

On Saturday 29 November 2003 12:41 am, Guido van Rossum wrote: ...

...

Right, so basically it would have to be nested like: ufn = [ f for g in groupby(firstname, names) for f, _ in g ]

...

That's part of why I'd love to be able to spell key= for this iterator too.

...

I agree that the case for extract is separate from that for groupby (although the latter does increase the attractiveness of the former). Alex

Raymond Hettinger

November 2003

3:26 a.m.

[Alex]

...

However, one cosmetic suggestion: for analogy with list.sorted, why not let the call be spelled as groupby(sequence, key=keyfunc) ?

I realize most itertools take a callable _first_, while, to be able to name the key-extractor this way, it would have to go second. I still think it would be nicer, partly because while sequence could not possibly default, key _could_ -- and its one obvious default is to an identity (lambda x: x). This would let elimination and/or counting of adjacent duplicates be expressed smoothly (for counting, it would help to have an ilen that gives the length of a finite iterable argument, but worst case one can substitute def ilen(it): for i, _ in enumerate(it): pass return i+1 or its inline equivalent).

Though the argument order makes my stomach churn, the identity function default is quite nice:

...

...
...
s = 'abracadabra;

...

...
...
# sort s | uniq [k for k, g in groupby(list.sorted(s))] ['a', 'b', 'c', 'd', 'r']

...

...
...
# sort s | uniq -d [k for k, g in groupby(list.sorted('abracadabra')) if ilen(g)>1] ['a', 'b', 'r']

...

...
...
# sort s | uniq -c [(ilen(g), k) for k, g in groupby(list.sorted(s))] [(5, 'a'), (2, 'b'), (1, 'c'), (1, 'd'), (2, 'r')]

...

...
...
sort s | uniq -c | sort -rn | head -3 list.sorted([(ilen(g), k) for k, g in groupby(list.sorted(s))], reverse=True)[:3] [(5, 'a'), (2, 'r'), (2, 'b')]

...

...
...
While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:

students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2]

Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the

It seems to be that this would be specialcasing things while an extract function might help in other contexts as well. E.g., itertools has several other iterators that take a callable and might use this.

...
But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly

I agree that the case for extract is separate from that for groupby (although the latter does increase the attractiveness of the former).

Guido van Rossum

1:18 p.m.

Way to go, Raymond. One suggestion: instead of ilen(), I would suggest count(). (Yes, I've been using more SQL lately. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Oren Tirosh

2:31 a.m.

On Fri, Nov 28, 2003 at 06:24:30PM -0500, Raymond Hettinger wrote: ...

...

students.sort(key=extract('grade')) # key=lambda r:r.grade students.sort(key=extract(2)) # key=lambda r:[2]

Alex Martelli

3:57 p.m.

On Sunday 30 November 2003 08:31, Oren Tirosh wrote:

...

Guido van Rossum

4:54 p.m.

...

Agreed. I've seen too many of such "clever" overloading schemes in a past life.

...

Right. --Guido van Rossum (home page: http://www.python.org/~guido/)

Phillip J. Eby

6:48 p.m.

At 09:57 PM 11/30/03 +0100, Alex Martelli wrote:

...

Guido van Rossum

November 2003

7:18 p.m.

...

I'm not sure what the advantage of this is. It seems more typing, more explanation, probably more code to implement (to check for contradicting keyword args).

...

IIRC, the Zope query package used __getitem__ instead of __call__ on its instances as a speed hack, but I don't think we should follow that example. :)

Right. :) --Guido van Rossum (home page: http://www.python.org/~guido/)

Thomas Heller

December 2003

6:43 a.m.

Guido van Rossum <guido@python.org> writes:

...

Paul Moore

11:02 a.m.

Alex Martelli <aleaxit@yahoo.com> writes:

...

operator.getattr and operator.getitem? Paul. -- This signature intentionally left blank

Jeremy Fincher

4:54 p.m.

On Tuesday 02 December 2003 11:02 am, Paul Moore wrote:

...

I think the functions they're talking about are actually right-curried versions of those functions, like this: def attrgetter(attr): return lambda x: getattr(x, attr) Jeremy

Paul Moore

1:33 p.m.

Jeremy Fincher <fincher.8@osu.edu> writes:

...

Raymond Hettinger

November 2003

1:12 a.m.

[Guido]

...

David Eppstein

November 2003

8:14 p.m.

In article <000101c3b63f$c7fc4720$e841fea9@oemcomputer>, "Raymond Hettinger" <python@rcn.com> wrote:

...

Hye-Shik Chang

10:32 p.m.

On Sat, Nov 29, 2003 at 01:12:34AM -0500, Raymond Hettinger wrote:

...

Heh. I love that!

...

I updated my implementation according to your guideline. Please see attachments. Docstrings are still insufficient due to my english shortage. :) Thanks! Regards, Hye-Shik

Guido van Rossum

4:50 p.m.

David Eppstein

December 2003

1:01 a.m.

In article <200311301650.hAUGofH28925@c-24-5-183-134.client.comcast.net>, Guido van Rossum <guido@python.org> wrote:

...

Raymond Hettinger

4:35 a.m.

[Guido van Rossum]

...

[David Eppstein]

...

David Eppstein

4:42 a.m.

On 11/30/03 11:35 PM -0500 Raymond Hettinger <python@rcn.com> wrote:

...

Guido van Rossum

December 2003

11:18 a.m.

...

Good. Have you thought about the name yet? Will it be grouped() or grouby()? --Guido van Rossum (home page: http://www.python.org/~guido/)

Raymond Hettinger

2:22 p.m.

[David Eppstein]

...

My implementation will skip or partially consume the subiterators, with only a very temporary additional use of memory, if you don't keep a reference to them. But I can see your arguments about visible vs silent failure modes and code complexity.

I'll add note saying that the behavior may change. Then, if I was wrong, we can add support for delinquent subiterators. [Guido]

...