Here's a reworking which returns iterators. I had to decide what to do if the user tries to access things out of order; I raise an exception. Anything else would complicate the code quite a lot I think.
def groupby(key, iterable):
    it = iter(iterable)
    value = it.next()  # If there are no items, this takes an early exit
    oldkey = [key(value)]
    cache = [value]
    lock = []
    def grouper():
        yield cache.pop()
        for value in it:
            newkey = key(value)
            if newkey == oldkey[0]:
                yield value
            else:
                oldkey[0] = newkey
                cache.append(value)
                break
        del lock[0]
    while 1:
        if lock:
            raise LookupError, "groups accessed out of order"
        if not cache:
            break
        lock.append(1)
        yield grouper()
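For anyone running the thread's code today, here is a hedged Python 3 adaptation of the generator above: it.next() becomes next(it), raise gets call syntax, and PEP 479 means the empty-input "early exit" needs an explicit return instead of letting StopIteration escape the generator.

```python
def groupby(key, iterable):
    it = iter(iterable)
    try:
        value = next(it)
    except StopIteration:
        return          # no items: no groups (the Py2 "early exit")
    oldkey = [key(value)]
    cache = [value]
    lock = []

    def grouper():
        yield cache.pop()
        for value in it:
            if key(value) == oldkey[0]:
                yield value
            else:
                oldkey[0] = key(value)
                cache.append(value)
                break
        del lock[0]     # group fully consumed: release the lock

    while 1:
        if lock:
            raise LookupError("groups accessed out of order")
        if not cache:
            break
        lock.append(1)
        yield grouper()

g = groupby(len, ['a', 'b', 'cc'])
assert list(next(g)) == ['a', 'b']
assert list(next(g)) == ['cc']
```

Consuming each group before advancing the outer iterator keeps the lock list empty; touching the outer iterator early trips the LookupError, exactly as Greg describes.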
--Greg Ball
Here's a reworking which returns iterators. I had to decide what to do if the user tries to access things out of order; I raise an exception. Anything else would complicate the code quite a lot I think.
def groupby(key, iterable):
    it = iter(iterable)
    value = it.next()  # If there are no items, this takes an early exit
    oldkey = [key(value)]
    cache = [value]
    lock = []
    def grouper():
        yield cache.pop()
        for value in it:
            newkey = key(value)
            if newkey == oldkey[0]:
                yield value
            else:
                oldkey[0] = newkey
                cache.append(value)
                break
        del lock[0]
    while 1:
        if lock:
            raise LookupError, "groups accessed out of order"
        if not cache:
            break
        lock.append(1)
        yield grouper()
Thanks! Here's a class version of the same, which strikes me as slightly easier to understand (though probably slower due to all the instance variable access). It may serve as an easier model for a C implementation. I decided not to deal explicitly with out-of-order access; if the caller doesn't play by the rules, some of their groups will be split and jumbled, but each split group will still have matching keys.
class GroupBy(object):
    def __init__(self, key, iterable):
        self.key = key
        self.it = iter(iterable)
        self.todo = []
    def __iter__(self):
        return self
    def next(self):
        if self.todo:
            value, oldkey = self.todo.pop()
        else:
            value = self.it.next()  # Exit if this raises StopIteration
            oldkey = self.key(value)
        return self._grouper(value, oldkey)
    def _grouper(self, value, oldkey):
        yield value
        for value in self.it:
            newkey = self.key(value)
            if newkey != oldkey:
                self.todo.append((value, newkey))
                break
            yield value
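A Python 3 rendering of this class, for readers following along today (next() becomes __next__() and it.next() becomes next(it); everything else is unchanged), together with a small usage sketch:

```python
class GroupBy:
    def __init__(self, key, iterable):
        self.key = key
        self.it = iter(iterable)
        self.todo = []

    def __iter__(self):
        return self

    def __next__(self):
        if self.todo:
            # a pending value from the next group, stashed by _grouper
            value, oldkey = self.todo.pop()
        else:
            value = next(self.it)   # Exit if this raises StopIteration
            oldkey = self.key(value)
        return self._grouper(value, oldkey)

    def _grouper(self, value, oldkey):
        yield value
        for value in self.it:
            newkey = self.key(value)
            if newkey != oldkey:
                self.todo.append((value, newkey))
                break
            yield value

# group a run-length-sorted list by string length
groups = [list(g) for g in GroupBy(len, ['a', 'b', 'cc', 'dd', 'e'])]
assert groups == [['a', 'b'], ['cc', 'dd'], ['e']]
```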
This is an example of what's so cool about iterators and generators: You can code a particular idiom or mini-pattern (in this case grouping list items) once and apply it to lots of situations. That's of course what all subroutines do, but iterators and generators open up lots of places where previously it wasn't convenient to use a subroutine (you'd have to use lots of lambdas -- or you'd have to have a language supporting anonymous code blocks, which provide a lot of the same power in a different way).
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
On Thu, Nov 27, 2003 at 01:49:40PM -0800, Guido van Rossum wrote:
Thanks! Here's a class version of the same, which strikes me as slightly easier to understand (though probably slower due to all the instance variable access). It may serve as an easier model for a C implementation. I decided not to deal explicitly with out-of-order access; if the caller doesn't play by the rules, some of their groups will be split and jumbled, but each split group will still have matching keys.
Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :)
Regards, Hye-Shik
Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :)
Wow! Thanks. Let's all remember to take our showers and maybe Python will become the cleanest programming language. :)
Raymond, what do you think?
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
totals = {}
for key, group in sequence:
    totals[key] = sum(group)
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
Here's yet another implementation for itertoolsmodule.c. (see attachment) I wrote it after the shower (really!) :)
Wow! Thanks. Let's all remember to take our showers and maybe Python will become the cleanest programming language. :)
Raymond, what do you think?
Yes. I recommend taking showers on a regular basis ;-)
I'll experiment with groupby() for a few more days and see how it feels. The first impression is that it meets all the criteria for becoming an itertool (iters in, iters out; no unexpected memory use; works well with other tools; not readily constructed from existing tools).
At first, the tool seems more special purpose than general purpose. OTOH, it is an excellent solution to a specific class of problems and it makes code much cleaner by avoiding the repeated code block in the non-iterator version.
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
totals = {}
for key, group in sequence:
    totals[key] = sum(group)
This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone.
The implementation would be more complex than the original. If the group is ignored, the outer iterator needs to be smart enough to read through the input iterator until the next group is encountered:
names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
firstname = lambda n: n.split()[0]
names.sort()
unique_first_names = [first for first, _ in groupby(firstname, names)]
['Barry', 'Jack', 'Tim']
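This example is essentially what itertools.groupby later shipped with, modulo argument order (the stdlib settled on groupby(iterable, key)); a runnable check against today's stdlib:

```python
from itertools import groupby

names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
firstname = lambda n: n.split()[0]
names.sort()

# ignoring the inner iterator is allowed: advancing the outer iterator
# reads through the input until the next group is encountered
unique_first_names = [first for first, _ in groupby(names, firstname)]
assert unique_first_names == ['Barry', 'Jack', 'Tim']
```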
In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)). While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:
students.sort(key=extract('grade'))   # key=lambda r: r.grade
students.sort(key=extract(2))         # key=lambda r: r[2]
Raymond
Yes. I recommend taking showers on a regular basis ;-)
Jack Jansen wants me to add: especially right after riding your bicycle to work. And my boss will agree. (Enough for in-jokes that no-one will get. :-)
I'll experiment with groupby() for a few more days and see how it feels. The first impression is that it meets all the criteria for becoming an itertool (iters in, iters out; no unexpected memory use; works well with other tools; not readily constructed from existing tools).
Right.
At first, the tool seems more special purpose than general purpose. OTOH, it is an excellent solution to a specific class of problems and it makes code much cleaner by avoiding the repeated code block in the non-iterator version.
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
totals = {}
for key, group in sequence:
    totals[key] = sum(group)
Oops, there's a mistake. I meant to say:
totals = {}
for key, group in groupby(keyfunc, sequence):
    totals[key] = sum(group)
This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone.
I'm not sure to what extent this praise was inspired by my mistake of leaving out the groupby() call.
The implementation would be more complex than the original.
To the contrary. It was a microscopic change to either of the Python versions I posted, because the key to be returned is always available at exactly the right time.
If the group is ignored, the outer iterator needs to be smart enough to read through the input iterator until the next group is encountered:
names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
firstname = lambda n: n.split()[0]
names.sort()
unique_first_names = [first for first, _ in groupby(firstname, names)]
['Barry', 'Jack', 'Tim']
I don't think those semantics should be implemented. You should be required to iterate through each group. I was just thinking that returning the key might save the caller cumbersome logic if the key is needed but the inner iterator is also needed. The sum-by-group example would become much uglier:
totals = {}
for group in groupby(keyfunc, sequence):
    first = group.next()
    key = keyfunc(first)
    totals[key] = first + sum(group, 0)
In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)).
Exactly: it was definitely inspired by list.sort(key=).
While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:
students.sort(key=extract('grade'))   # key=lambda r: r.grade
students.sort(key=extract(2))         # key=lambda r: r[2]
Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the string case I had thought of this already (in my second shower today :-); the int case makes sense too. (Though it may weaken my objection against property('foo') in a different thread. :-)
But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly code is usually faster, but in the cases where I've needed this I don't care: either the sequence isn't all that long, or the program doesn't run all that frequently, or it does so much other stuff that the speed gain would be drowned in the noise.
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
On Saturday 29 November 2003 12:41 am, Guido van Rossum wrote: ...
totals = {}
for key, group in sequence:
    totals[key] = sum(group)
Oops, there's a mistake. I meant to say:
totals = {}
for key, group in groupby(keyfunc, sequence):
    totals[key] = sum(group)
This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone.
I'm not sure to what extent this praise was inspired by my mistake of leaving out the groupby() call.
Can't answer for RH, but, to me, the groupby call looks just fine.
However, one cosmetic suggestion: for analogy with list.sorted, why not let the call be spelled as groupby(sequence, key=keyfunc) ?
I realize most itertools take a callable _first_, while, to be able to name the key-extractor this way, it would have to go second. I still think it would be nicer, partly because while sequence could not possibly default, key _could_ -- and its one obvious default is to an identity (lambda x: x). This would let elimination and/or counting of adjacent duplicates be expressed smoothly (for counting, it would help to have an ilen that gives the length of a finite iterable argument, but worst case one can substitute

    def ilen(it):
        for i, _ in enumerate(it): pass
        return i+1

or its inline equivalent).
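Alex's inline ilen sketch breaks on an empty iterable (i is never bound); a safer sketch of the same helper, with the counter started at zero:

```python
def ilen(it):
    # count items in a finite iterable; enumerate(it, 1) leaves n at the
    # item count, and n stays 0 when the iterable is empty
    n = 0
    for n, _ in enumerate(it, 1):
        pass
    return n

assert ilen(iter('abc')) == 3
assert ilen(iter('')) == 0
```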
Naming the function 'grouped' rather than 'groupby' would probably be better if the callable was the second arg rather than the first.
names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
firstname = lambda n: n.split()[0]
names.sort()
unique_first_names = [first for first, _ in groupby(firstname, names)]
['Barry', 'Jack', 'Tim']
I don't think those semantics should be implemented. You should be required to iterate through each group. I was just thinking that
Right, so basically it would have to be nested like:
ufn = [ f for g in groupby(firstname, names) for f, _ in g ]
In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)).
Exactly: it was definitely inspired by list.sort(key=).
That's part of why I'd love to be able to spell key= for this iterator too.
While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:
students.sort(key=extract('grade'))   # key=lambda r: r.grade
students.sort(key=extract(2))         # key=lambda r: r[2]
Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the
It seems to me that this would be special-casing things while an extract function might help in other contexts as well. E.g., itertools has several other iterators that take a callable and might use this.
But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly
I agree that the case for extract is separate from that for groupby (although the latter does increase the attractiveness of the former).
Alex
[Alex]
However, one cosmetic suggestion: for analogy with list.sorted, why not let the call be spelled as groupby(sequence, key=keyfunc) ?
I realize most itertools take a callable _first_, while, to be able to name the key-extractor this way, it would have to go second. I still think it would be nicer, partly because while sequence could not possibly default, key _could_ -- and its one obvious default is to an identity (lambda x: x). This would let elimination and/or counting of adjacent duplicates be expressed smoothly (for counting, it would help to have an ilen that gives the length of a finite iterable argument, but worst case one can substitute

    def ilen(it):
        for i, _ in enumerate(it): pass
        return i+1

or its inline equivalent).
Though the argument order makes my stomach churn, the identity function default is quite nice:
s = 'abracadabra'

# sort s | uniq
[k for k, g in groupby(list.sorted(s))]
['a', 'b', 'c', 'd', 'r']

# sort s | uniq -d
[k for k, g in groupby(list.sorted('abracadabra')) if ilen(g) > 1]
['a', 'b', 'r']

# sort s | uniq -c
[(ilen(g), k) for k, g in groupby(list.sorted(s))]
[(5, 'a'), (2, 'b'), (1, 'c'), (1, 'd'), (2, 'r')]

# sort s | uniq -c | sort -rn | head -3
list.sorted([(ilen(g), k) for k, g in groupby(list.sorted(s))],
            reverse=True)[:3]
[(5, 'a'), (2, 'r'), (2, 'b')]
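Raymond's uniq pipelines run almost unchanged today; the list.sorted() of the thread became the sorted() builtin, so, with an ilen helper spelled out, a runnable version using only the stdlib:

```python
from itertools import groupby

def ilen(it):
    # length of a finite iterable (the helper assumed above)
    return sum(1 for _ in it)

s = 'abracadabra'

# sort s | uniq
assert [k for k, g in groupby(sorted(s))] == ['a', 'b', 'c', 'd', 'r']

# sort s | uniq -d
assert [k for k, g in groupby(sorted(s)) if ilen(g) > 1] == ['a', 'b', 'r']

# sort s | uniq -c
counts = [(ilen(g), k) for k, g in groupby(sorted(s))]
assert counts == [(5, 'a'), (2, 'b'), (1, 'c'), (1, 'd'), (2, 'r')]

# sort s | uniq -c | sort -rn | head -3
assert sorted(counts, reverse=True)[:3] == [(5, 'a'), (2, 'r'), (2, 'b')]
```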
While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:

students.sort(key=extract('grade'))   # key=lambda r: r.grade
students.sort(key=extract(2))         # key=lambda r: r[2]
Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the
It seems to me that this would be special-casing things while an extract function might help in other contexts as well. E.g., itertools has several other iterators that take a callable and might use this.
But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly
I agree that the case for extract is separate from that for groupby (although the latter does increase the attractiveness of the former).
Yes, it's clearly a separate issue (and icing on the cake). I was thinking extract() would be a nice addition to the operator module where everything is basically a lambda evading speed hack for accessing intrinsic operations: operator.add = lambda x,y: x+y
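Raymond's characterization of the operator module can be checked directly; each function there is a fast C-level stand-in for a trivial lambda:

```python
import operator
from functools import reduce

# operator.add behaves exactly like lambda x, y: x + y, only faster
assert operator.add(2, 3) == (lambda x, y: x + y)(2, 3) == 5

# which is what makes it handy wherever a callable is expected
assert reduce(operator.add, [1, 2, 3, 4]) == 10
```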
Raymond
Way to go, Raymond.
One suggestion: instead of ilen(), I would suggest count(). (Yes, I've been using more SQL lately. :-)
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
On Fri, Nov 28, 2003 at 06:24:30PM -0500, Raymond Hettinger wrote: ...
students.sort(key=extract('grade'))   # key=lambda r: r.grade
students.sort(key=extract(2))         # key=lambda r: r[2]
Why should the extract function interpret a string argument as getattr and an int argument as getitem?
I find the explicit version more readable:
students.sort(key=attrgetter('grade'))    # key=lambda r: r.grade
students.sort(key=itemgetter(2))          # key=lambda r: r[2]
students.sort(key=itemgetter('grade'))    # key=lambda r: r['grade']
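Both spellings did end up in the operator module (in Python 2.4), and since itemgetter accepts any subscript, the dict case works exactly as shown above; a runnable sketch:

```python
from operator import attrgetter, itemgetter

# itemgetter with a dict key, as in Oren's third example
students = [{'name': 'Ann', 'grade': 'B'}, {'name': 'Bob', 'grade': 'A'}]
students.sort(key=itemgetter('grade'))          # key=lambda r: r['grade']
assert [s['name'] for s in students] == ['Bob', 'Ann']

# itemgetter with an index
pairs = [(2, 'b'), (1, 'a'), (3, 'c')]
pairs.sort(key=itemgetter(0))                   # key=lambda r: r[0]
assert pairs == [(1, 'a'), (2, 'b'), (3, 'c')]

# attrgetter with an attribute name
nums = [3 + 4j, 1 + 2j]
nums.sort(key=attrgetter('real'))               # key=lambda n: n.real
assert [n.real for n in nums] == [1.0, 3.0]
```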
Oren
On Sunday 30 November 2003 08:31, Oren Tirosh wrote:
On Fri, Nov 28, 2003 at 06:24:30PM -0500, Raymond Hettinger wrote: ...
students.sort(key=extract('grade'))   # key=lambda r: r.grade
students.sort(key=extract(2))         # key=lambda r: r[2]
Why should the extract function interpret a string argument as getattr and an int argument as getitem?
I find the explicit version more readable:
students.sort(key=attrgetter('grade'))    # key=lambda r: r.grade
students.sort(key=itemgetter(2))          # key=lambda r: r[2]
students.sort(key=itemgetter('grade'))    # key=lambda r: r['grade']
I concur: "overloading" extract to mean (the equivalent of) either getattr or getitem depending on the argument type doesn't look good, besides making it unusable to extract some items from dicts.
Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attrgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand).
Alex
I concur: "overloading" extract to mean (the equivalent of) either getattr or getitem depending on the argument type doesn't look good, besides making it unusable to extract some items from dicts.
Agreed. I've seen too many of such "clever" overloading schemes in a past life.
Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attrgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand).
Right.
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
At 09:57 PM 11/30/03 +0100, Alex Martelli wrote:
students.sort(key=attrgetter('grade'))    # key=lambda r: r.grade
students.sort(key=itemgetter(2))          # key=lambda r: r[2]
students.sort(key=itemgetter('grade'))    # key=lambda r: r['grade']
I concur: "overloading" extract to mean (the equivalent of) either getattr or getitem depending on the argument type doesn't look good, besides making it unusable to extract some items from dicts.
Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attrgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand).
How about:
extract(attr='grade')
extract(item=2)
extract(method='foo')   # returns the result of calling 'ob.foo()'
And following the pattern of Zope's old "query" package:
extract(extract(attr='foo'), attr='bar')    # extracts ob.foo.bar
extract(extract(item=10), method='spam')    # extracts ob[10].spam()
i.e., the first (optional) positional argument to extract is a function that's called on the outer extract's argument, and the return value is then used to perform the main extract operation on.
IIRC, the Zope query package used __getitem__ instead of __call__ on its instances as a speed hack, but I don't think we should follow that example. :)
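A hypothetical sketch of the keyword-style extract being proposed here (this API was never adopted -- attrgetter and itemgetter won out -- so the names and signature below are illustrative only):

```python
def extract(inner=None, attr=None, item=None, method=None):
    """Return a function fetching an attribute, item, or method result.
    `inner`, if given, is applied to the argument first (the Zope-query
    style chaining described above). Hypothetical API, never adopted."""
    def fetch(ob):
        if inner is not None:
            ob = inner(ob)
        if attr is not None:
            return getattr(ob, attr)
        if item is not None:          # `is not None` so index 0 works
            return ob[item]
        return getattr(ob, method)()
    return fetch

assert extract(item=1)(['a', 'b', 'c']) == 'b'
# chained form: extract item 0, then call .upper() on it
assert extract(extract(item=0), method='upper')(['spam']) == 'SPAM'
```

Note that checking for conflicting keyword arguments (as Guido points out below) would add yet more code; this sketch simply picks the first applicable operation.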
How about:
extract(attr='grade')
extract(item=2)
extract(method='foo')   # returns the result of calling 'ob.foo()'
And following the pattern of Zope's old "query" package:
extract(extract(attr='foo'), attr='bar')    # extracts ob.foo.bar
extract(extract(item=10), method='spam')    # extracts ob[10].spam()
i.e., the first (optional) positional argument to extract is a function that's called on the outer extract's argument, and the return value is then used to perform the main extract operation on.
I'm not sure what the advantage of this is. It seems more typing, more explanation, probably more code to implement (to check for contradicting keyword args).
IIRC, the Zope query package used __getitem__ instead of __call__ on its instances as a speed hack, but I don't think we should follow that example. :)
Right. :)
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
Guido van Rossum guido@python.org writes:
How about:
extract(attr='grade')
extract(item=2)
extract(method='foo')   # returns the result of calling 'ob.foo()'
And following the pattern of Zope's old "query" package:
extract(extract(attr='foo'), attr='bar')    # extracts ob.foo.bar
extract(extract(item=10), method='spam')    # extracts ob[10].spam()
i.e., the first (optional) positional argument to extract is a function that's called on the outer extract's argument, and the return value is then used to perform the main extract operation on.
I'm not sure what the advantage of this is. It seems more typing, more explanation, probably more code to implement (to check for contradicting keyword args).
Hm, couldn't "lambda ob: ob.foo.bar" return exactly the same thing as
"extract(extract(attr='foo'), attr='bar')"
? In other words: return specialized C implemented functions for simple lambda expressions?
Thomas
Alex Martelli aleaxit@yahoo.com writes:
Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attrgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand).
operator.getattr and operator.getitem?
Paul.
On Tuesday 02 December 2003 11:02 am, Paul Moore wrote:
Alex Martelli aleaxit@yahoo.com writes:
Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attrgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand).
operator.getattr and operator.getitem?
I think the functions they're talking about are actually right-curried versions of those functions, like this:
def attrgetter(attr):
    return lambda x: getattr(x, attr)
Jeremy
Jeremy Fincher fincher.8@osu.edu writes:
On Tuesday 02 December 2003 11:02 am, Paul Moore wrote:
Alex Martelli aleaxit@yahoo.com writes:
Since these functions or types are going to be in operator, I think we can afford to "spend" two names to distinguish functionality (even though attrgetter and itemgetter look nowhere as neat as extract -- I don't have better suggestions offhand).
operator.getattr and operator.getitem?
I think the functions they're talking about are actually right-curried versions of those functions, like this:
I know - I was hoping that the fact that these particular versions were in the operator module might justify using the same name for two mildly-related things.
But it's probably more confusing than helpful.
Paul
[Guido]
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
totals = {}
for key, group in sequence:
    totals[key] = sum(group)
Here is an implementation that translates readily into C. It uses Guido's syntax and meets my requirement that bad things don't happen when someone runs the outer iterator independently of the inner iterator.
class groupby(object):
    __slots__ = ('keyfunc', 'it', 'tgtkey', 'currkey', 'currvalue')
    def __init__(self, key, iterable):
        NULL = 1+909.9j  # In C, use the real NULL
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = NULL
        self.currkey = NULL
        self.currvalue = NULL
    def __iter__(self):
        return self
    def next(self):
        while self.currkey == self.tgtkey:
            self.currvalue = self.it.next()  # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.currkey))
    def _grouper(self, tgtkey):
        while self.currkey == tgtkey:
            yield self.currvalue
            self.currvalue = self.it.next()  # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
import unittest
class TestBasicOps(unittest.TestCase):
    def test_groupby(self):
        # Check zero length input
        self.assertEqual([], list(groupby(lambda r: r[0], [])))

        # Check normal input
        s = [(0, 10, 20), (0, 11, 21), (0, 12, 21), (1, 13, 21),
             (1, 14, 22), (2, 15, 22), (3, 16, 23), (3, 17, 23)]
        dup = []
        for k, g in groupby(lambda r: r[0], s):
            for elem in g:
                self.assertEqual(k, elem[0])
                dup.append(elem)
        self.assertEqual(s, dup)

        # Check nested case
        dup = []
        for k, g in groupby(lambda r: r[0], s):
            for ik, ig in groupby(lambda r: r[2], g):
                for elem in ig:
                    self.assertEqual(k, elem[0])
                    self.assertEqual(ik, elem[2])
                    dup.append(elem)
        self.assertEqual(s, dup)

        # Check case where inner iterator is not used
        keys = []
        for k, g in groupby(lambda r: r[0], s):
            keys.append(k)
        expectedkeys = set([r[0] for r in s])
        self.assertEqual(set(keys), expectedkeys)
        self.assertEqual(len(keys), len(expectedkeys))

suite = unittest.TestSuite()
suite.addTest(unittest.makeSuite(TestBasicOps))
unittest.TextTestRunner(verbosity=2).run(suite)
Raymond
In article 000101c3b63f$c7fc4720$e841fea9@oemcomputer, "Raymond Hettinger" python@rcn.com wrote:
Here is an implementation that translates readily into C. It uses Guido's syntax and meets my requirement that bad things don't happen when someone runs the outer iterator independently of the inner iterator.
If I understand your code correctly, running the outer iterator skips over any uniterated values from the inner iterator. I'd be happier with behavior like tee: the inner groups always return the same sequences of items, whether or not the inner iteration happens before the next outer iteration, but the memory cost is only small if you iterate through them in the expected order. E.g., see the "out of order" unit test in the code below.
def identity(x): return x

def groupby(iterable, key=identity):
    it = iter(iterable)
    first = it.next()
    while 1:
        group = bygroup(it, first, key)
        yield key(first), group
        first = group.nextgroup()
class bygroup:
    """Iterator of items in a single group."""

    def __init__(self, iterable, first, key=identity):
        """Instance variables:
        - self.lookahead: reversed list of items still to be output
        - self.groupid: group identity
        - self.key: func to turn iterated items into group ids
        - self.it: iterator, or None once we reach another group
        - self.postfinal: None (only valid once self.it is None)
        """
        self.key = key
        self.it = iter(iterable)
        self.lookahead = [first]
        self.groupid = self.key(first)

    def __iter__(self):
        return self

    def group(self):
        return self.groupid

    def next(self):
        if self.lookahead:
            return self.lookahead.pop()
        if self.it is None:
            raise StopIteration
        x = self.it.next()
        if self.key(x) == self.groupid:
            return x
        self.postfinal = x
        self.it = None
        raise StopIteration

    def nextgroup(self):
        """Return first item of next group.
        Raises StopIteration if there is no next group."""
        if self.it is not None:
            L = list(self)
            L.reverse()
            self.lookahead = L
        if self.it is not None:
            raise StopIteration
        return self.postfinal
import unittest
from sets import Set as set
class TestBasicOps(unittest.TestCase):
    def test_groupby(self):
        # Check zero length input
        self.assertEqual([], list(groupby([], lambda r: r[0])))

        # Check normal input
        s = [(0, 10, 20), (0, 11, 21), (0, 12, 21), (1, 13, 21),
             (1, 14, 22), (2, 15, 22), (3, 16, 23), (3, 17, 23)]
        dup = []
        for k, g in groupby(s, lambda r: r[0]):
            for elem in g:
                self.assertEqual(k, elem[0])
                dup.append(elem)
        self.assertEqual(s, dup)

        # Check case where groups are iterated out of order
        nest1 = []
        for k, g in groupby(s, lambda r: r[0]):
            nest1.append(list(g))
        nest2 = []
        for k, g in groupby(s, lambda r: r[0]):
            nest2.append(g)
        nest2 = [list(g) for g in nest2]
        self.assertEqual(nest1, nest2)

        # Check nested case
        dup = []
        for k, g in groupby(s, lambda r: r[0]):
            for ik, ig in groupby(g, lambda r: r[2]):
                for elem in ig:
                    self.assertEqual(k, elem[0])
                    self.assertEqual(ik, elem[2])
                    dup.append(elem)
        self.assertEqual(s, dup)

        # Check case where inner iterator is not used
        keys = []
        for k, g in groupby(s, lambda r: r[0]):
            keys.append(k)
        expectedkeys = set([r[0] for r in s])
        self.assertEqual(set(keys), expectedkeys)
        self.assertEqual(len(keys), len(expectedkeys))

suite = unittest.TestSuite()
suite.addTest(unittest.makeSuite(TestBasicOps))
unittest.TextTestRunner(verbosity=2).run(suite)
On Sat, Nov 29, 2003 at 01:12:34AM -0500, Raymond Hettinger wrote:
[Guido]
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
totals = {}
for key, group in groupby(sequence):
    totals[key] = sum(group)
Heh. I love that!
Here is an implementation that translates readily into C. It uses Guido's syntax and meets my requirement that bad things don't happen when someone runs the outer iterator independently of the inner iterator.
I updated my implementation according to your guideline. Please see attachments. Docstrings are still insufficient due to my English shortage. :)
Thanks!
Regards, Hye-Shik
I lost David Eppstein's post, but I finally know what I want to say in response. David objected to the behavior of the groupby() subiterators to become invalidated when the outer iterator is moved on to the next subiterator. But I don't think there's a good use case for what he wants to do instead: save enough state so that the subiterators can be used in arbitrary order. An application that saves the subiterators for later will end up saving a copy of everything, so it might as well be written so explicitly, e.g.:
store = {}
for key, group in groupby(keyfunc, iterable):
    store[key] = list(group)
# now access the groups in random order:
for key in store:
    print store[key]
I don't think the implementation should be complexified to allow leaving out the explicit list() call in the first loop.
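Guido's store idiom maps directly onto the groupby that eventually shipped in itertools (with the argument order that was settled on, (iterable, key)); a runnable sketch:

```python
from itertools import groupby

words = ['apple', 'avocado', 'banana', 'cherry', 'clementine']  # pre-sorted
store = {}
for key, group in groupby(words, key=lambda w: w[0]):
    store[key] = list(group)    # the explicit list() call Guido asks for

# now access the groups in random order:
assert store['c'] == ['cherry', 'clementine']
assert sorted(store) == ['a', 'b', 'c']
```

Without the list() call, each group would be invalidated as soon as the outer loop advances, which is exactly the behavior being defended here.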
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
In article 200311301650.hAUGofH28925@c-24-5-183-134.client.comcast.net, Guido van Rossum guido@python.org wrote:
But I don't think there's a good use case for what he wants to do instead: save enough state so that the subiterators can be used in arbitrary order. An application that saves the subiterators for later will end up saving a copy of everything, so it might as well be written so explicitly
I don't have a good explicit use case in mind, but my objective is to be able to use itertools-like functionals without having to pay much attention to which ones iterate through their arguments immediately and which ones defer the iteration until later.
[Guido van Rossum]
But I don't think there's a good use case for what he wants to do instead: save enough state so that the subiterators can be used in arbitrary order. An application that saves the subiterators for later will end up saving a copy of everything, so it might as well be written so explicitly
[David Eppstein]
I don't have a good explicit use case in mind, but my objective is to be able to use itertools-like functionals without having to pay much attention to which ones iterate through their arguments immediately and which ones defer the iteration until later.
Okay, I've decided on this one.
Though David's idea is attractive in its generality, the use cases favor the previous implementation. IOW, there is a reasonable use case for skipping or partially consuming the subiterators (e.g. "sort s | uniq" and "sort s | uniq -d"). For the delinquent subiterators, the user can just convert them to a list if they are going to be needed later:
groups = []
for k, g in groupby(seq, keyfunc):
    groups.append(list(g))
    <do something with k>
With respect to the principle of least surprise, it is the lesser evil between having a delinquent subiterator turn up empty or having an itertool unexpectedly fall into a memory-intensive mode.
The first can be flagged so it won't pass silently. The second is more problematic because it is silent and because it is inconsistent with the memory friendly nature of itertools.
Another minor argument against David's version is that the pure python version (which will be included in the docs) is longer and harder to follow.
Raymond Hettinger
P.S. I'm leaning toward Alex's suggested argument order. Having a default identity function is too attractive to pass up. So the choice is between a style like map(None, s) or something closer to list.sorted(s, key=). Though the latter is not consistent with other itertools, it wins in the beauty department, and its similarity with the key= argument is an accurate, helpful analogy.
On 11/30/03 11:35 PM -0500 Raymond Hettinger python@rcn.com wrote:
Okay, I've decided on this one.
Though David's idea is attractive in its generality, the use cases favor the previous implementation. IOW, there is a reasonable use case for skipping or partially consuming the subiterators (e.g. "sort s | uniq" and "sort s | uniq -d"). For the delinquent subiterators, the user can just convert them to a list if they are going to be needed later:
My implementation will skip or partially consume the subiterators, with only a very temporary additional use of memory, if you don't keep a reference to them. But I can see your arguments about visible vs silent failure modes and code complexity.
P.S. I'm leaning toward Alex's suggested argument order. Having a default identity function is too attractive to pass up. So the choice is between a style like map(None, s) or something closer to list.sorted(s, key=). Though the latter is not consistent with other itertools, it wins in the beauty department, and its similarity to the key= argument is an accurate, helpful analogy.
Good. Have you thought about the name yet? Will it be grouped() or groupby()?
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
[David Eppstein]
My implementation will skip or partially consume the subiterators, with only a very temporary additional use of memory, if you don't keep a reference to them. But I can see your arguments about visible vs. silent failure modes and code complexity.
I'll add a note saying that the behavior may change. Then, if I was wrong, we can add support for delinquent subiterators.
[Guido]
Have you thought about the name yet? Will it be grouped() or groupby()?
This question is still open.
The grouped() name fits well with list.sorted() and reversed(). Also, it is more natural in the context of a default identity function.
The groupby() name better suggests what's going on under the hood. The strong association with SQL is a plus because the analogy is accurate.
I'm leaning away from grouped() because it seems vague and lifeless. Other tools like izip() and the prospective window() and roundrobin() functions could also be said to return groups.
[After Guido suggests introducing an SQL-like count() for itertools] I think this belongs in a new module for accumulator functions, including average, standard deviation, and such. In the meantime, len(list(grp)) will suffice.
That is even more reasonable when multiple accumulation functions are used because a list has to be created anyway:
for k, g in groupby(s, key=keyfunc):
    data = list(g)
    print k, len(data), sum(data), average(data)
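With the groupby() that eventually shipped, that multi-accumulator pattern looks like this (modern Python 3 spelling; statistics.mean stands in for the hypothetical average() function):

```python
from itertools import groupby
from statistics import mean

s = [1, 1, 2, 2, 2, 5]
rows = []
for k, g in groupby(s):
    # The group must be materialized once, then every accumulator
    # can run over the same list.
    data = list(g)
    rows.append((k, len(data), sum(data), mean(data)))

print(rows)  # [(1, 2, 2, 1), (2, 3, 6, 2), (5, 1, 5, 5)]
```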
[Guido on chained attribute lookups]
Anyway, it's all moot -- Raymond just added operator.itemgetter and operator.attrgetter.
The implementation does not preclude someone adding support for chained attribute lookups. IMO, the non-chained version covers the most common use cases simply and directly. For complex structures, lambda already has plenty of versatility:
key = lambda r: r.roster[3].personalinfo.physicaladdress.zipcode
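As it turned out, operator.attrgetter did later grow support for chained (dotted) attribute lookups, in Python 2.6. A sketch with illustrative class names (only the dotted-name feature itself is from the library; everything else here is made up for the example):

```python
from operator import attrgetter

class Address:
    zipcode = '02134'

class Record:
    address = Address()

# attrgetter accepts dotted names for chained lookups (Python 2.6+)
getzip = attrgetter('address.zipcode')
print(getzip(Record()))  # 02134
```

Indexing steps like roster[3] still need a lambda or operator.itemgetter, so the lambda form above remains the fully general spelling.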
IIRC, Tim recently yearned for the days when the lack of support for nested structures pushed people towards flatter designs. With him already looking forward to death, we dare not add to his angst.
Raymond Hettinger
On Fri, Nov 28, 2003 at 01:46:53PM -0800, Guido van Rossum wrote:
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
totals = {}
for key, group in groupby(keyfunc, sequence):
    totals[key] = sum(group)
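With the signature that finally shipped (iterable first, key= second), the same pattern reads as follows (modern Python 3 spelling; the sample data is illustrative):

```python
from itertools import groupby
from operator import itemgetter

sales = [('apples', 3), ('apples', 4), ('pears', 2)]
totals = {}
for key, group in groupby(sales, key=itemgetter(0)):
    # Each (key, grouper) tuple lets the body consume the group lazily.
    totals[key] = sum(qty for _, qty in group)

print(totals)  # {'apples': 7, 'pears': 2}
```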
I think this would be helpful for lazy coders if some itertools function covered this use case (SQL's `LIMIT' keyword):
>>> from groupby import groupby
>>> alwaystrue = lambda n: True
>>> for k, g in groupby(alwaystrue, range(20), 5):
...     print list(g)
...
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11, 12, 13, 14]
[15, 16, 17, 18, 19]
The prototype for this case would be groupby(keyfunc, iterable, limit=None). Alternatively, groupby(5, range(20)) would be okay, but 5 is not a sort of `key'. :)
Hye-Shik
I think this would be helpful for lazy coders if some itertools function covered this use case (SQL's `LIMIT' keyword):
>>> from groupby import groupby
>>> alwaystrue = lambda n: True
>>> for k, g in groupby(alwaystrue, range(20), 5):
...     print list(g)
...
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11, 12, 13, 14]
[15, 16, 17, 18, 19]
The prototype for this case would be groupby(keyfunc, iterable, limit=None). Alternatively, groupby(5, range(20)) would be okay, but 5 is not a sort of `key'. :)
I'd rather not weigh down groupby() with more options. If you really want only the first 5 of each group, you can use itertools.islice():
for k, g in groupby(alwaystrue, range(20)):
    print list(itertools.islice(g, 0, 5))
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)
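A closing footnote on this exchange: in the groupby() that shipped, a constant key produces one long group, so the islice trick above would limit only that single group rather than chunk the whole sequence. Chunking range(20) into fives can instead use a key that changes every five items (modern Python 3 spelling):

```python
from itertools import groupby

# n // 5 is constant within each run of five, so each run becomes
# its own group.
chunks = [list(g) for _, g in groupby(range(20), key=lambda n: n // 5)]
print(chunks)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], ...]
```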