
Yes. I recommend taking showers on a regular basis ;-)
Jack Jansen wants me to add: especially right after riding your bicycle to work. And my boss will agree. (Enough for in-jokes that no-one will get. :-)
I'll experiment with groupby() for a few more days and see how it feels. The first impression is that it meets all the criteria for becoming an itertool (iters in, iters out; no unexpected memory use; works well with other tools; not readily constructed from existing tools).
Right.
At first, the tool seems more special purpose than general purpose. OTOH, it is an excellent solution to a specific class of problems and it makes code much cleaner by avoiding the repeated code block in the non-iterator version.
I would make one change: after looking at another use case, I'd like to change the outer iterator to produce (key, grouper) tuples. This way, you can write things like
    totals = {}
    for key, group in sequence:
        totals[key] = sum(group)
Oops, there's a mistake. I meant to say:

    totals = {}
    for key, group in groupby(keyfunc, sequence):
        totals[key] = sum(group)
This is a much stronger formulation than the original. It is clear, succinct, expressive, and less error prone.
I'm not sure to what extent this praise was inspired by my mistake of leaving out the groupby() call.
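For readers who want to run the corrected loop: a minimal sketch using the itertools.groupby that is in today's standard library. Note that it takes the iterable first and the key function second, the reverse of the groupby(keyfunc, sequence) order used in this thread. The sample data and the decade-based keyfunc are made up for illustration. groupby only groups consecutive items, so the input is assumed to be sorted by key already.

```python
from itertools import groupby

# Hypothetical sample data, already ordered so equal keys are adjacent.
sequence = [1, 3, 12, 15, 18, 21]

def keyfunc(n):
    # Illustrative key: the "decade" a number falls in.
    return n // 10

totals = {}
for key, group in groupby(sequence, key=keyfunc):
    # group is an iterator over the consecutive items sharing this key.
    totals[key] = sum(group)

print(totals)
```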
The implementation would be more complex than the original.
To the contrary. It was a microscopic change to either of the Python versions I posted, because the key to be returned is always available at exactly the right time.
If the group is ignored, the outer iterator needs to be smart enough to read through the input iterator until the next group is encountered:
    names = ['Tim D', 'Jack D', 'Jack J', 'Barry W', 'Tim P']
    firstname = lambda n: n.split()[0]
    names.sort()
    unique_first_names = [first for first, _ in groupby(firstname, names)]
    ['Barry', 'Jack', 'Tim']
I don't think those semantics should be implemented. You should be required to iterate through each group. I was just thinking that returning the key might save the caller cumbersome logic if the key is needed but the inner iterator is also needed. The sum-by-group example would become much uglier:

    totals = {}
    for group in groupby(keyfunc, sequence):
        first = group.next()
        key = keyfunc(first)
        totals[key] = first + sum(group, 0)
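To make the contrast concrete, here is a small sketch that runs both styles side by side. The groups_only() helper is hypothetical: it only simulates a groupby() that yields bare groups, built on top of the standard library's itertools.groupby (whose argument order is iterable first, key second). The data and keyfunc are invented for the example.

```python
from itertools import groupby

def groups_only(keyfunc, sequence):
    """Hypothetical key-less variant: yields only the inner group iterators."""
    for _key, group in groupby(sequence, key=keyfunc):
        yield group

def keyfunc(n):
    # Illustrative key: which hundred the number falls in.
    return n // 100

sequence = [105, 110, 230, 240, 250, 300]  # already ordered by key

# Without the key, the caller peels off the first item to recompute it.
totals_without_key = {}
for group in groups_only(keyfunc, sequence):
    first = next(group)
    totals_without_key[keyfunc(first)] = first + sum(group, 0)

# With (key, group) tuples, no extra bookkeeping is needed.
totals_with_key = {}
for key, group in groupby(sequence, key=keyfunc):
    totals_with_key[key] = sum(group)
```

Both loops build the same dictionary; the second simply avoids calling keyfunc again and special-casing the first element.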
In experimenting with groupby(), I am starting to see a need for a high speed data extractor function. This need is common to several tools that take function arguments (like list.sort(key=)).
Exactly: it was definitely inspired by list.sort(key=).
While extractor functions can be arbitrarily complex, many only fetch a specific attribute or element number. Alex's high-speed curry suggests that it is possible to create a function maker for fast lookups:
    students.sort(key=extract('grade'))    # key=lambda r: r.grade
    students.sort(key=extract(2))          # key=lambda r: r[2]
Perhaps we could do this by changing list.sort() and groupby() to take a string or int as first argument to mean exactly this. For the string case I had thought of this already (in my second shower today :-); the int case makes sense too. (Though it may weaken my objection against property('foo') in a different thread. :-)

But I recommend holding off on this -- the "pure" groupby() has enough merit without speed hacks, and I find the clarity it provides more important than possible speed gains. I expect that the original, ugly code is usually faster, but in the cases where I've needed this I don't care: either the sequence isn't all that long, or the program doesn't run all that frequently, or it does so much other stuff that the speed gain would be drowned in the noise.

--Guido van Rossum (home page: http://www.python.org/~guido/)
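As a runnable illustration of the extract('grade') and extract(2) idea discussed above: the standard library's operator module provides attrgetter and itemgetter, which cover these two lookups. This sketch uses them in place of the proposed extract(), which is not a real function. The Student record type and the sample rows are assumptions made up for the example.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

# Hypothetical record type; 'grade' is both an attribute and element 2.
Student = namedtuple("Student", ["name", "age", "grade"])
students = [Student("Ann", 20, "B"), Student("Bob", 19, "A"), Student("Cy", 21, "C")]

# Attribute lookup, equivalent to key=lambda r: r.grade
by_attr = sorted(students, key=attrgetter("grade"))

# Element lookup, equivalent to key=lambda r: r[2]
by_index = sorted(students, key=itemgetter(2))
```

The same key callables can be passed to list.sort(key=...) or to itertools.groupby as its key argument.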