
In the shower (really!) I was thinking about the old problem of going through a list of items that are supposed to be grouped by some key, and doing something extra at the end of each group. I usually end up doing something ugly like this:

    oldkey = None
    for item in sequence:
        newkey = item.key  # this could be any function of item
        if newkey != oldkey and oldkey is not None:
            ...do group processing...
        oldkey = newkey
        ...do item processing...
    ...do group processing...  # for final group

This is ugly because the group processing code has to be written twice (or turned into a mini-subroutine); it also doesn't handle empty sequences correctly. Solutions based on using an explicit index and peeking ahead are similarly cumbersome and hard to get right for all end cases.

So I realized this is easy to do with a generator, assuming we can handle keeping a list of all items in a group. Here's the generator:

    def groupby(key, iterable):
        it = iter(iterable)
        value = it.next()  # If there are no items, this takes an early exit
        oldkey = key(value)
        group = [value]
        for value in it:
            newkey = key(value)
            if newkey != oldkey:
                yield group
                group = []
                oldkey = newkey
            group.append(value)
        yield group

Here's the usage ("item.key" is just an example):

    for group in groupby(lambda item: item.key, sequence):
        for item in group:
            ...item processing...
        ...group processing...

The only caveat is that if a group is very large, this accumulates all its items in a large list. I expect the generator can be reworked to return an iterator instead, but getting the details worked out seems too much work for a sunny Thanksgiving morning. :-)

Example:

    # Print lines of /etc/passwd, sorted, grouped by first letter
    lines = open("/etc/passwd").readlines()
    lines.sort()
    for group in groupby(lambda s: s[0], lines):
        print "-"*10
        for line in group:
            print line,
    print "-"*10

Maybe Raymond can add this to the itertools module? Or is there a more elegant approach than my original code that I've missed all these years?

--Guido van Rossum (home page: http://www.python.org/~guido/)
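[The lazy variant Guido defers here is essentially what later shipped as itertools.groupby() in Python 2.4, with the argument order swapped to (iterable, key). Below is a minimal sketch of such a variant, in today's Python 3 syntax for runnability; the name lazy_groupby and the (key, group-iterator) pair shape are illustrative assumptions, not code from the thread:]

    def lazy_groupby(key, iterable):
        # Yield (key, group_iterator) pairs; items are streamed one at
        # a time instead of being accumulated in a list per group.
        it = iter(iterable)
        exhausted = False
        try:
            curr = next(it)
        except StopIteration:
            return
        curr_key = key(curr)
        while not exhausted:
            target = curr_key

            def group():
                nonlocal curr, curr_key, exhausted
                while not exhausted and curr_key == target:
                    yield curr
                    try:
                        curr = next(it)
                    except StopIteration:
                        exhausted = True
                        return
                    curr_key = key(curr)

            g = group()
            yield target, g
            # Drain whatever the caller skipped, so the next group
            # starts in the right place (the semantics the thread
            # settles on below).
            for _ in g:
                pass

    # Usage:
    #   for k, g in lazy_groupby(lambda s: s[0], ["aa", "ab", "ba"]):
    #       print(k, list(g))    # a ['aa', 'ab'] / b ['ba']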

[Guido, on grouping elements of a sequence by key]
I've always done it like:

    d = {}
    for x in sequence:
        d.setdefault(key(x), []).append(x)
    # Now d has partitioned sequence by key.  The keys are
    # available as d.keys(), the associated groups as d.values().
    # So, e.g.,
    for key, group in d.iteritems():
        d[key] = sum(group)

There's no code duplication, or warts for an empty sequence, which are the ugly parts of the non-dict approach. It doesn't matter here whether the elements originally appear with equal keys all adjacent, and input often isn't sorted that way. When it isn't, not needing to sort first can be a major time savings if the sequence is big. Against it, a dict is a large data structure. I don't think it's ever been a real problem that it requires keys to be hashable. groupby() looks very nice when it applies.
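[A runnable rendering of Tim's dict partitioning in today's Python 3 syntax, using collections.defaultdict; the partition name and the sample words are made up for illustration:]

    from collections import defaultdict

    def partition(key, sequence):
        # Group items by key(x); no sorting requirement, only that
        # keys are hashable.
        groups = defaultdict(list)
        for x in sequence:
            groups[key(x)].append(x)
        return dict(groups)

    words = ["apple", "avocado", "banana", "blueberry", "cherry"]
    totals = {k: sum(len(w) for w in g)
              for k, g in partition(lambda w: w[0], words).items()}
    # totals == {'a': 12, 'b': 15, 'c': 6}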
Or

    totals = dict((key, sum(group))
                  for key, group in groupby(keyfunc, sequence))

exploiting generator expressions too.

[after Raymond wonders about cases where the consumer doesn't iterate over the group generators]
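[For comparison, the same totals one-liner against the itertools.groupby() that ultimately shipped -- argument order swapped, and the input must already be grouped by key; the sales records are made-up sample data:]

    from itertools import groupby

    sales = [("mon", 3), ("mon", 5), ("tue", 2), ("tue", 7), ("tue", 1)]
    totals = {day: sum(amount for _, amount in group)
              for day, group in groupby(sales, key=lambda rec: rec[0])}
    # totals == {'mon': 8, 'tue': 10}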
I don't think those semantics should be implemented. You should be required to iterate through each group.
Brrrr. Sounds error-prone (hard to explain, and impossible to enforce unless the implementation does almost all the work it would need to allow groups to get skipped -- if the implementation can detect that a group hasn't been fully iterated, then it could almost as easily go on to skip over remaining equal keys itself instead of whining about it; but if the implementation can't detect it, accidental violations of the requirement will be hard to track down).

You're a security guy now. You've got a log with line records of the form

    month day hhmmss severity_level threat_id

It's sorted ascending by month, then descending by severity_level. You want a report of the top 10 threats seen each month:

    for month, lines in groupby(lambda s: s.split()[0], input_file):
        print month
        for line in itertools.islice(lines, 10):
            print line,

Like array[:10], islice() does the right thing if there are fewer than 10 lines in a month. It's just not natural to require that an iterator be run to exhaustion (if it *were* natural, this wouldn't be the first context ever to require it <wink>).
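[A runnable Python 3 rendering of Tim's report against the itertools.groupby() that shipped, with made-up log lines; note that groupby() itself skips past any items the caller leaves unconsumed when it advances to the next group:]

    from itertools import groupby, islice

    # Made-up records: month day hhmmss severity threat_id, sorted
    # ascending by month and descending by severity within a month.
    log = [
        "jan 03 121500 9 worm.a",
        "jan 07 080102 7 scan.b",
        "jan 11 230001 3 probe.c",
        "feb 02 101010 8 worm.a",
        "feb 19 050505 2 probe.c",
    ]

    for month, lines in groupby(log, key=lambda s: s.split()[0]):
        print(month)
        for line in islice(lines, 2):   # "top 2" for this tiny sample
            print("   ", line)
    # groupby() discards the rest of a partially-read group when the
    # outer loop asks for the next month, so islice() is safe here.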

The major downside is that this keeps everything in memory. When that's acceptable, it's a great approach (especially because it doesn't require sorting). But often you really want to be able to handle input of arbitrary size. For example, suppose you are given a file with some kind of records, timestamped and maintained in chronological order (e.g. a log file -- a perfect example of data that won't fit in memory and is already sorted). You're supposed to output this for printing, while inserting a header at the start of each day and a footer at the end of each day with various counts or totals per day.
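[A sketch of the header/footer report Guido describes, streaming via itertools.groupby() so that no day's records are ever held in a list; the record layout and the day_of helper are hypothetical:]

    from itertools import groupby

    def day_of(line):
        # Hypothetical record layout: "YYYY-MM-DD HH:MM:SS message\n"
        return line.split(" ", 1)[0]

    def daily_report(infile, outfile):
        for day, lines in groupby(infile, key=day_of):
            outfile.write("=== %s ===\n" % day)            # header
            count = 0
            for line in lines:    # streamed; never a full day's list
                outfile.write(line)
                count += 1
            outfile.write("--- %d records on %s ---\n"
                          % (count, day))                  # footer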
groupby() looks very nice when it applies.
Right. :-)
Nice. When can we get these? :-)
I take it back after seeing Raymond's implementation -- it's simple enough to make sure that each group is exhausted before starting the next group, and this is clearly the "natural" semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Unless someone in the know volunteers, it will need to wait until Christmas vacation. Currently, the implementation is beyond my skill level. It will take a while to raise my skills to cover adding new syntax and what to do in the compiler.

Raymond

[Raymond Hettinger]
[Neil Schemenauer]
I wonder if we should try to finish the new compiler first.
That's the rub -- if I have time to move 2.4 along, I'll first give it to advancing the AST branch. Teaching the current front end new parsing tricks would be an exercise in obsolescence.

participants (4):
- Guido van Rossum
- Neil Schemenauer
- Raymond Hettinger
- Tim Peters