[Python-Dev] "groupby" iterator

Guido van Rossum guido at python.org
Thu Nov 27 12:30:33 EST 2003


In the shower (really!) I was thinking about the old problem of going
through a list of items that are supposed to be grouped by some key,
and doing something extra at the end of each group.  I usually end up
doing something ugly like this:

    oldkey = None
    for item in sequence:
        newkey = item.key # this could be any function of item
        if newkey != oldkey and oldkey is not None:
            ...do group processing...
            oldkey = newkey
        ...do item processing...
    ...do group processing... # for final group

This is ugly because the group processing code has to be written twice
(or turned into a mini-subroutine); it also doesn't handle empty
sequences correctly.  Solutions based on using an explicit index and
peeking ahead are similarly cumbersome and hard to get right for all
end cases.

So I realized this is easy to do with a generator, assuming we can
handle keeping a list of all items in a group.  Here's the generator:

    def groupby(key, iterable):
        it = iter(iterable)
        value = it.next() # If there are no items, this takes an early exit
        oldkey = key(value)
        group = [value]
        for value in it:
            newkey = key(value)
            if newkey != oldkey:
                yield group
                group = []
                oldkey = newkey
            group.append(value)
        yield group

Here's the usage ("item.key" is just an example):

    for group in groupby(lambda item: item.key, sequence):
        for item in group:
            ...item processing...
        ...group processing...

The only caveat is that if a group is very large, this accumulates all
its items in a large list.  I expect the generator can be reworked to
return an iterator instead, but getting the details worked out seems
too much work for a summy Thanskgiving morning. :-)

Example:

    # Print lines of /etc/passwd, sorted, grouped by first letter
    lines = open("/etc/passwd").readlines()
    lines.sort()
    for group in groupby(lambda s: s[0], lines):
        print "-"*10
        for line in group: print line,
    print "-"*10

Maybe Raymond can add this to the itertools module?

Or is there a more elegant approach than my original code that I've
missed all these years?

--Guido van Rossum (home page: http://www.python.org/~guido/)



More information about the Python-Dev mailing list