[Python-ideas] Possible new itertool: comm()

Paul Moore p.f.moore at gmail.com
Tue Jan 6 17:22:23 CET 2015


In writing a utility script today, I found myself needing to do
something similar to what the Unix "comm" utility does - take two
sorted iterators, and partition the values into "only in the first",
"only in the second", and "in both" groups.

I couldn't find an obvious implementation around, so I ended up having
to write my own (see below).

Would this be something useful to add to the itertools module? The
reasons I think it might be worth including in the stdlib are:

1. The implementation is somewhat tricky to get right, especially the
edge cases (I'm not 100% sure my implementation is completely right in
this regard). So implementing it yourself each time is error-prone and
time consuming.
2. The times I've needed this have been ad-hoc scripts (I'm on
Windows, so while a Unix user might use a quick shell pipeline with
comm, that's less convenient for me) where depending on a 3rd party
distribution from PyPI is less ideal.

There's plenty of room for API bikeshedding here, but the basic
algorithm remains the same.

What do people think?

Paul

def comm(it1, it2):
    """Partition 2 sorted iterators into common, only in 1st, and only in 2nd.

    >>> list(comm([1,2,3], [2,3,4]))
    [('<', 1), ('=', 2), ('=', 3), ('>', 4)]
    >>> list(comm([0,1,2,3], [2,3,4,5]))
    [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)]
    >>> list(comm([0,1,2,3,6], [2,3,4,5]))
    [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)]
    >>> list(comm([0,1], []))
    [('<', 0), ('<', 1)]
    >>> list(comm([], [0,1]))
    [('>', 0), ('>', 1)]
    >>> list(comm([0,1], [0,1]))
    [('=', 0), ('=', 1)]
    >>> list(comm([], []))
    []
    """

    it1 = iter(it1)
    it2 = iter(it2)

    it1_finished = False
    it2_finished = False
    try:
        v1 = next(it1)
    except StopIteration:
        it1_finished = True
    try:
        v2 = next(it2)
    except StopIteration:
        it2_finished = True
    while not (it1_finished or it2_finished):
        if v1 < v2:
            yield ('<', v1)
            try:
                v1 = next(it1)
            except StopIteration:
                it1_finished = True
        elif v1 > v2:
            yield ('>', v2)
            try:
                v2 = next(it2)
            except StopIteration:
                it2_finished = True
        else:
            yield ('=', v1)
            try:
                v1 = next(it1)
            except StopIteration:
                it1_finished = True
            try:
                v2 = next(it2)
            except StopIteration:
                it2_finished = True

    if it1_finished and not it2_finished:
        yield ('>', v2)
        for v2 in it2:
            yield ('>', v2)
    if it2_finished and not it1_finished:
        yield ('<', v1)
        for v1 in it1:
            yield ('<', v1)

if __name__ == '__main__':
    import doctest
    doctest.testmod()


More information about the Python-ideas mailing list