David Wilson dw at
Thu Jun 11 00:24:13 CEST 2009


During a fun coding session yesterday, I came across a problem that I
thought was already solved by itertools, but on investigation it seems
it isn't.

The problem is simple: given one or more ordered sequences, return
only the objects that appear in each sequence, without reading the
whole set into memory. This is basically an SQL many-many join.

I thought it could be accomplished through recursively embedded
generators, but that approach failed in the end. After posting the
question to Stack Overflow[0], Martin Geisler proposed a wonderfully
succinct and reusable solution (see below, or pretty printed at the
Stack Overflow URL).

It is my opinion that this particular implementation is a wonderful
and incredibly valid use of iterators, and something that could be
reused by others, certainly least not myself again in the future. With
that in mind I thought it, or something very similar, would be a great
addition to the itertools module.

My question then is, are there better approaches to this? The heapq-
based solution at the Stack Overflow page is potentially more useful
still, for its ability to operate on orderless sequences, but in that
case, it might be better to simply listify each sequence, and sort it
before passing to the ordered-only functions.



Stack Overflow page here:

Sweet solution:

    import operator

    def intersect(sequences):
        """Compute intersection of sequences of increasing integers.

        >>> list(intersect([[1,   100, 142, 322, 12312],
        ...                 [2,   100, 101, 322, 1221],
        ...                 [100, 142, 322, 956, 1222]]))
        [100, 322]

        iterators = [iter(seq) for seq in sequences]
        last = [ for iterator in iterators]
        indices = range(len(iterators))
        while True:
            # The while loop stops when StopIteration is raised. The
            # exception will also stop the iteration by our caller.
            if reduce(operator.and_, [l == last[0] for l in last]):
                # All iterators contain last[0]
                yield last[0]
                last = [ for iterator in iterators]

            # Now go over the iterators once and advance them as
            # necessary. To stop as soon as the smallest iterator we
            # advance each iterator only once per loop iteration.
            for i in indices[:-1]:
                if last[i] < last[i+1]:
                    last[i] = iterators[i].next()
                if last[i] > last[i+1]:
                    last[i+1] = iterators[i+1].next()

More information about the Python-list mailing list