itertools.intersect?
David M. Wilson
dw at botanicus.net
Thu Jun 11 00:03:10 EDT 2009
On Jun 10, 11:24 pm, David Wilson <d... at botanicus.net> wrote:
> Hi,
>
> During a fun coding session yesterday, I came across a problem that I
> thought was already solved by itertools, but on investigation it seems
> it isn't.
>
> The problem is simple: given one or more ordered sequences, return
> only the objects that appear in each sequence, without reading the
> whole set into memory. This is basically an SQL many-many join.
>
> I thought it could be accomplished through recursively embedded
> generators, but that approach failed in the end. After posting the
> question to Stack Overflow[0], Martin Geisler proposed a wonderfully
> succinct and reusable solution (see below, or pretty printed at the
> Stack Overflow URL).
>
> It is my opinion that this particular implementation is a wonderful
> and incredibly valid use of iterators, and something that could be
> reused by others, certainly least not myself again in the future. With
> that in mind I thought it, or something very similar, would be a great
> addition to the itertools module.
>
> My question then is, are there better approaches to this? The heapq-
> based solution at the Stack Overflow page is potentially more useful
> still, for its ability to operate on orderless sequences, but in that
> case, it might be better to simply listify each sequence, and sort it
> before passing to the ordered-only functions.
>
> Thanks,
>
> David.
>
> Stack Overflow page here:
>
> http://stackoverflow.com/questions/969709/joining-a-set-of-ordered-in...
>
> Sweet solution:
>
> import operator
>
> def intersect(sequences):
> """Compute intersection of sequences of increasing integers.
>
> >>> list(intersect([[1, 100, 142, 322, 12312],
> ... [2, 100, 101, 322, 1221],
> ... [100, 142, 322, 956, 1222]]))
> [100, 322]
>
> """
> iterators = [iter(seq) for seq in sequences]
> last = [iterator.next() for iterator in iterators]
> indices = range(len(iterators))
> while True:
> # The while loop stops when StopIteration is raised. The
> # exception will also stop the iteration by our caller.
> if reduce(operator.and_, [l == last[0] for l in last]):
> # All iterators contain last[0]
> yield last[0]
> last = [iterator.next() for iterator in iterators]
>
> # Now go over the iterators once and advance them as
> # necessary. To stop as soon as the smallest iterator we
> # advance each iterator only once per loop iteration.
> for i in indices[:-1]:
> if last[i] < last[i+1]:
> last[i] = iterators[i].next()
> if last[i] > last[i+1]:
> last[i+1] = iterators[i+1].next()
I found my answer: Python 2.6 introduces heap.merge(), which is
designed exactly for this.
Thanks all,
David.
More information about the Python-list
mailing list