itertools.intersect?

Thu Jun 11 17:27:32 EDT 2009

On Thu, Jun 11, 2009 at 2:54 PM, Terry Reedy<tjreedy at udel.edu> wrote:
> Jack Diederich wrote:
>>
>> On Thu, Jun 11, 2009 at 12:03 AM, David M. Wilson<dw at botanicus.net> wrote:
>> [snip]
>>>
>>> I found my answer: Python 2.6 introduces heap.merge(), which is
>>> designed exactly for this.
>>
>> Thanks, I knew Raymond added something like that but I couldn't find
>> it in itertools.
>> That said .. it doesn't help.  Aside, heapq.merge fits better in
>> itertools (it uses heaps internally but doesn't require them to be
>> passed in).  The other function that almost helps is
>> itertools.groupby() and it doesn't return an iterator so is an odd fit
>> for itertools.
>>
>> More specifically (and less curmudgeonly) heap.merge doesn't help for
>> this particular case because you can't tell where the merged values
>> came from.  You want all the iterators to yield the same thing at once
>> but heapq.merge muddles them all together (but in an orderly way!).
>> Unless I'm reading your tokenizer func wrong it can yield the same
>> value many times in a row.  If that happens you don't know if four
>> "The"s are once each from four iterators or four times from one.
>
> David is looking to intersect sorted lists of document numbers with
> duplicates removed in order to find documents that contain worda and wordb
> and wordc ... .  But you are right that duplicate are a possible fly in the
> ointment to be removed before merging.

Ah, in that case the heap.merge solution is both useful and succinct:

import heapq
import itertools
def intersect(its):
    source = heapq.merge(*its)
    while True:
        sames = [source.next()]
        sames.extend(itertools.takewhile(lambda v:v == sames[0], source))
        if len(sames) == len(its):
            yield sames[0]
    return

-Jack