jackdied at gmail.com
Thu Jun 11 17:27:32 EDT 2009
On Thu, Jun 11, 2009 at 2:54 PM, Terry Reedy<tjreedy at udel.edu> wrote:
> Jack Diederich wrote:
>> On Thu, Jun 11, 2009 at 12:03 AM, David M. Wilson<dw at botanicus.net> wrote:
>>> I found my answer: Python 2.6 introduces heapq.merge(), which is
>>> designed exactly for this.
>> Thanks, I knew Raymond added something like that but I couldn't find
>> it in itertools.
>> That said .. it doesn't help. Aside, heapq.merge fits better in
>> itertools (it uses heaps internally but doesn't require them to be
>> passed in). The other function that almost helps is
>> itertools.groupby() and it doesn't return an iterator so is an odd fit
>> for itertools.
>> More specifically (and less curmudgeonly) heapq.merge doesn't help for
>> this particular case because you can't tell where the merged values
>> came from. You want all the iterators to yield the same thing at once
>> but heapq.merge muddles them all together (but in an orderly way!).
>> Unless I'm reading your tokenizer func wrong it can yield the same
>> value many times in a row. If that happens you don't know if four
>> "The"s are once each from four iterators or four times from one.
> David is looking to intersect sorted lists of document numbers with
> duplicates removed in order to find documents that contain worda and wordb
> and wordc ... . But you are right that duplicates are a possible fly in the
> ointment to be removed before merging.
Ah, in that case the heapq.merge solution is both useful and succinct:
source = heapq.merge(*its)
sames = [next(source)]
sames.extend(itertools.takewhile(lambda v: v == sames[0], source))
if len(sames) == len(its):
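
For reference, here is a minimal, self-contained sketch of the same idea, assuming every input iterator is sorted ascending and already has its duplicates removed (as Terry describes). The names intersect_sorted, docs_a, docs_b and docs_c are made up for illustration. It swaps takewhile for itertools.groupby, because takewhile consumes the first non-matching value from the merged stream, which makes it awkward to repeat in a loop.

```python
import heapq
import itertools


def intersect_sorted(*its):
    """Yield the values that appear in every input iterator.

    Assumption: each input is sorted ascending and contains no
    duplicates, so a value seen len(its) times in the merged stream
    must have come once from each input.
    """
    merged = heapq.merge(*its)
    # groupby clusters consecutive equal values of the merged stream.
    for value, group in itertools.groupby(merged):
        if sum(1 for _ in group) == len(its):
            yield value


# Hypothetical sorted, duplicate-free document-number lists.
docs_a = [1, 3, 5, 8, 13]
docs_b = [3, 4, 5, 13, 21]
docs_c = [2, 3, 13, 34]

common = list(intersect_sorted(docs_a, docs_b, docs_c))
print(common)  # [3, 13]
```

If an input can repeat a value, the count check no longer proves that the value came from every input, which is the ambiguity raised earlier in the thread.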