itertools.intersect?

Mensanator mensanator at aol.com
Thu Jun 11 18:23:46 EDT 2009


On Jun 11, 1:54 pm, Terry Reedy <tjre... at udel.edu> wrote:
> Jack Diederich wrote:
> > On Thu, Jun 11, 2009 at 12:03 AM, David M. Wilson<d... at botanicus.net> wrote:
> > [snip]
> >> I found my answer: Python 2.6 introduces heapq.merge(), which is
> >> designed exactly for this.
>
> > Thanks, I knew Raymond added something like that but I couldn't find
> > it in itertools.
> > That said .. it doesn't help.  Aside, heapq.merge fits better in
> > itertools (it uses heaps internally but doesn't require them to be
> > passed in).  The other function that almost helps is
> > itertools.groupby() and it doesn't return an iterator so is an odd fit
> > for itertools.
>
> > More specifically (and less curmudgeonly) heapq.merge doesn't help for
> > this particular case because you can't tell where the merged values
> > came from.  You want all the iterators to yield the same thing at once
> > but heapq.merge muddles them all together (but in an orderly way!).
> > Unless I'm reading your tokenizer func wrong it can yield the same
> > value many times in a row.  If that happens you don't know if four
> > "The"s are once each from four iterators or four times from one.
>
> David is looking to intersect sorted lists of document numbers with
> duplicates removed in order to find documents that contain worda and
> wordb and wordc ... .  But you are right that duplicates are a possible
> fly in the ointment to be removed before merging.

Removing the duplicates could be a big problem.

With SQL, the duplicates don't have to be removed by hand.
All I have to do is change "SELECT" to "SELECT DISTINCT"
to change

100 100 100 322 322 322 322 322 322 322 322

into

100 322
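
For comparison, here is a rough Python sketch of the same idea, not
anyone's actual code from this thread. It assumes every input is
already sorted; the names distinct() and intersect() are made up for
illustration. itertools.groupby collapses runs of equal values lazily,
and heapq.merge (Python 2.6+) combines the de-duplicated streams so a
value that shows up once per input must be in all of them.

```python
from itertools import groupby
import heapq

def distinct(sorted_iter):
    """Yield each value once from an already-sorted iterable (hypothetical helper)."""
    for key, _group in groupby(sorted_iter):
        yield key

def intersect(*sorted_iters):
    """Yield values present in every input; inputs are assumed sorted (hypothetical helper)."""
    n = len(sorted_iters)
    # After de-duplication each input contributes a value at most once,
    # so a run of length n in the merged stream means all n inputs had it.
    merged = heapq.merge(*(distinct(it) for it in sorted_iters))
    for key, group in groupby(merged):
        if sum(1 for _ in group) == n:
            yield key

docs = [100, 100, 100, 322, 322, 322, 322, 322, 322, 322, 322]
print(list(distinct(docs)))  # [100, 322]
```

Both helpers are generators, so nothing is held in memory beyond the
current run of equal values.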





