fastest way to find the intersection of n lists of sets

Prateek surekap at gmail.com
Sun Apr 29 20:48:36 EDT 2007


On Apr 30, 5:08 am, John Nagle <n... at animats.com> wrote:
> Prateek wrote:
> >>     For the above example, it's worth sorting lists_of_sets by the
> >>length of the sets, and doing the short ones first.
>
> > Thanks. I thought so - I'm doing just that using a simple Decorate-
> > Sort-Undecorate idiom.
>
> >>     How big are the sets?  If they're small, but you have a lot of
> >>them, you may be better off with a bit-set representation, then
> >>using AND operations for intersection.  If they're huge (tens of millions
> >>of entries), you might be better off doing sorts and merges on the
> >>sets.
>
> > I have either 2 or 3 sets (never more) which can be arbitrarily large.
> > Most of them are small (between 0 and a few elements - say less than 5).
> > A few will be megamonstrous ( > 100,000 items)
>
> >>     When you ask questions like this, it's helpful to give some
> >>background.  We don't know whether this is a homework assignment, or
> >>some massive application that's slow and you need to fix it, even
> >>if it requires heavy implementation effort.
>
> > Its definitely not a homework assignment - its part of a commercial
> > database query engine. Heavy implementation effort is no problem.
>
> > Prateek
>
>     If you're intersecting a set of 5 vs a set of 100,000, the
> intersection time won't be the problem.  That's just five lookups.
> It's building a set of 100,000 items that may be time consuming.
>
>     Does the big set stay around for a while, or do you have to pay
> that cost on each query?
>
>     Those really aren't big data sets on modern machines.
>
>                                         John Nagle

100,000 is an arbitrary number - it's roughly equivalent to the number
of unique cells across all tables of a typical database (that's the
best analogy I can come up with, since this isn't a typical RDBMS).

The big set does stay around for a while - I've implemented an LRU-
based caching algorithm in the code that does the I/O. Since the db is
transactional, I keep one copy in the current transaction cache (a
simple dictionary) and one in the main read cache (LRU-based), which
survives across transactions. Since the largest sets also tend to be
the most frequently used, this basically solves my I/O caching issue.
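In case it helps to see the shape of it, here's a minimal sketch of that
two-tier lookup - the names (LRUCache, get_set, load_set, txn_cache,
read_cache) are made up for illustration, not the actual engine code:

class LRUCache(object):
    """Tiny LRU cache: a dict plus a usage-ordered list of keys."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.order = []                  # least recently used first

    def get(self, key):
        if key in self.data:
            self.order.remove(key)       # O(n), fine for a sketch
            self.order.append(key)
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.order.remove(key)
        elif len(self.data) >= self.capacity:
            oldest = self.order.pop(0)   # evict least recently used
            del self.data[oldest]
        self.data[key] = value
        self.order.append(key)

def get_set(key, txn_cache, read_cache, load_set):
    """Transaction cache first, then the shared LRU cache, then I/O."""
    if key in txn_cache:                 # current transaction's view
        return txn_cache[key]
    cached = read_cache.get(key)         # shared, survives transactions
    if cached is not None:
        return cached
    value = load_set(key)                # load_set() does the actual I/O
    read_cache.put(key, value)
    return value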

My problem was that I had ugly code doing a lot of unnecessary list <->
set conversion. Here's what I've got now:

from itertools import chain

ids1 = [...]   # placeholder ID lists
ids2 = [...]
ids3 = [...]
_efs = frozenset()   # shared empty-set default for the .get() calls

# the dbN.get calls return sets
l1 = frozenset(chain(*[db1.get(x, _efs) for x in ids1]))
l2 = frozenset(chain(*[db2.get(x, _efs) for x in ids2]))
l3 = frozenset(chain(*[db3.get(x, _efs) for x in ids3]))

# decorate-sort-undecorate: intersect the smallest set first
decorated = [(len(x), x) for x in [l1, l2, l3]]
decorated.sort()
result = reduce(frozenset.intersection, [x[1] for x in decorated])
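For what it's worth, the same thing generalizes to n (db, ids) pairs
without the explicit decorate step - just a sketch, assuming the same
dbN.get(x, _efs) convention as above (intersect_all is a made-up name):

from itertools import chain

def intersect_all(pairs, _efs=frozenset()):
    """pairs is a sequence of (db, ids) tuples; db.get(x, _efs) returns a set."""
    # Build one frozenset per (db, ids) pair, exactly as above.
    sets = [frozenset(chain(*[db.get(x, _efs) for x in ids]))
            for db, ids in pairs]
    if not sets:
        return _efs                      # no inputs -> empty result
    # Intersect smallest-first so the running result stays small;
    # sort(key=len) replaces the decorate/sort/undecorate step.
    sets.sort(key=len)
    return reduce(frozenset.intersection, sets)

e.g. result = intersect_all([(db1, ids1), (db2, ids2), (db3, ids3)])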

What do you think?

Prateek



