"in" operator for strings

Thu Feb 1 10:26:43 EST 2001

<luthi at vaw.baug.ethz.ch> wrote in message
news:yfmwvbad25f.fsf at collet.i-did-not-set--mail-host-address--so-shoot-me...
> "Magnus Lie Hetland" <mlh at idi.ntnu.no> writes:
>
> > But you can't do what you ask for, just like you can't write
> >
> >   [1, 2] in [1, 2, 3, 4]
>
> This is exactly one of my current puzzles: How to can I compare that a set
is
> a subset of another, larger, set efficiently? Say I have to perform such a

Are you actually dealing with sets, and subsets, or lists, lists of lists,
or maybe lists of sets?

> comparison some 100000 times, it would come in very handy to write
>
> (assume all_sets is a list of 100000 sets (i.e. lists))
>
> for set in all_sets:
> if [1,2,50] in set:
> do_something
>
> any good suggestion?

If you need map your sets as lists, then you aren't going to
get substantially better behavior than by:

def is_subset(small_set, big_set):
    for item in small_set:
        if item not in big_set: return 0
    return 1

That's gonna be SLOW -- but that depends on the representation
you've chosen for sets.

At the other extreme, if your sets were represented as
bit-vectors (feasible iff their items are or can be efficiently
mapped to small integers), then you might get VERY good
performance.  I've had occasion to use the GMP bit-oriented
functions (which gmpy exposes) for that -- sets of integers
in a reasonably compact range such as [0,1024), with each
set being a '1024-bits vector' (just 128 bytes); what gmpy
exposes could quite easily be wrapped up for such purposes
(right now it's somewhat un-handy because, sigh, "numbers
are immutable" has BIG performance implications for numbers
used as bit-vectors...!).

If it's unfeasible to map your sets' items to a reasonably
compact range of integers, and so a bit-vector approach is
out, then dictionaries are probably your best bet (although,
depending on exact usage patterns, the Standard C++ Library
approach of sets-as-red-black-trees will have occasional wins
in the performance arena, too).  Wrapping your dictionary
into a class, so you can have the latter implement
__contains__ etc and get nice syntax sugar, is not too hard
(but nice syntax sugar is rarely the key issue when one is
dealing with hundreds of thousands of sets -- survivable
performance tends to be higher-priority in such cases:-).

The subset-test approach doesn't change much with a dict
implementation, though -- it's just that each membership
test it uses is a bit faster.

Alex