[Python-Dev] Efficient set complement and operation on large/infinite sets.

Fri May 12 00:55:16 CEST 2006

I'm about to write some code to manage sets, and wanted to float a few
thoughts here because I have various ideas about how to implement what I
want to do, and I think one of them could be done by changing Python's set
type in useful and backward compatible way.

Apologies if this is discussed in the archives. I didn't see it.

My original motivation is a need to work efficiently with very large sets,
let's say 10M elements. In some cases, you can operate on a set without
needing to explicitly represent all its elements. For example, if my set is
the Universal (U) set, then X in U, should return True. If my set is U and
I union it with any other set, it is unchanged, etc. By also maintaining a
finite set of things that are _not_ in my large set, I can do other
operations efficiently. E.g., calling U.remove(X) would _add_ X to the set
of elements that were known not to be in the big set that you were trying
to represent efficiently. In most cases, such a set class would simply do
the opposite/inverse operation on its finite exclusion set.

While it's convenient to warm up by talk about the infinite Universal set,
we could just as easily be talking about arbitrarily large finite sets.
There are some examples below.

Another way to discuss the same thing is to talk about set complements. It
would be nice to be able to complement (or invert) a set. Once you did so,
you might have an infinite set on your hands, but as the last paragraph
argues, you can still operate on an infinite set efficiently. Naturally,
you can't fully enumerate it, or take its size.

I think that giving Python sets the ability to handle complementarity would
have some nice uses - and that these go well beyond the particular thing
that I need right now.

For example, suppose you want to represent the set of all floating point
numbers. You just create an empty set and complement it. Then you can test
it as usual, and begin to exclude things from it. Or if you have a system
with 10M objects and you need to support operations on these objects via a
query language that lets the user type "not X" where X is a set of objects,
there is a natural and efficient way (constant time) to execute the query -
you just mark the set as being inverted (complemented).

If I were going to implement the above by changing Python's built in set
type, here's what I think I'd do:

Add four optional keyword arguments to the set constructor:

  - invert=False
  - universalSet=None
  - universalSetSize=None
  - universeExclusionFunc=None

invert is just a flag, indicating the sense of the set.

If inverse is False:

  the set behaves exactly as a normal Python set and the other three
  new arguments are ignored.

If inverse is True:

  universalSet represents the universal set. If it is None, then the
  universal set is infinite. Otherwise, universalSet is an iterable. The
  implementation should not call this iterable unless it's unavoidable, on
  the presumption that if the programmer wanted to operate directly on the
  contents of this iterable, they'd be doing it in a conventional fashion
  (e.g., with a normal set). The universalSetSize, used for len()
  calculations, is the number of elements in universalSet, if known &
  if finite.

  The universeExclusionFunc can be called with a single element argument to
  see if the element should be considered excluded from the universalSet.
  This may seem like a weird idea, but it's the key to flexibility and
  efficiency. In an invert=True set, the code would use the normal set
  content as a mutable set of objects that are not in the universe, as
  described above, but would, in addition, use the universeExclusionFunc
  (if defined) to identify elements not in the set (because they are not in
  the universe), and thus avoid the use of the expensive (or infinite)
  universalSet.

Note that an inverted (or normal) set can be inverted by simply setting
invert=False, so this requires a new invert() method (which could be
triggered by the use of something like 'not' or '!' or '~'). In this case,
an inverted set becomes a normal Python set. The elements represented by
universeExclusionFunc remain invalid - they are simply not in the universe
(and, if deemed sensible, this could be sanity checked in add()).

If it's not clear, when a set is inverted, any iterable given to __init__,
(i.e., the iterable argument in the normal case of constructing a Python
set), is just treated as usual (but in this case will be taken to
represent things not in the set initially).

Here are some examples of usage:

1. Suppose you want to work with the set of integers [0, 100000000) and
   that initially your set is all such integers. You could create this set
   via:

       S = set(invert=True,
               universalSet=xrange(100000000),
               universalSetSize=100000000,
               universeExclusionFunc=(lambda x: x >= 100000000)

   This has the intended effect, is efficient, and no-one need call the
   iterator. You can (I think) do everything you could do with a normal set
   (including inverting it). If the user happens to want to list all the
   elements in the set, that's their business and they can do it, except in
   the case where universalSet=None, in which case I would raise an
   exception.

2. Suppose you wanted to represent the set of all floating point numbers
   > 0.0 that were not integers. You'd do this via:

       def excludeFunc(f):
           try:
               f = float(f)
           except ValueError:
               return True
           if f <= 0.0:
               return True
           if f - math.floor(f) == 0.0:
               return True
           return False

       S = set(invert=True, universeExclusionFunc=excludeFunc)

3. Let's say you wanted to work with the set of all IP addresses, as
   represented by lists of 4 integers, but that you want to exclude
   127.0.0.1, and all the reserved RFC1918 address blocks

       def excludeFunc(ip):
           # Insert code to check type and length.
           if (ip == [127, 0, 0, 1]):
               return True
           if (ip[0] == 10 or 
               (172, 16, 0, 0) <= ip <= (172, 31, 255, 255) or
               (191, 168, 0, 0) <= ip <= (192, 168, 255, 255)):
               return True
           # Insert code to check all bytes in [0..255], return True if not.
           return False

       S = set(invert=True, universeExclusionFunc=excludeFunc)

It's pretty easy to think of other examples of wanting to efficiently use
big sets, or to use small sets and then needing to complement them and
operate efficiently on their large complements.

One reaction to these examples might be that it's no big deal to write your
own code to, say, manage sets of IP addresses. I think that's not so easy
to do efficiently if the set is massive (and you'd end up keeping track of
the complement or using offline storage). Doing it as suggested above makes
it simple, and once you call the constructor, you just do normal Python set
operations.

This suggestion goes a little way towards giving "continuous" sets. One way
to categorize set support is:

  A. Only support finite sets.

  B. Provide category A but also support infinite sets where at least
     one of the set and its complement is finite.

  C. Provide category B but also allow for the case where a set and its
     complement may both be infinite.

Thus I am proposing something that would take set support in Python from
category A to category B.  Implementing B efficiently is provably possible
since all operations are done on either a set or its complement, whichever
is known to be finite, and Python already handles those operations (via its
category A support).

The downsides that I see at the moment are:

1. Two new exceptions that can arise. These only happen on sets that are in
   the inverted state, and only when the universal set is infinite:

     - You try to enumerate an infinite set.
     - You call len() on an infinite set.

2. Operations on regular sets get slightly slower because virtually all set
   operations would need to test the invert flag. So that's an extra
   Boolean test on every set operation.

I don't think the implementation is too hard - the Python sets.py code is
very clean - though there may be gotchas.

Anyway, I wanted to see what people think of this before I go ahead and
solve my actual problem. If the above is considered potentially worthwhile
to Python, I can work on adapting sets.py. If not, I'll do the faster and
simpler thing and write my own class.

Regards,
Terry