[Python-Dev] Efficient set complement and operation on large/infinite sets.

Fri May 12 01:29:15 CEST 2006

Hm...  Without reading though all this, I expect that you'd be better
off implementing this for yourself without attempting to pull the
standard library sets into the picture (especially since sets.py is
obsolete as of 2.4; set and frozenset are now built-in types).  You're
really after rather specialized set representations. I've done this
myself and as long as you stick to solving *just* the problem at hand,
it's easy. If you want a general solution, it's hard...

--Guido

On 5/11/06, Terry Jones <terry at jon.es> wrote:
> I'm about to write some code to manage sets, and wanted to float a few
> thoughts here because I have various ideas about how to implement what I
> want to do, and I think one of them could be done by changing Python's set
> type in useful and backward compatible way.
>
> Apologies if this is discussed in the archives. I didn't see it.
>
> My original motivation is a need to work efficiently with very large sets,
> let's say 10M elements. In some cases, you can operate on a set without
> needing to explicitly represent all its elements. For example, if my set is
> the Universal (U) set, then X in U, should return True. If my set is U and
> I union it with any other set, it is unchanged, etc. By also maintaining a
> finite set of things that are _not_ in my large set, I can do other
> operations efficiently. E.g., calling U.remove(X) would _add_ X to the set
> of elements that were known not to be in the big set that you were trying
> to represent efficiently. In most cases, such a set class would simply do
> the opposite/inverse operation on its finite exclusion set.
>
> While it's convenient to warm up by talk about the infinite Universal set,
> we could just as easily be talking about arbitrarily large finite sets.
> There are some examples below.
>
> Another way to discuss the same thing is to talk about set complements. It
> would be nice to be able to complement (or invert) a set. Once you did so,
> you might have an infinite set on your hands, but as the last paragraph
> argues, you can still operate on an infinite set efficiently. Naturally,
> you can't fully enumerate it, or take its size.
>
> I think that giving Python sets the ability to handle complementarity would
> have some nice uses - and that these go well beyond the particular thing
> that I need right now.
>
> For example, suppose you want to represent the set of all floating point
> numbers. You just create an empty set and complement it. Then you can test
> it as usual, and begin to exclude things from it. Or if you have a system
> with 10M objects and you need to support operations on these objects via a
> query language that lets the user type "not X" where X is a set of objects,
> there is a natural and efficient way (constant time) to execute the query -
> you just mark the set as being inverted (complemented).
>
> If I were going to implement the above by changing Python's built in set
> type, here's what I think I'd do:
>
> Add four optional keyword arguments to the set constructor:
>
>   - invert=False
>   - universalSet=None
>   - universalSetSize=None
>   - universeExclusionFunc=None
>
> invert is just a flag, indicating the sense of the set.
>
> If inverse is False:
>
>   the set behaves exactly as a normal Python set and the other three
>   new arguments are ignored.
>
> If inverse is True:
>
>   universalSet represents the universal set. If it is None, then the
>   universal set is infinite. Otherwise, universalSet is an iterable. The
>   implementation should not call this iterable unless it's unavoidable, on
>   the presumption that if the programmer wanted to operate directly on the
>   contents of this iterable, they'd be doing it in a conventional fashion
>   (e.g., with a normal set). The universalSetSize, used for len()
>   calculations, is the number of elements in universalSet, if known &
>   if finite.
>
>   The universeExclusionFunc can be called with a single element argument to
>   see if the element should be considered excluded from the universalSet.
>   This may seem like a weird idea, but it's the key to flexibility and
>   efficiency. In an invert=True set, the code would use the normal set
>   content as a mutable set of objects that are not in the universe, as
>   described above, but would, in addition, use the universeExclusionFunc
>   (if defined) to identify elements not in the set (because they are not in
>   the universe), and thus avoid the use of the expensive (or infinite)
>   universalSet.
>
> Note that an inverted (or normal) set can be inverted by simply setting
> invert=False, so this requires a new invert() method (which could be
> triggered by the use of something like 'not' or '!' or '~'). In this case,
> an inverted set becomes a normal Python set. The elements represented by
> universeExclusionFunc remain invalid - they are simply not in the universe
> (and, if deemed sensible, this could be sanity checked in add()).
>
> If it's not clear, when a set is inverted, any iterable given to __init__,
> (i.e., the iterable argument in the normal case of constructing a Python
> set), is just treated as usual (but in this case will be taken to
> represent things not in the set initially).
>
>
> Here are some examples of usage:
>
> 1. Suppose you want to work with the set of integers [0, 100000000) and
>    that initially your set is all such integers. You could create this set
>    via:
>
>        S = set(invert=True,
>                universalSet=xrange(100000000),
>                universalSetSize=100000000,
>                universeExclusionFunc=(lambda x: x >= 100000000)
>
>    This has the intended effect, is efficient, and no-one need call the
>    iterator. You can (I think) do everything you could do with a normal set
>    (including inverting it). If the user happens to want to list all the
>    elements in the set, that's their business and they can do it, except in
>    the case where universalSet=None, in which case I would raise an
>    exception.
>
> 2. Suppose you wanted to represent the set of all floating point numbers
>    > 0.0 that were not integers. You'd do this via:
>
>        def excludeFunc(f):
>            try:
>                f = float(f)
>            except ValueError:
>                return True
>            if f <= 0.0:
>                return True
>            if f - math.floor(f) == 0.0:
>                return True
>            return False
>
>        S = set(invert=True, universeExclusionFunc=excludeFunc)
>
> 3. Let's say you wanted to work with the set of all IP addresses, as
>    represented by lists of 4 integers, but that you want to exclude
>    127.0.0.1, and all the reserved RFC1918 address blocks
>
>        def excludeFunc(ip):
>            # Insert code to check type and length.
>            if (ip == [127, 0, 0, 1]):
>                return True
>            if (ip[0] == 10 or
>                (172, 16, 0, 0) <= ip <= (172, 31, 255, 255) or
>                (191, 168, 0, 0) <= ip <= (192, 168, 255, 255)):
>                return True
>            # Insert code to check all bytes in [0..255], return True if not.
>            return False
>
>        S = set(invert=True, universeExclusionFunc=excludeFunc)
>
> It's pretty easy to think of other examples of wanting to efficiently use
> big sets, or to use small sets and then needing to complement them and
> operate efficiently on their large complements.
>
> One reaction to these examples might be that it's no big deal to write your
> own code to, say, manage sets of IP addresses. I think that's not so easy
> to do efficiently if the set is massive (and you'd end up keeping track of
> the complement or using offline storage). Doing it as suggested above makes
> it simple, and once you call the constructor, you just do normal Python set
> operations.
>
> This suggestion goes a little way towards giving "continuous" sets. One way
> to categorize set support is:
>
>   A. Only support finite sets.
>
>   B. Provide category A but also support infinite sets where at least
>      one of the set and its complement is finite.
>
>   C. Provide category B but also allow for the case where a set and its
>      complement may both be infinite.
>
> Thus I am proposing something that would take set support in Python from
> category A to category B.  Implementing B efficiently is provably possible
> since all operations are done on either a set or its complement, whichever
> is known to be finite, and Python already handles those operations (via its
> category A support).
>
>
> The downsides that I see at the moment are:
>
> 1. Two new exceptions that can arise. These only happen on sets that are in
>    the inverted state, and only when the universal set is infinite:
>
>      - You try to enumerate an infinite set.
>      - You call len() on an infinite set.
>
> 2. Operations on regular sets get slightly slower because virtually all set
>    operations would need to test the invert flag. So that's an extra
>    Boolean test on every set operation.
>
> I don't think the implementation is too hard - the Python sets.py code is
> very clean - though there may be gotchas.
>
> Anyway, I wanted to see what people think of this before I go ahead and
> solve my actual problem. If the above is considered potentially worthwhile
> to Python, I can work on adapting sets.py. If not, I'll do the faster and
> simpler thing and write my own class.
>
> Regards,
> Terry
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)