[Python-Dev] PEP-0218

Tim Peters tim_one@email.msn.com
Wed, 29 Nov 2000 17:10:34 -0500

[Moshe Zadka]
> ...
> Let me note that almost everything Greg Wilson wants to do can be done
> via a Python class implementing a set using a dictionary mapping to None.
> Almost?
> * No builitin syntax: import Set;Set(1,2,3) instead of {1,2,3}
> * Convertors: if we want list/tuple to have a semblance of efficiency,
>   we'll need to cache the element list as a list when accessed by
>   index.

Like everyone else, I have a Set class too.  Its __getitem__ is

    # so "for x in set:" works
    # see also set.tolist()
    def __getitem__(self, i):
        if i == 0:
            self.keys = self.d.keys()
        return self.keys[i]

Of course I don't *want* set[i] to mean anything at all; this is just a hack
to work with the current iteration protocol (i.e., Set.__getitem__ makes no
sense except in that it must be implemented so that "for x in set" works
today).  The one thing I could never get used to in Aaron's kjbuckets
implementation of sets is that for/in did not work as expected.

> * Two different constructors: set() for building from sequence, Set()
>   for building from elements. Might be confusing.

My Set.__init__:

    def __init__(self, seq=[]):
        self.d = d = {}
        for x in seq:
            d[x] = 1

That is, the constructor takes a sequence.  Same thing in Icon, which added
a set type late in the game.  No problem in practice.

> * Only possible elements of a set are immutables. OTOH, I'm not sure
>   how Greg intends to implement his sets if these sets are allowed
>   to contain mutable elements.

This needs to be addressed, but several approaches are possible.  For
example, my Set class allows for Sets of Sets (& so on), because I needed
them, and Sets are certainly mutable.

Now immutable objects are required because dicts require immutable objects
as keys.  However, a class (like Set!) can fool dicts, by supplying a
__hash__ and a __cmp__ "that work".  Dicts can be compared directly, so
__cmp__ is no problem.  The difficulty comes with the __hash__.  My solution
was to "freeze" a Set the instant __hash__ was invoked:  this allows
mutation up until the point a hash is captured, and disallows it thereafter.

    def __hash__(self):
        if self.frozen:
            hashcode = self.hashcode
            # The hash code must not depend on the order of the
            # keys.
            self.frozen = 1
            hashcode = 0
            _hash = hash
            for x in self.d.keys():
                hashcode = hashcode ^ _hash(x)
            self.hashcode = hashcode
        return hashcode

and all the mutating methods check self.frozen, raising ValueError if the
target set is currently frozen.  For example,

    # updating union
    def unionu(self, other):
        if self.frozen:
            raise ValueError(_frozenmsg)

Oddly enough, I almost never got a "frozen" error in practice; it seems that
by the time I was ready to build a set of sets, the constituent sets were at
their final values; and the few times I did get a frozen error, it was
actually a logic bug (I was mutating the set in error).  It's hard to guess
whether that's unique to me <0.5 wink>.

Icon has a different approach entirely:  an Icon set acts like a Python dict
*would* act if you inserted the id()s of the keys rather than the keys
themselves; that is, Icon's sets (and dicts) are based on object identity,
not object value.  Since object identity is invariant regardless of whether
the object is mutable, a hash table implementation has no difficulties.

I find identity semantics for sets less useful than value semantics, though.

[Eric Raymond]
> ...
> Within a few hours after I see [rich comparisons] in the beta I'll
> have a very rich Set class for the standard library.  It includes
> all the standard boolean operations, symmetric difference, powerset,
> and useful overloads for most of the standard operators.  As the
> header comment says:
> # A set-algebra module for Python
> #
> # The functions work on any sequence type and return lists.
> # The set methods can take a set or any sequence type as an argument.
> ...

I'm afraid this is a good example of why a set type is unlikely to make it
into the std distribution:  whenever a data structure is added to Python,
the agitation for variations is endless.  For example, returning a list
makes your flavor of sets unsuitable (because inefficient) for many
applications (e.g., there's no efficient way to test a list for element
membership).  "Almost all" set implementations I've seen for Python use
dicts for that reason.

Another set of complaints will be due to some people wanting functional
versions of the set operations, while others want mutating versions.  My own
Set class supplies both, because both are really needed in practice; e.g.,
the updating union was shown above; the functional union builds on that:

    # functional union
    def union(self, other):
        answer = self.__copy__()
        return answer

(subtlety:  the functional union has no problem with "frozen" sets;
Set.__copy__ always returns a melted set).

Since it's Greg's PEP, it's up to him to make everyone happy <wink> -- but
since there's so much variation possible, I'm not sure Guido will ever bless
any Set class as "the" Set class.

BTW, one lesson to take from SETL:  a vital set operation in practice is a
mutating "pick a 'random' element E from the set, remove it from the set,
and return it".  An enormous number of set algorithms are of the form

    while not S.empty():
        pick some element from S
        deal with it, possibly mutating S

This operation can't be done efficiently in Python code if the set is
represented by a dict (the best you can do is materialize the full list of
keys first, and pick one of those).  That means my Set class often takes
quadratic time for what *should* be linear-time algorithms.

    in-python-today-ly y'rs  - tim