[Python-Dev] PEP 218 (sets); moving set.py to Lib

Guido van Rossum guido@python.org
Fri, 16 Aug 2002 17:54:53 -0400

(I know there's a sets list at SF, but SF refuses mail from this
machine, and that list is essentially dead.)

In CVS at python/nondist/sandbox/sets/ you'll find a module set.py
(and its unit tests) that implement a versatile Set class roughly
according to PEP 218.  This code has three authors: Greg V. Wilson
wrote the first version; Alex Martelli changed the strategy around
(im)mutability and removed inheritance from dict in favor of having a
dict attribute; and I cleaned it up and changed some implementation
approaches to what I think are faster algorithms (at the cost of
somewhat more verbose code), also changing the API a little bit.

I'd like to do a little more cleanup and then move it into the
standard library.  The API is roughly that of PEP 218, although it
supports several methods and operations that aren't listed in the PEP,
such as comparisons and subset/superset tests.

I'm not sure whether I want to go fix the PEP to match this
implementation, or whether to skip that step and simply document the
module in the standard library docs (that has to be done anyway).

The biggest difference with the PEP is a different approach to the
(im)mutability issue.  This problem is that on the one hand, sets are
containers that hold a potentially large number of elements, and as
such they ought to be mutable: you should be able to start with an
empty set and then add elements to it one at a time.  On the other
hand, the only efficient implementation strategy is to represent a set
internally as the keys of a dictionary whose values are ignored.  This
way insertion and removal can be done very efficiently, at the cost of
a restriction: set elements must be immutable.  The implication is
that you can't have sets of sets -- but occasionally those are handy!

The PEP proposed a strategy based on "freezing" a set as soon as it is
incorporated into another set; but in practice this didn't work out
very well, in part because the test 's1 in s2' would cause s1 to be
frozen: its hash value has to be computed to implement the membership
test, and computing the hash value signals freezing the set.  This
caused too many surprises.

Alex Martelli implemented a different strategy (see
python.org/sf/580995): there are two set classes, (mutable) Set and
ImmutableSet.  They derive from a common base class (BaseSet, an
abstract class) that implements most set operations.  ImmutableSet
adds __hash__, and Set adds mutating operations like update(),
clear(), in-place union, and so on.  The ImmutableSet constructor
accepts a sequence or another set as its argument.  While this is an
easy enough way to construct an immutable set from a mutable set, for
added convenience, if a mutable set is added to another set (except in
the constructor), an immutable shallow copy is automatically made
under the covers and added instead.  Also, when 's1 in s2' is
requested and s1 is a mutable set, it is wrapped in a
temporarily-immutable wrapper class (an internal class that is not
exposed) which compares equal to s1 and has a hash value equal to the
hash value that would be computed for an immutable copy of s1.  The
temporary wrapper does not make a copy; none of the code here is
thread-safe anyway, so if multiple threads are going to share a
mutable set, they will have to use their own locks.

I deviated from the original API in a few places:

- I renamed the method is_subset_of() to issubset(), since I don't
  like underscores in method names, and think leaving the 'of' off
  will cause ambiguity; I also added issuperset().

- I renamed intersect() to intersection() and sym_difference() to
  symmetric_difference(); ditto for the corresponding _update()

- Alex's code explicitly disallowed in-place operators (e.g. __ior__
  meaning in-place union) for immutable sets.  I decided against this;
  instead, when s1 is a variable referencing an immutable set, the
  statement 's1 |= s2' will compute s1|s2 as an immutable set and
  store that in the variable s1.  On the other hand, if s1 references
  a mutable set, 's1 |= s2' will update the object refereced by s1
  in-place.  This is similar to the behavior of lists and tuples under
  the += operator.

- The left operand of a union, intersection or difference operation
  will decide the type of the result.  This means that generally when
  you've got immutable sets, you'll produce more immutable sets; and
  when you've got mutable sets, you'll produce mutable sets.  When you
  mix the two kinds, the left argument of binary operations wins.

Some minor open questions:

- The set constructors have an optional second argument, sort_repr,
  defaulting to False, which decides whether the elements are sorted
  when str() or repr() is taken.  I'm not sure if there would be
  negative consequences of removing this argument and always sorting
  the string representation.  It would simplify the interface and make
  the output nicer (since usually you don't think about setting this
  argument until it's too late).  I don't think that the extra cost of
  sorting the list of elements will be a burden in practice, since
  normally one would only print small sets (if a set has thousands of
  elements, printing it isn't very user friendly).

- I'd like to change the module name from set.py to sets.py.  Somehow
  it makes more sense to write

    from sets import Set, ImmutableSet


    from set import Set, ImmutableSet

- I'm aware that this set implementation is not the be-all and end-all
  of sets.  I've seen a set implementation written in C, but I was not
  very impressed -- it was a massive copy-paste-edit job done on the
  dict implementation, and we don't need such code duplication.  But
  eventually there may be a C implementation which will change some
  implementation details.

- Like other concrete types in Python, these Set classes aren't really
  designed to mix well with other set implementations.  For example,
  sets of small cardinal numbers may be represented efficiently by
  long ints, but the union method currently requires that the other
  argument uses the same implementation.  I'm not sure whether this
  will eventually require changing.

Any comments?  Or shall I just check this in?

--Guido van Rossum (home page: http://www.python.org/~guido/)