[Python-Dev] Questions about sets.py

Guido van Rossum guido@python.org
Fri, 23 Aug 2002 17:21:16 -0400


> 1. BaseSet contains 3 blobs like this:
> 
>     def __or__(self, other):
>         """Return the union of two sets as a new set.
> 
>         (I.e. all elements that are in either set.)
>         """
>         if not isinstance(other, BaseSet):
>             return NotImplemented
>         result = self.__class__(self._data)
>         result._data.update(other._data)
>         return result
> 
>     def union(self, other):
>         """Return the union of two sets as a new set.
> 
>         (I.e. all elements that are in either set.)
>         """
>         return self | other
> 
> Is there a good reason not to write the latter as just
> 
>     union = __or__
> 
> ?

Yes.  It was written like that before!  But in order to be a good
citizen in the world of binary operators, __or__ should not raise
TypeError; it should return NotImplemented if the other argument is
not a set (since the other argument might implement __ror__ and know
how to or itself with a set).

But union(), which is a normal function, should not return
NotImplemented, which would confuse the user.  So they have to be
different.  I thought it would be best for union() to use the |
operator so that if the other argument implements __ror__, union()
will acquire this ability.

> 2. Is there a particular reason for not coding issuperset as
> 
>     def issuperset(self, other):
>         """Report whether this set contains another set."""
>         self._binary_sanity_check(other)
>         return other.issubset(self)
> 
> ?  Given that issubset exists, issuperset is of marginal value anyway.

The original code didn't have issuperset(), I added it for symmetry.
Spelling it out saves two calls: one to _binary_sanity_check(), one to
issubset().

> 3. BaseSet._update is a darned cute example of exploiting that the iterator
> returned by iter() isn't restartable!.  That isn't a question, it's
> just a giggle <wink>.

Yes, I like it. :-)

> 4. Who believes that the __le__, __lt__, __ge__, and __gt__ methods are a
> good idea?  If anything, I'd expect s <= t to mean "is subset", and
> "s < t" to mean "is proper subset".  Getting the lexicographic
> ordering of the underlying dicts instead doesn't seem to be of any
> use, except perhaps to prevent sorting lists of sets from blowing
> up.  Fine by me if that blows up, though.

Greg Wilson added these when he made the class inherit from dict,
presumably because without any further measures, sets would be
comparable to dicts using the default dict comparison.

That design choice was later undone by Alex (at my suggestion), but he
fixed the comparisons rather than removing them.

I think using <=, < etc. to spell issubset and isstrictsubset would be
great.

> 5. It's curious enough that we avoid dict.copy() in
> 
>     def copy(self):
>         """Return a shallow copy of a set."""
>         result = self.__class__([])
>         result._data.update(self._data)
>         return result
> 
> that if there's a reason to avoid it a comment would help.

Raymond asked me whether to use copy() or update().  I looked at the
code and found that they execute almost the same code, with about one
instruction more per item for update().  But for small sets, copy()
allocates a new dict (and the old one is thrown away).  I thought that
that might be more important than saving an instruction per item.

> 6. It seems that doing
> 
>     self.__class__([])
> 
> in various places instead of
> 
>     self.__class__()
> 
> wastes time without reason (it builds a unique empty list each time,
> the __init__ function then does a useless iteration dance over that,
> and finally the list object is torn apart again).  If the intent is
> to communicate that we're creating an empty set, IMO the latter
> spelling is even a bit clearer about that (I see "[]" and keep
> wondering what it's trying to accomplish).

That was when ImmutableSet() required an argument.  It can be left out
now.

> 7. union_update, intersection_update, symmetric_difference_update,
> and difference_update return self despite mutating in-place.  That
> makes them unique among mutating container methods (e.g.,
> list.append, list.insert, list.remove, dict.update, list.sort, ...,
> return None).  Is the inconsistency worth it?  Chaining mutating set
> operations isn't common, and with names like
> "symmetric_difference_update()" it's a challenge to fit more than
> one on a line anyway <wink>.
> 
> If it's thought that chaining mutating operations is somehow a good
> idea for sets when it's not for other containers, then we really
> have to be consistent about it in the sets module.  For example,
> then Set.add() should return self too; indeed,
> set.add(elt1).add(elt2) may even be pleasant at times.
> 
> Or if the point was merely to create "nice names" for __ior__ etc,
> then, e.g., the existing union_update should be renamed to __ior__,
> and union_update defined as
> 
>     def union_update(self, other):
>         """yadda"""
>         self |= other
> 
> and let it return None.  In a sense this is the opposite of question #1,
> where the extra code block *is* supplied but without an apparent need.

You guessed right.  That's the best solution IMO.

> 8. If there's something still valuable in _test(), I think it ought
> to be moved into test_sets.py.  "Self-testing modules" can be
> convenient when developing, but after modules are deployed in the
> std library the embedded tests are never run again (with the
> exception of module doctests, which can easily be run via a
> regrtest-flavor test_xyz test, and which are so invoked).

Please toss it.

--Guido van Rossum (home page: http://www.python.org/~guido/)