[Python-Dev] Set data type
Ka-Ping Yee
ping@lfw.org
Thu, 3 Feb 2000 19:01:54 -0600 (EST)
Okay, so i was going to wait until i had an implementation before
bringing this up here, but it seems the other discussion has come
round to sets already so this might be a good time.
After Eric asked about sets at dinner (e.g. having 'x in dict'
do 'dict.has_key(x)') i wondered about what it would take to
support sets. Guido's response to 'x in dict' was that it was
inconsistent since 'x in list' searches the values of the list,
while the proposed 'x in dict' would search the keys.
A friend (Mark Miller, designer of E -- see www.erights.org) and
i had chatted about this before, and one idea that was suggested
was to have the mapping map each member of the set to itself.
Then 'in' would give the expected behaviour. E also manages to
overload | and & so that they do useful things with mappings-used-
as-dictionaries as well as union and intersection for mappings-
used-as-sets. I expect, however, that & will be rarely used on
mappings-used-as-dictionaries; more common is the desire for |,
which Python has solved nicely with dict.update() (also better
because it is clearly non-commutative).
So i think the clearest thing to do is to make sets a separate
built-in type. Here's the interface i was thinking of:
>>> s = {1, 5, 7} # no colons means a set
>>> type(s)
<type 'set'>
>>> s # sets are printed in hashed order
{7, 1, 5}
>>> {1: 4, 5, 7} # mixing pairs and singletons is an error
File "<stdin>", line 1
{1: 4, 5, 7}
^
SyntaxError: invalid syntax
>>> {3, 4: 5, 7} # mixing pairs and singletons is an error
File "<stdin>", line 1
{3, 4: 5, 7}
^
SyntaxError: invalid syntax
>>> 5 in s # membership
1
>>> 6 in s
0
>>> t = s # sets are mutable, so this is an alias
>>> t.append(2) # like list.append(), accepts one new member
>>> s
{7, 5, 2, 1}
>>> u = s.copy() # make a copy
>>> u.append(9)
>>> u.append([3]) # set members must be hashable
Traceback (innermost last):
File "<stdin>", line 1, in ?
TypeError: unhashable type
>>> u
{7, 5, 9, 2, 1}
>>> u.extend(range(5)) # like list.extend(), accepts a sequence or set
>>> u
{9, 7, 5, 4, 3, 2, 1, 0}
>>> u.remove(7) # like list.remove()
>>> s
{7, 5, 2, 1}
>>> s | {3, 5} # union
{7, 5, 3, 2, 1}
>>> s & {1, 2, 3} # intersection
{2, 1}
>>> s <= u # subset
1
>>> s >= s
1
>>> u <= s
0
>>> for i in s: print i # iterate in hashed order
7
5
2
1
>>> l = list(s)
>>> l.sort()
>>> for i in l: print i # iterate in sorted order
1
2
5
7
>>> s.clear() # for completeness, i suppose
>>> s
>>> s[3] # probably best not to permit subscripting
Traceback (innermost last):
File "<stdin>", line 1, in ?
TypeError: unsubscriptable object
>>> s[5:7]
Traceback (innermost last):
File "<stdin>", line 1, in ?
TypeError: unsliceable object
In short, sets get
append, extend, remove # from lists
clear, copy # from dictionaries
and the operators
&, |, in, <=, >=, <, >, ==
They could be implemented like dicts with just keys and no values.
Two open issues:
1. How to spell the empty set?
Possibilities include {,} or {-} or a constant in __builtins__.
__builtins__ would probably get a conversion function 'set'
(soon to be a type-object/constructor like int, list, etc.),
so you could just say 'set()'.
2. How do sets sort?
We could insert them somewhere in the hierarchy of types
next to lists and tuples. Currently () > [] > {}, so we
could have () > [] > {,} > {} for example.
More tricky is the question of what to do with the partial
ordering created by >, >=, etc.
a. The answer is the same as whatever answer we get when
Python supports rich comparisons and instances can have
partial orderings too.
b. We can make >, >= behave like dictionary >, >= so that
there is a defined sort order, and put the subset/superset
functionality in a method.
Option a. would be nicest since there are other good uses
for rich comparisons, and in b. we get comparison operators
that don't really do anything useful.
(Side note: Do we achieve a. just by divorcing __cmp__ from
>, >=, etc.? sorting would use __cmp__, while > and < would
look for __gt__, __lt__, etc., and if not found, fall back on
__cmp__. __cmp__ contracts to be stable [a.__cmp__(b) always
has the same outcome for a given a and b], reflexive
[if a == b, then a.__cmp__(b) == 0], consistent
[if a.__cmp__(c) == 0 and b.__cmp__(d) == 0, then
a.__cmp__(b) == c.__cmp__(d)], symmetric
[a.__cmp__(b) + b.__cmp__(a) == 0 for all a, b], and
transitive [if a.__cmp__(b) > 0 and b.__cmp__(c) > 0, then
a.__cmp__(c) > 0; if a.__cmp__(b) == 0 and b.__cmp__(c) == 0,
then a.__cmp__(c) == 0], but __gt__, __lt__ need make no
such promises. Side side note: which of these promises, if
any, should we ask __gt__ et al to make? Stability, at least?)
((Side side side note: in E, Mark Miller also ran into the
problem of spelling different kinds of "equals"es. For
object identity we have "is", for content comparison we
have "==". If we need a new operator for magnitude equality
i suggest "<=>", as used in E.))
Well, it may be a lot of words, but it's not much good unless you
are going to use it in practice. I think i would have good use for
it, but i would like to hear your opinions. Would you use such a
thing?
-- ?!ng