[Python-Dev] Set data type

Thu, 3 Feb 2000 19:01:54 -0600 (EST)

Okay, so i was going to wait until i had an implementation before
bringing this up here, but it seems the other discussion has come
round to sets already so this might be a good time.

After Eric asked about sets at dinner (e.g. having 'x in dict'
do 'dict.has_key(x)') i wondered about what it would take to
support sets.  Guido's response to 'x in dict' was that it was
inconsistent since 'x in list' searches the values of the list,
while the proposed 'x in dict' would search the keys.

A friend (Mark Miller, designer of E -- see www.erights.org) and
i had chatted about this before, and one idea that was suggested
was to have the mapping map each member of the set to itself.
Then 'in' would give the expected behaviour.  E also manages to
overload | and & so that they do useful things with mappings-used-
as-dictionaries as well as union and intersection for mappings-
used-as-sets.  I expect, however, that & will be rarely used on
mappings-used-as-dictionaries; more common is the desire for |,
which Python has solved nicely with dict.update() (also better
because it is clearly non-commutative).

So i think the clearest thing to do is to make sets a separate
built-in type.  Here's the interface i was thinking of:

    >>> s = {1, 5, 7}           # no colons means a set

    >>> type(s)
    <type 'set'>

    >>> s                       # sets are printed in hashed order
    {7, 1, 5}

    >>> {1: 4, 5, 7}            # mixing pairs and singletons is an error
      File "<stdin>", line 1
        {1: 4, 5, 7}
                ^
    SyntaxError: invalid syntax

    >>> {3, 4: 5, 7}            # mixing pairs and singletons is an error
      File "<stdin>", line 1
        {3, 4: 5, 7}
             ^    
    SyntaxError: invalid syntax

    >>> 5 in s                  # membership
    1

    >>> 6 in s
    0

    >>> t = s                   # sets are mutable, so this is an alias

    >>> t.append(2)             # like list.append(), accepts one new member

    >>> s
    {7, 5, 2, 1}

    >>> u = s.copy()            # make a copy

    >>> u.append(9)

    >>> u.append([3])           # set members must be hashable
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: unhashable type

    >>> u
    {7, 5, 9, 2, 1}

    >>> u.extend(range(5))      # like list.extend(), accepts a sequence or set

    >>> u
    {9, 7, 5, 4, 3, 2, 1, 0}

    >>> u.remove(7)             # like list.remove()

    >>> s
    {7, 5, 2, 1}

    >>> s | {3, 5}              # union
    {7, 5, 3, 2, 1}

    >>> s & {1, 2, 3}           # intersection
    {2, 1}

    >>> s <= u                  # subset
    1

    >>> s >= s
    1

    >>> u <= s
    0

    >>> for i in s: print i     # iterate in hashed order
    7
    5
    2
    1

    >>> l = list(s)
    >>> l.sort()
    >>> for i in l: print i     # iterate in sorted order
    1
    2
    5
    7

    >>> s.clear()               # for completeness, i suppose
    >>> s

    >>> s[3]                    # probably best not to permit subscripting
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: unsubscriptable object

    >>> s[5:7]
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: unsliceable object

In short, sets get

    append, extend, remove      # from lists
    clear, copy                 # from dictionaries

and the operators

    &, |, in, <=, >=, <, >, ==

They could be implemented like dicts with just keys and no values.

Two open issues:

    1.  How to spell the empty set?

        Possibilities include {,} or {-} or a constant in __builtins__.
        __builtins__ would probably get a conversion function 'set'
        (soon to be a type-object/constructor like int, list, etc.),
        so you could just say 'set()'.

    2.  How do sets sort?

        We could insert them somewhere in the hierarchy of types
        next to lists and tuples.  Currently () > [] > {}, so we
        could have () > [] > {,} > {} for example.

        More tricky is the question of what to do with the partial
        ordering created by >, >=, etc.

        a.  The answer is the same as whatever answer we get when
            Python supports rich comparisons and instances can have
            partial orderings too.

        b.  We can make >, >= behave like dictionary >, >= so that
            there is a defined sort order, and put the subset/superset
            functionality in a method.

        Option a. would be nicest since there are other good uses
        for rich comparisons, and in b. we get comparison operators
        that don't really do anything useful.

        (Side note: Do we achieve a. just by divorcing __cmp__ from
        >, >=, etc.?  sorting would use __cmp__, while > and < would
        look for __gt__, __lt__, etc., and if not found, fall back on
        __cmp__.  __cmp__ contracts to be stable [a.__cmp__(b) always
        has the same outcome for a given a and b], reflexive
        [if a == b, then a.__cmp__(b) == 0], consistent
        [if a.__cmp__(c) == 0 and b.__cmp__(d) == 0, then
        a.__cmp__(b) == c.__cmp__(d)], symmetric
        [a.__cmp__(b) + b.__cmp__(a) == 0 for all a, b], and
        transitive [if a.__cmp__(b) > 0 and b.__cmp__(c) > 0, then
        a.__cmp__(c) > 0; if a.__cmp__(b) == 0 and b.__cmp__(c) == 0,
        then a.__cmp__(c) == 0], but __gt__, __lt__ need make no
        such promises.  Side side note: which of these promises, if
        any, should we ask __gt__ et al to make?  Stability, at least?)

        ((Side side side note: in E, Mark Miller also ran into the
        problem of spelling different kinds of "equals"es.  For
        object identity we have "is", for content comparison we
        have "==".  If we need a new operator for magnitude equality
        i suggest "<=>", as used in E.))

Well, it may be a lot of words, but it's not much good unless you
are going to use it in practice.  I think i would have good use for
it, but i would like to hear your opinions.  Would you use such a
thing? 

-- ?!ng