[Python-Dev] I think my set module is ready for prime time; comments?

Tim Peters tim.one@home.com
Wed, 24 Jan 2001 03:08:50 -0500


[Neil Schemenauer]
> I think this argues that if sets are added to the core they
> should be implemented as an extension type with the speed of
> dictionaries and the memory usage of lists.  Basicly, we would
> use the implementation of PyDict but drop the values.

They'll be slower than dicts and take more memory than lists then.  WRT
memory, dicts cache the hash code with each entry for speed (so double the
memory of a list even without the value field), and are never more than 2/3
full anyway.  The dict implementation also gets low-level speed benefits out
of using both the key and value fields to characterize the nature of a slot
(the key field is NULL iff the slot is virgin; the value field is NULL iff
the slot is available (virgin or dummy)).

Dummy slots can be avoided (and so also the need for runtime code to
distinguish them from active slots) by using a hash table of pointers to
linked lists-- or flex vectors, or linked lists of small vectors --instead,
and in most ways that leads to much simpler code (no more fiddling with
dummies, no more probe-sequence hassles, no more boosting the size before
the table is full).  But without fine control over the internals of malloc,
that takes even more memory in the end.

Interesting twist:  "a dict" *is* "a set", but a set of (key, value) pairs
further constrained so that no two elements have the same key.  So any set
implementation can be used as-is to implement a dict as a set of 2-tuples,
customizing the hash and "is equal" functions to look at just the tuples'
first elements.  The was the view taken by SETL in 1969, although their
"map" (dict) type was eventually optimized to get away from actually
constructing 2-tuples.  Indeed, SETL eventually grew an elaborate optional
type declaration sublanguage, allowing the user to influence many details of
its many internal set-storage schemes; e.g., from pg 399 of "Programming
With Sets:  An Introduction to SETL":

    For example, we can declare [I'm putting their keywords in UPPERCASE
    for, umm, clarity]

        successors: LOCAL MMAP(ELMT b) REMOTE SET(ELMT b);

    This declaration specifies that for each x in b the image set
    successors{x} is stored in the element block of x, and that this
    image set is always to be represented as a bit vector.  Similarly,
    the declaration

        successors: LOCAL MMAP(ELMT b) SPARSE SET(ELMT b);

    specifies that for each x in b the image set successors{x} is to
    be stored as a hash table containing pointers to elements of b.
    Note that the attribute LOCAL cannot be used for image sets of
    multivalued maps,  This follows from the remarks in section 10.4.3
    on the awkwardness of making local objects into subparts of
    composite objects.

Clear?  Snort.  Here are some citations lifted from the web for their
experience in trying to make these kinds of decisions by magic:

@article{dewar:79,
title="Programming by Refinement, as Exemplified by the {SETL}
Representation Sublanguage",
author="Robert B. K. Dewar and Arthur Grand and Ssu-Cheng Liu and
Jacob T. Schwartz and Edmond Schonberg",
journal=toplas,
year=1979,
month=jul,
volume=1,
number=1,
pages="27--49"
}

@article{schonberg:81,
title="An Automatic Technique for Selection of Data Structures in
{SETL} Programs",
author="Edmond Schonberg and Jacob T. Schwartz and Micha Sharir",
journal=toplas,
year=1981,
month=apr,
volume=3,
number=2,
pages="126--143"
}

@article{freudenberger:83,
title="Experience with the {SETL} Optimizer",
author="Stefan M. Freudenberger and Jacob T. Schwartz and Micha Sharir",
pages="26--45",
journal=toplas,
year=1983,
month=jan,
volume=5,
number=1
}

If someone wanted to take sets seriously today, a better approach would be
to define a minimal "set interface" ("abstract base class" in C++ terms),
then supply multiple implementations of that interface, letting the user
choose directly which implementation strategy they want for each of their
sets.  And people are doing just that in the C++ and Java worlds; e.g.,

http://developer.java.sun.com/developer/onlineTraining/
    collections/Collection.html#SetInterface

Curiously, the newer Java Collections Framework (covering multiple
implementations of list, set, and dict interfaces) gave up on thread-safety
by default, because it cost too much at runtime.  Just another thing to
argue about <wink>.

we're-not-exactly-pioneers-here-ly y'rs  - tim