[Python-Dev] PEP 218 (sets); moving set.py to Lib

Tim Peters tim_one@email.msn.com
Wed, 21 Aug 2002 00:22:20 -0400


[Guido]
> ...
> I also believe that kjbuckets maintains its data in a sorted order,
> which is unnecessary for sets -- a hash table is much faster.

It's Zope's BTrees that maintain sorted order.  kjbuckets consists of 3
variations ("set", "dict", "graph") of a rather complicated hash table,
driven by the need for the graph flavor to associate multiple values with a
single key.  The kj hash table slots each contain a small contiguous vector,
and then it starts to get complicated <wink>.

> After all we use a very fast hash table implementation to represent sets.
> (The only improvement would be that we could save maybe 4 bytes per
> hash table entry because we don't need a value pointer.)

The set flavor of kjbucket does skip the value pointer.  I suppose it could
have supported multisets too ("bags" -- duplicate keys allowed), but it
doesn't (they can be faked via a kjgraph using dummy values, though -- much
like faking a set in Python via a dict with dummy values!).

> ...
> The sets module does not implement analogous operations directly in
> Python.  Almost all the implementation work is done by the dict
> implementation.

kjbuckets has a rich set of operations coded in C, including intersection,
graph composition, and transitive closure.  If the sets module were to
sprout those operations too, they would have to look like the sets
intersection implementation (nests of Python loops and ifs), as Python dicts
don't support primitives able to polish off large chunks of the necessary
work at C speed.  Aaron's claim that kjbuckets can do those kinds of things
10x faster than Python code seems quite safe <wink>.

> ...
> kjbuckets may be nice, but adding it to the core would add a serious
> new maintenance burden for the core developers.  I don't see anyone
> raising their hand to help out here.

Not me.  It's about 3500 lines of hairy code, and you wouldn't like some of
the interface decisions it made.  For example,

   a_kjset[3] = 5

adds 3 to the set and ignores 5.  Even Aaron blushes at that one <wink>, but
some of the others would be much harder to sort out.  It would be a major
undertaking to do so.