[Python-Dev] I think my set module is ready for prime time; comments?

Tim Peters tim.one@home.com
Tue, 23 Jan 2001 20:51:22 -0500

[Christopher Petrilli]
> ....
> Unfortunately, for me, a Python implementation of Sets is only
> interesting academicaly.  Any time I've needed to work with them at a
> large scale, I've needed them *much* faster than Python could achieve
> without a C extension.

How do you know that?  I've used large sets in Python happily without
resorting to C or kjbuckets (which is really aiming at fast operations on
*graphs*, in which area it has no equal).

Everyone (except Eric <wink>) uses dicts to implement sets in Python, and
"most" set operations can work at full C speed then; e.g., assuming both
sets have N elements:

    membership testing
        O(1) -- it's just dict.has_key()
    element insertion
        O(1) -- dict[element] = 1
    element removal
        O(1) -- del dict[element]
        O(N), but at full C speed -- dict1.update(dict2)
        O(N), but at Python speed (the only 2.1 dog in the bunch!)
    choose some element and remove it
        took O(N) time and additional space in 2.0, but
        is O(1) in both since dict.pop() was introduced
        O(N), with O(N) additional space using dict.keys(),
        or O(1) additional space using dict.pop() repeatedly

What are you going to do in C that's faster than using a Python dict for
this purpose?  Most key set operations are straightforward Python dict
1-liners then, and Python dicts are very fast.  kjbuckets sets were slower
last time I timed them (several years ago, but Python dicts have gotten
faster since then while kjbuckets has been stagnant).

There's a long tradition in the Lisp world of using unordered lists to
represent sets (when the only tool you have is a hammer ... <0.5 wink>), but
it's been easy to do much better than that in Python almost since the start.
Even in the Python list world, enormous improvements for large sets can be
gotten by maintaining lists in sorted order (then most O(N) operations drop
to O(log2(N)), and O(N**2) to O(N)).  Curiously, though, in 2.1 we can still
use a dict-set for complex numbers, but no longer a sorted-list-set!
Requiring a total ordering can get in the way more than requiring
hashability (and vice versa -- that's a tough one).

measurement-is-the-measure-of-all-measurable-things-ly y'rs  - tim