[Python-Dev] Simple dicts

Tim Peters tim.one@comcast.net
Thu, 15 May 2003 10:33:56 -0400


[Brett C.]
> When I took data structures I was taught that chaining was actually the
> easiest way to do hash tables and they still had good performance
> compared to open addressing.  Because of this taught bias I always
> wondered why Python used open addressing; can someone tell me?

malloc overhead is a major drag; as my msg said, the feasibility depends on
pymalloc's very low overhead, and that dictentry nodes are 12 bytes apiece
even now; pymalloc didn't exist back then, and Python wasn't originally
micro-optimized as it is now (e.g., there wasn't the current zoo of
dedicated free-lists, or pre-allocation strategies, and dicts could *only*
be indexed by strings so collision only had to worry about the kinds of
problems a single known hash function was prone to).

> I am interested in seeing how this would pan out, but I am unfortunately
> going to be busy for the next three days (if anyone is going to be at E3
> Thursday or Friday for some odd reason let me know since I will be
> there).  If someone takes this up please let me know; I am interested in
> helping if I can.  Perhaps this should be a sandbox thing?

There's no rush <wink>, and I'd be surprised if Python adopted a different
scheme in the end anyway.  It's likely a just-for-fun to-see-what-happens
project.

Note one nasty class of problem:  in chaining *only* primary collisions
exist.  The current dict implementation turned the problem of
open-addressing's secondary collisions into "a feature", which will become
clear when you contemplate dictobject.c's

    [i << 16 for i in range(20000)]

example.  Python's original dict design didn't have a problem with this
because it used prime numbers for table sizes and reduced 32-bit hashes via
mod-by-a-prime.  The current scheme of just grabbing the last i bits is both
much faster and more delicate than that, and we really rely on the collision
resolution strategy now to protect against unlucky bit patterns.

Another way of looking at this is that the current scheme has a way to get
all 32 bits of a hash code to participate in which table slot gets selected;
mod-by-an-odd-prime also gets all bits into play; peel-off-the-last-i-bits
does not.