[Python-Dev] Simple dicts

Tim Peters tim_one@email.msn.com
Tue, 20 May 2003 23:56:11 -0400


[damien morton]
> ...
> I need to address Tim's concerns about the poor hash function used for
> python integers, but I think this can be addressed easily enough.

Caution:  in the context of the current scheme, it's an excellent hash
function.  No hash function could be cheaper to compute, and in the common
case of dicts indexed by a contiguous range of integers, there are no
collisions at all.  Christian Tismer contributed the pathological case in
the dictobject.c comments, but I don't know that any such case has been seen
in real life; the current scheme does OK with it.

> I would welcome some guidance about what hash functions need to be
> addressed though. Is it just integers?

Because, e.g., 42 == 42.0 == 42L, and objects that compare equal must have
equal hashcodes, what we do for ints has to be duplicated for at least some
floats and longs too, and more generally for user-defined numeric types that
can call themselves equal to ints (for example, rationals).  For this reason
it may not be possible to change the hash code for integers (although it
would be possible to scramble the incoming hash code when mapping to a table
slot, which is effectively what the current scheme does but only when a
primary collision occurs).

The string hash code is regular for "consecutive" strings, too (like "ab1",
"ab2", "ab3", ...).

Instances of user-defined classes that don't define their own __hash__
effectively use the memory address as the hash code, and of course that's
also very regular across objects at times.

> (theres a great article on integer hash functions at
> www.cris.com/~Ttwang/tech/inthash.htm)

Cool!  I hadn't seen that before -- thanks for the link.