[Python-3000] Performance Notes - new hash algorithm
tim.peters at gmail.com
Sun Sep 9 03:48:56 CEST 2007
> I'd like Tim Peters's input on this before we change it. I seem to
> recall that there's an aspect of non-randomness to the existing hash
> function that's important when you hash many closely related strings,
> e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been
> so long that I may misremember this, and perhaps it was related to the
> dictionary implementation.
Not "important" so much as "possibly helpful" ;-) This is explained
in comments in dictobject.c. As it notes there, hashing the strings
"namea", "nameb", "namec", and "named" currently produces (on a
sizeof(long) == 4 box):
That the hash codes are very close but not identical is "a feature",
since the dict implementation only looks at the last k bits (for
various more-or-less small values of k): this gives "better than
random" dict collision behavior for input strings very close together.
The proposed hash produces instead:
Obviously much closer to "random" behavior, but that's not necessarily
a good thing for dicts.
Python's current string hash is very similar to (but developed
independently of) the FNV hash.
Things to look out for in the proposed hash:
- There's no explanation of where all the magic shift
constants and shift patterns come from.
- It relies on potentially unaligned access to read 16-bit chunks
at a time. This means #ifdef cruft to "turn that off" on platforms
that don't support unaligned access, and means timing will vary
on platforms that do (depending on whether input strings do or
do not /happen/ to be 2-byte aligned).
- It only delivers a 32-bit hash. But at least before Py3K, Python's
hash codes are the native C "long" (32 or 64 bits on all current
boxes). The current hash code couldn't care less what sizeof(long)
is. It's not clear how to modify the proposed hash to deliver
64-bit hash codes, in large part because of the first point above.
- It needs another conditional "at the bottom" to avoid returning
a hash code of -1. That will affect timing too.
>>> Python3k original hash: real 0m2.210s
>>> new hash: real 0m1.842s
That's actually a surprisingly small difference, given the much larger
timing differences displayed on:
compared to the FNV hash. OTOH, the figures there only looked at
256-byte strings, which is much larger (IMO) "than average" for
Better tests would time building and accessing string-keyed dicts with
reasonable and unreasonable ;-) keys.
More information about the Python-3000