[Python-3000] Performance Notes - new hash algorithm

Sun Sep 9 03:48:56 CEST 2007

[Guido]
> I'd like Tim Peters's input on this before we change it. I seem to
> recall that there's an aspect of non-randomness to the existing hash
> function that's important when you hash many closely related strings,
> e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been
> so long that I may misremember this, and perhaps it was related to the
> dictionary implementation.

Not "important" so much as "possibly helpful" ;-)  This is explained
in comments in dictobject.c.  As it notes there, hashing the strings
"namea", "nameb", "namec", and "named" currently produces (on a
sizeof(long) == 4 box):

-1658398457
-1658398460
-1658398459
-1658398462

That the hash codes are very close but not identical is "a feature",
since the dict implementation only looks at the last k bits (for
various more-or-less small values of k):  this gives "better than
random" dict collision behavior for input strings very close together.

The proposed hash produces instead:

 1892683363
 -970432008
   51735791
 1567337715

Obviously much closer to "random" behavior, but that's not necessarily
a good thing for dicts.

FYI, wrt

    http://www.azillionmonkeys.com/qed/hash.html

Python's current string hash is very similar to (but developed
independently of) the FNV hash.

Things to look out for in the proposed hash:

- There's no explanation of where all the magic shift
  constants and shift patterns come from.

- It relies on potentially unaligned access to read 16-bit chunks
  at a time.  This means #ifdef cruft to "turn that off" on platforms
  that don't support unaligned access, and means timing will vary
  on platforms that do (depending on whether input strings do or
  do not /happen/ to be 2-byte aligned).

- It only delivers a 32-bit hash.  But at least before Py3K, Python's
  hash codes are the native C "long" (32 or 64 bits on all current
  boxes).  The current hash code couldn't care less what sizeof(long)
  is.  It's not clear how to modify the proposed hash to deliver
  64-bit hash codes, in large part because of the first point above.

- It needs another conditional "at the bottom" to avoid returning
  a hash code of -1.  That will affect timing too.

>>> Python3k original hash: real    0m2.210s
>>>               new hash: real    0m1.842s

That's actually a surprisingly small difference, given the much larger
timing differences displayed on:

    http://www.azillionmonkeys.com/qed/hash.html

compared to the FNV hash.  OTOH, the figures there only looked at
256-byte strings, which is much larger (IMO) "than average" for
strings.

Better tests would time building and accessing string-keyed dicts with
reasonable and unreasonable ;-) keys.