[Python-3000] Performance Notes - new hash algorithm
Jim Jewett
jimjjewett at gmail.com
Mon Sep 10 02:58:34 CEST 2007
On 9/8/07, Tim Peters <tim.peters at gmail.com> wrote:
> in comments in dictobject.c. As it notes there, hashing the strings
> "namea", "nameb", "namec", and "named" currently produces (on a
> sizeof(long) == 4 box):
> -1658398457
> -1658398460
> -1658398459
> -1658398462
> That the hash codes are very close but not identical is "a feature",
> since the dict implementation only looks at the last k bits (for
> various more-or-less small values of k): this gives "better than
> random" dict collision behavior for input strings very close together.
> The proposed hash produces instead:
> 1892683363
> -970432008
> 51735791
> 1567337715
>
> Obviously much closer to "random" behavior, but that's not necessarily
> a good thing for dicts.
To spell this out a bit more:
For cryptography, you want a "random" has function. For hash tables,
you just want one that spreads out your actual input. For strings,
this tends to mean short strings that look like possible variable
names. Because they often *are* variable names, they are sometimes
sequential, like var_a, var_b, var_c.
In the current CPython implementation, dicts start as a size-8
smalldict, and most dicts never grow beyond that. So the effective
hash is really (hash%8)
When adding four entries to an 8-slot table, a truly random hash would
have at least one collision (0/8 + 1/8 + 2/8 + 3/8 =) 3/4 of the
time. As expected, the proposed hash does have a collision for those
four values (the first and fourth).
The current hash function does not collide for strings that change
only one character to the "next" in ASCIIbetical order until the 9th
string -- at which time you need to resize anyhow.
For larger tables, having them close still doesn't cause a problem,
and may even be useful if you do decide to sort the keys. (CPython
lists use a "timsort" that takes advantage of partially sorted input,
so if the iterator gets them close to sorted initially, that can
help.)
-jJ
More information about the Python-3000
mailing list