[Python-Dev] Python hash function
Damien Morton
damien.morton@acm.org
Tue, 24 Jun 2003 17:03:41 -0400
Ive been doing some investigation of python hashtables and the hash
function used in Python.
It turns out, that when running pystone.py, as much time is spent in
string_hash() as is spent in lookdict_string(). If you look at
python's hash function, shown below, you can get an idea why - a
multiply is used for every character in the string.
The following function, called 2^22 times in a loop, times in at 530ms
for 7-character strings.
static long
string_hash1(register char *p, int size)
{
register int len;
register long x;
len = size;
x = *p << 7;
while (--len >= 0)
x = (100003*x) ^ *p++;
x ^= size;
return x;
}
Looking around the web I found the following hash function, which is
much faster, and seems to distribute better than the python hash
function.
The following function, called 2^22 times in a loop, times in at 160ms
for 7-character strings.
static long
string_hash3(register char *p, register int size)
{
register long x;
register char c;
x = 0xd2d84a61;
while (c = *p++)
x ^= ( (x<<7) + c + (x>>5) );
return x;
}
I also came up with a hash function of my own, which seems to
distribute near-perfectly (in the sense of being comparable to
assigning, as hashes, random numbers to unique strings) and be faster
yet (at least, on my P4 box).
The following function, called 2^22 times in a loop, times in at 120ms
for 7-character strings.
static long
string_hash6(register unsigned short *p, int size)
{
register unsigned long x = 0;
if (size==0) return 0;
len = (size+1) >> 1;
while (len--) {
x += (x>>14) + (*p++ * 0xd2d84a61);
}
x += (x>>14) + (size*0xd2d84a61);
return x;
}
Ive tested these functions by hashing a large set of strings generated
by instrumenting the python interpeter to emit a string every time a
string is added to a dictionary. These strings are hashed and thrown
into the buckets of various sized tables. I then calculate sigma
statistics (sum((observed-expected)^2)/(N-1)) for the number of items
in the buckets of those tables.
Here are the sum(sigma) results I am getting in my tests:
string_hash1 sum(sigma) ~= 30K
string_hash3 sum(sigma) ~= 23-24K
string_hash6 sum(sigma) ~= 23-34K
string_hashRandom sum(sigma) ~= 23-24K
Maddeningly, I havent been able to translate these near-perfect
distributions into improvements in pystones.
With the current hash function, pystone does an average of 1.2 probes in
the lookdict_string() function. With either of these "near-perfect" hash
functions, it does at best 1.3 probes (given exhaustive searches for
'good' shift paramaters, though not multipliers).
Im not sure what other tests to try out, or how I might go about
tweaking the hashes to perform better in pystone. Im hoping that someone
on python-dev has some experience in testing hash functions, and can
suggest some additional tests and/or tweaks.