[Python-Dev] Python hash function

Tue, 24 Jun 2003 17:03:41 -0400

Ive been doing some investigation of python hashtables and the hash
function used in Python.

It turns out, that when running pystone.py, as much time is spent in
string_hash() as is spent in lookdict_string(). If you look at
python's hash function, shown below, you can get an idea why - a
multiply is used for every character in the string.

The following function, called 2^22 times in a loop, times in at 530ms
for 7-character strings.

static long
string_hash1(register char *p, int size)
{
	register int len;
	register long x;

	len = size;
	x = *p << 7;
	while (--len >= 0)
		x = (100003*x) ^ *p++;
	x ^= size;
	return x;
}

Looking around the web I found the following hash function, which is
much faster, and seems to distribute better than the python hash
function.

The following function, called 2^22 times in a loop, times in at 160ms
for 7-character strings.

static long
string_hash3(register char *p, register int size)
{
	register long x;
	register char c;

	x = 0xd2d84a61;
        while (c = *p++)
		x ^= ( (x<<7) + c + (x>>5) );
	return x;
}

I also came up with a hash function of my own, which seems to
distribute near-perfectly (in the sense of being comparable to
assigning, as hashes, random numbers to unique strings) and be faster
yet (at least, on my P4 box).

The following function, called 2^22 times in a loop, times in at 120ms
for 7-character strings.

static long
string_hash6(register unsigned short *p, int size)
{
	register unsigned long x = 0;

      if (size==0) return 0;

	len = (size+1) >> 1;
	while (len--) {
		x += (x>>14) + (*p++ * 0xd2d84a61);
		}
	x += (x>>14) + (size*0xd2d84a61);

	return x;
}

Ive tested these functions by hashing a large set of strings generated
by instrumenting the python interpeter to emit a string every time a
string is added to a dictionary. These strings are hashed and thrown
into the buckets of various sized tables. I then calculate sigma
statistics (sum((observed-expected)^2)/(N-1)) for the number of items
in the buckets of those tables.

Here are the sum(sigma) results I am getting in my tests:
string_hash1       sum(sigma) ~= 30K
string_hash3       sum(sigma) ~= 23-24K
string_hash6       sum(sigma) ~= 23-34K
string_hashRandom  sum(sigma) ~= 23-24K

Maddeningly, I havent been able to translate these near-perfect
distributions into improvements in pystones.

With the current hash function, pystone does an average of 1.2 probes in
the lookdict_string() function. With either of these "near-perfect" hash
functions, it does at best 1.3 probes (given exhaustive searches for
'good' shift paramaters, though not multipliers).

Im not sure what other tests to try out, or how I might go about
tweaking the hashes to perform better in pystone. Im hoping that someone
on python-dev has some experience in testing hash functions, and can
suggest some additional tests and/or tweaks.