I've been doing some investigation of Python hash tables and the hash function used in Python. It turns out that when running pystone.py, as much time is spent in string_hash() as is spent in lookdict_string(). If you look at Python's hash function, shown below, you can get an idea why: a multiply is used for every character in the string. The following function, called 2^22 times in a loop, times in at 530ms for 7-character strings.

    static long string_hash1(register char *p, int size)
    {
        register int len;
        register long x;
        len = size;
        x = *p << 7;
        while (--len >= 0)
            x = (100003*x) ^ *p++;
        x ^= size;
        return x;
    }

Looking around the web I found the following hash function, which is much faster and seems to distribute better than the Python hash function. Called 2^22 times in a loop, it times in at 160ms for 7-character strings.

    static long string_hash3(register char *p, register int size)
    {
        register long x;
        register char c;
        x = 0xd2d84a61;
        while ((c = *p++))
            x ^= ((x<<7) + c + (x>>5));
        return x;
    }

I also came up with a hash function of my own, which seems to distribute near-perfectly (in the sense of being comparable to assigning, as hashes, random numbers to unique strings) and to be faster yet (at least on my P4 box). The following function, called 2^22 times in a loop, times in at 120ms for 7-character strings.

    static long string_hash6(register unsigned short *p, int size)
    {
        register unsigned long x = 0;
        register int len;
        if (size == 0)
            return 0;
        len = (size+1) >> 1;
        while (len--)
            x += (x>>14) + (*p++ * 0xd2d84a61);
        x += (x>>14) + (size*0xd2d84a61);
        return x;
    }

I've tested these functions by hashing a large set of strings generated by instrumenting the Python interpreter to emit a string every time a string is added to a dictionary. These strings are hashed and thrown into the buckets of various-sized tables. I then calculate sigma statistics (sum((observed-expected)^2)/(N-1)) for the number of items in the buckets of those tables.
Here are the sum(sigma) results I am getting in my tests:

    string_hash1      sum(sigma) ~= 30K
    string_hash3      sum(sigma) ~= 23-24K
    string_hash6      sum(sigma) ~= 23-34K
    string_hashRandom sum(sigma) ~= 23-24K

Maddeningly, I haven't been able to translate these near-perfect distributions into improvements in pystones. With the current hash function, pystone does an average of 1.2 probes in the lookdict_string() function. With either of these "near-perfect" hash functions, it does at best 1.3 probes (given exhaustive searches for 'good' shift parameters, though not multipliers). I'm not sure what other tests to try out, or how I might go about tweaking the hashes to perform better in pystone. I'm hoping that someone on python-dev has some experience in testing hash functions, and can suggest some additional tests and/or tweaks.
I do know about ob_shash, and you're probably right that most strings used by pystone are interned. Even so, running python/pystone through VTune showed that as much time is spent in string_hash() as in lookdict_string(). lookdict_string() is, of course, called many more times than string_hash(), but the total self time for each function is comparable. Not only does the current hash function use a multiply for every character, but it doesn't parallelise very well. The functions I have been playing with are about 4 times faster than the current function, while using only half the multiplies, adds and loop iterations.
-----Original Message----- From: Neil Schemenauer [mailto:nas-python@python.ca] Sent: Wednesday, 25 June 2003 00:27 To: damien.morton@acm.org Cc: python-dev@python.org Subject: Re: [Python-Dev] Python hash function
Damien Morton wrote:
Maddeningly, I haven't been able to translate these near-perfect distributions into improvements in pystones.
No time to look at this in detail, but do you know about ob_shash? I suspect most strings used by pystone are interned.
Neil