
Greg Wilson wrote:
Well, the results are not so bad: I stopped testing computation time for the Python dictionary implementation in favor of "trips": how many trips does the rehash take in a dictionary?

Tests were run for dictionaries of size 1000, 2000, 3000, 4000.

Dictionary 1 consists of i, formatted as a string.
Dictionary 2 consists of strings containing the binary of i.
Dictionary 3 consists of random numbers.
Dictionary 4 consists of i << 16.

Algorithms:

old  - the original dictionary algorithm implemented in Python
       (probably quite correct now, using longs :-)
new  - the proposed incremental bits-suck-in-division algorithm.
new2 - a version of new, where all extra bits of the hash function
       are wheeled in in advance. The computation time of this is
       not negligible, so please use this result for reference only.

Here are the results (bad integers (old) not computed for n > 1000):

"""
D:\crml_doc\platf\py>python dictest.py
N=1000
trips for strings old=293 new=302 new2=221
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=499500 new=13187 new2=999
trips for random integers old=377 new=371 new2=393
trips for windows names old=230 new=207 new2=200
N=2000
trips for strings old=1093 new=1109 new2=786
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=26455 new2=1999
trips for random integers old=691 new=710 new2=741
trips for windows names old=503 new=542 new2=564
N=3000
trips for strings old=810 new=839 new2=609
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=38681 new2=2999
trips for random integers old=762 new=740 new2=735
trips for windows names old=712 new=711 new2=691
N=4000
trips for strings old=1850 new=1843 new2=1375
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=52994 new2=3999
trips for random integers old=1440 new=1450 new2=1414
trips for windows names old=1449 new=1434 new2=1457
D:\crml_doc\platf\py>
"""

Interpretation:
---------------
Short numeric strings show a slightly too high trip number. This means that the hash() function could be enhanced. But the effect would be below 10 percent compared to random hashes, and is therefore not worth it.

Binary representations of numbers as strings still produce perfect hash numbers.

Bad integers (a complete hash clash due to a high power of 2) are handled fairly well by the new algorithm. "new2" shows that they can be brought down to nearly perfect hashes just by applying the "hash melting wheel".

Windows names are mostly upper case and fairly verbose. They appear to perform nearly as well as random numbers. This means the Python string hash function is very good for a wide range of applications.

In summary: I would try to modify the string hash function slightly for short strings, but only if this does not negatively affect the results above.

Summary of the summary: there is no really low-hanging fruit in string hashing.

ciao - chris

--
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9  9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com
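To make the "bad integers" row concrete: a Python integer hashes to itself, so every key of the form i << 16 has all-zero low bits, and any table mask narrower than 16 bits sends every key to the same slot. A minimal sketch (modern Python, not part of the original post):

    # Minimal sketch (modern Python, not from the post): integers hash to
    # themselves, so keys of the form i << 16 share one starting slot once
    # the table mask is narrower than 16 bits.
    mask = 1024 - 1                        # a table of size 2**10
    keys = [i << 16 for i in range(20)]
    slots = {hash(k) & mask for k in keys}
    print(slots)                           # {0}: a complete hash clash

(The script below actually computes the initial index as (~_hash) & mask, which parks the clash on slot mask instead of slot 0, but the effect is identical: the probe sequence alone must resolve every collision.)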
## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle
##through GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3, 8 + 3, 16 + 3, 32 + 5, 64 + 3, 128 + 3, 256 + 29,
    512 + 17, 1024 + 9, 2048 + 5, 4096 + 83, 8192 + 27, 16384 + 43,
    32768 + 3, 65536 + 45, 131072 + 9, 262144 + 39, 524288 + 39,
    1048576 + 9, 2097152 + 5, 4194304 + 3, 8388608 + 33, 16777216 + 27,
    33554432 + 9, 67108864 + 71, 134217728 + 39, 268435456 + 9,
    536870912 + 5, 1073741824 + 83, 0
]
polys = map(long, polys)

class NULL:
    pass

class Dictionary:
    dummy = "<dummy key>"

    def __init__(mp, newalg=0):
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg
        mp.warmup = newalg > 1
        mp.trips = 0

    def getTrips(self):
        trips = self.trips
        self.trips = 0
        return trips

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3)  # rec slots
        dummy = mp.dummy
        mask = mp.ma_size - 1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and
                    cmp(ep[me_key], key) == 0):
                return ep
            freeslot = NULL
        ###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # the shifting is worth it in the incremental case.
            ## added after posting to python-dev:
            uhash = _hash & 0xffffffffL
            if mp.warmup:
                incr = uhash
                mask2 = 0xffffffffL ^ mask
                while mask2 > mask:
                    if incr & 1:
                        incr = incr ^ mp.ma_poly
                    incr = incr >> 1
                    mask2 = mask2 >> 1
                # this loop *can* be sped up by tables
                # with precomputed multiple shifts.
                # But I'm not sure if it is worth it at all.
            else:
                incr = uhash ^ (uhash >> 3)
        ###### TO HERE
        if not incr:
            incr = mask
        while 1:
            mp.trips = mp.trips + 1
            ep = ep0[int((i + incr) & mask)]
            if ep[me_key] is NULL:
                if freeslot is not NULL:
                    return freeslot
                else:
                    return ep
            if ep[me_key] == dummy:
                if freeslot == NULL:
                    freeslot = ep
            elif (ep[me_key] == key or
                  (ep[me_hash] == _hash and
                   cmp(ep[me_key], key) == 0)):
                return ep
            # Cycle through GF(2^n)-{0}
            ###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if incr > mask:
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if incr & 1:
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
            ###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3)  # rec slots
        ep = mp.lookdict(key, _hash)
        if ep[me_value] is not NULL:
            old_value = ep[me_value]
            ep[me_value] = value
        else:
            if ep[me_key] is NULL:
                mp.ma_fill = mp.ma_fill + 1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used + 1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3)  # rec slots
        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if newsize > minused:
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1
        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL
        newtable = map(lambda x, y=_nullentry: y[:], range(newsize))
        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0
        for ep in oldtable:
            if ep[me_value] is not NULL:
                mp.insertdict(ep[me_key], ep[me_hash], ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3)  # rec slots
        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
        ## /* if fill >= 2/3 size, double in size */
        if mp.ma_fill * 3 >= mp.ma_size * 2:
            if mp.dictresize(mp.ma_used * 2) != 0:
                if mp.ma_fill + 1 > mp.ma_size:
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions

    def keys(self):
        me_hash, me_key, me_value = range(3)  # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append(_key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3)  # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append(_value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3)  # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append((_key, _value))
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def test(lis, dic):
    for key in lis:
        dic[key]

def nulltest(lis, dic):
    for key in lis:
        dic

def string_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    for i in range(n):
        s = str(i) #* 5
        #s = chr(i%256) + chr(i>>8)##
        d1[s] = d2[s] = d3[s] = i
    return d1, d2, d3

def istring_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    for i in range(n):
        s = chr(i % 256) + chr(i >> 8)
        d1[s] = d2[s] = d3[s] = i
    return d1, d2, d3

def random_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    from whrandom import randint
    import sys
    keys = []
    for i in range(n):
        keys.append(randint(0, sys.maxint - 1))
    for i in keys:
        d1[i] = d2[i] = d3[i] = i
    return d1, d2, d3

def badnum_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    shift = 10
    if EXTREME:
        shift = 16
    for i in range(n):
        bad = i << shift  # was "i << 16" in the posted code; using the
                          # shift variable so the EXTREME flag takes effect
        d2[bad] = d3[bad] = i
        if n <= 1000:
            d1[bad] = i
    return d1, d2, d3

def names_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    import win32con
    keys = win32con.__dict__.keys()
    if len(keys) < n:
        keys = []
    for s in keys[:n]:
        d1[s] = d2[s] = d3[s] = s
    return d1, d2, d3

def do_test(dict):
    keys = dict.keys()
    dict.getTrips()  # reset
    test(keys, dict)
    return dict.getTrips()

EXTREME = 1

if __name__ == "__main__":
    for N in (1000, 2000, 3000, 4000):
        sdold, sdnew, sdnew2 = string_dicts(N)
        idold, idnew, idnew2 = istring_dicts(N)
        bdold, bdnew, bdnew2 = badnum_dicts(N)
        rdold, rdnew, rdnew2 = random_dicts(N)
        ndold, ndnew, ndnew2 = names_dicts(N)
        print "N=%d" % N
        print "trips for strings old=%d new=%d new2=%d" % tuple(
            map(do_test, (sdold, sdnew, sdnew2)))
        print "trips for bin strings old=%d new=%d new2=%d" % tuple(
            map(do_test, (idold, idnew, idnew2)))
        print "trips for bad integers old=%d new=%d new2=%d" % tuple(
            map(do_test, (bdold, bdnew, bdnew2)))
        print "trips for random integers old=%d new=%d new2=%d" % tuple(
            map(do_test, (rdold, rdnew, rdnew2)))
        print "trips for windows names old=%d new=%d new2=%d" % tuple(
            map(do_test, (ndold, ndnew, ndnew2)))

"""
Results with a shift of 10 (EXTREME=0):
D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.097 new=5.088
timing for bad integers old=101.540 new=12.610

Results with a shift of 16 (EXTREME=1):
D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.218 new=5.147
timing for bad integers old=571.210 new=19.220
"""
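As a sanity check on the "Cycle through GF(2^n)-{0}" step in lookdict above: provided the table entry really is a primitive polynomial, the shift/xor division must visit every nonzero value exactly once before repeating. A quick check (modern Python, not part of dictest.py) for the degree-10 entry:

    # Sketch (modern Python): verify the incremental division step cycles
    # through all of GF(2**10)-{0} for poly = 1024 + 9 (x**10 + x**3 + 1).
    poly = 1024 + 9          # the degree-10 entry from the polys table
    incr = 1
    seen = set()
    while incr not in seen:  # stop once the cycle closes
        seen.add(incr)
        if incr & 1:
            incr = incr ^ poly
        incr = incr >> 1
    print(len(seen))         # 1023 == 2**10 - 1: a full cycle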

Something else to ponder: my tests show that the current ("old") algorithm performs much better (somewhat worse than "new2" == new algorithm + warmup) if incr is simply initialized like so instead:

    if mp.oldalg:
        incr = (_hash & 0xffffffffL) % (mp.ma_size - 1)

That's another way to get all the bits to contribute to the result. Note that a mod by size-1 is analogous to "casting out nines" in decimal: it's the same as breaking hash into fixed-sized pieces from the right (10 bits each if size=2**10, etc), adding the pieces together, and repeating that process until only one piece remains. IOW, it's a degenerate form of division, but works well all the same.

It didn't improve over that when I tried a mod by the largest prime less than the table size (which suggests we're sucking all we can out of the *probe* sequence given a sometimes-poor starting index). However, it's subject to the same weak clustering phenomenon as the old method due to the ill-advised "~hash" operation in computing the initial index. If ~ is also thrown away, it's as good as new2 (here I've tossed out the "windows names", and "old" == existing algorithm except (a) get rid of ~ when computing index and (b) do mod by size-1 when computing incr):

N=1000
trips for strings old=230 new=261 new2=239
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=999 new=13212 new2=999
trips for random integers old=399 new=421 new2=410

N=2000
trips for strings old=787 new=1066 new2=827
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=26481 new2=1999
trips for random integers old=652 new=733 new2=650

N=3000
trips for strings old=547 new=760 new2=617
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=38701 new2=2999
trips for random integers old=724 new=743 new2=768

N=4000
trips for strings old=1311 new=1657 new2=1315
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=53014 new2=3999
trips for random integers old=1476 new=1587 new2=1493

The new and new2 values differ in minor ways from the ones you posted because I got rid of the ~ (the ~ has a bad interaction with "additive" means of computing incr, because the ~ tends to move the index in the opposite direction, and these moves in opposite directions tend to cancel out when computing incr+index the first time).

too-bad-mod-is-expensive!-ly y'rs  - tim
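The "casting out nines" analogy can be checked directly: because 2**k == 1 (mod 2**k - 1), summing the k-bit pieces of a number preserves its residue mod size-1. A small sketch (modern Python, function name mine, not from the thread):

    # Sketch: folding a number into k-bit pieces and adding them is the
    # same as reducing it mod 2**k - 1.
    def fold_mod(h, k):
        m = (1 << k) - 1
        while h > m:
            h = (h & m) + (h >> k)    # add the k-bit pieces together
        return 0 if h == m else h     # m itself is congruent to 0
    h = 0xDEADBEEF
    print(fold_mod(h, 10), h % 1023)  # both print 77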

Tim Peters wrote:
Sure. I did this as well, but didn't consider a division, since it was said to be too slow. But this is very platform dependent. On Pentiums this might not be noticeable.
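Whether the division really is too slow is easy to test per platform; a throwaway micro-benchmark along these lines (modern Python, mine; absolute numbers vary by machine, which is the point) compares one modulo against one shift/xor step of the incremental division:

    # Hypothetical micro-benchmark: cost of a modulo versus one step of
    # the incremental division, on whatever machine runs it.
    import timeit
    mod_t = timeit.timeit("h % 1023", setup="h = 0x9E3779B9", number=10**6)
    div_t = timeit.timeit("((h ^ 1033) >> 1) if h & 1 else h >> 1",
                          setup="h = 0x9E3779B9", number=10**6)
    print("mod: %.3fs  shift/xor: %.3fs" % (mod_t, div_t))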
Again, I tried this too. Instead of the largest prime below the table size I used the nearest prime. Remarkably, the nearest prime is identical to the primitive element in a lot of cases. But there was no improvement over the size-1 modulus.
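That coincidence is visible on the small end of the polys table. A quick sketch (modern Python, helper name mine; reading "nearest" as the next prime above the table size):

    # Sketch: compare the next prime above each table size with the
    # polynomial entry for that size. Trial division is fine at this scale.
    def next_prime(n):
        def is_prime(m):
            if m < 2:
                return False
            d = 2
            while d * d <= m:
                if m % d == 0:
                    return False
                d = d + 1
            return True
        while not is_prime(n):
            n = n + 1
        return n
    pairs = [(4, 4+3), (8, 8+3), (16, 16+3), (32, 32+5), (64, 64+3),
             (128, 128+3), (256, 256+29), (512, 512+17), (1024, 1024+9),
             (2048, 2048+5)]
    for size, poly in pairs:
        print(size, next_prime(size + 1), poly)
    # the two columns agree for sizes 8, 32, 64, 128, and 2048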
...
Remarkable.
> too-bad-mod-is-expensive!-ly y'rs  - tim
Yes. The wheel is cheapest yet.

ciao - chris

--
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9  9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com