
Again... Tim Peters wrote:
Sounds good to me! It's a very cheap way to get the high bits into play.
...
[Christian]
- The bits used from the string hash are not well distributed; using a "warmup wheel" on the hash to suck all bits in gives the same quality of hashes as random numbers.
See above and be very cautious: none of Python's hash functions produces well-distributed bits, and, in effect, that's why Python dicts often perform "better than random" on common data. Even what you've done so far appears to give marginally worse statistics for Guido's favorite kind of test case, "worse" in two senses: total number of collisions (a measure of amortized lookup cost), and maximum collision chain length (a measure of worst-case lookup cost):
    d = {}
    for i in range(N):
        d[repr(i)] = i
I will look into this.
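A minimal sketch of how these two numbers can be reproduced with the Dictionary class from dictest.py attached below (trips counts probes beyond the first slot, tripmax the longest such probe chain; getTrips() returns both and resets the counters):

    from dictest import Dictionary

    d = Dictionary()              # 0 = the original rehash algorithm
    for i in range(1000):
        d[repr(i)] = i            # Guido's favorite kind of test case
    d.getTrips()                  # discard counts accumulated while filling
    for key in d.keys():
        d[key]                    # look every key up once
    trips, tripmax = d.getTrips()
    print "collisions=%d  longest chain=%d" % (trips, tripmax)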
check-in-one-thing-then-let-it-simmer-ly y'rs - tim
Are you saying I should check the thing in? Really?

In another reply to this message I was saying

"""
This is why I think to be even more conservative:
Try to use a division wheel, but with the inverses
of the original primitive roots, just in order to
get at Guido's results :-)
"""

This was a religious desire, but such an inverse cannot exist. Well, all inverses exist, but it is an error to think that they can produce similar bit patterns. Changing the root means changing the whole system, since we have just a *representation* of a group, via polynomial coefficients. A simple example which renders my thought useless is this: there is no general pattern that can turn a physical right shift into a left shift, for all bit combinations.

Anyway, how can I produce a nearly complete scheme like today's with the same "cheaper than random" properties?

OK, we have to stick with the given polynomials to stay compatible, and we also have to shift left. How do we then rotate the random bits in? Well, we can in fact do a rotation of the whole index, moving the highest bit into the lowest. Too bad that this isn't supported in C; it is a native machine instruction on X86 machines. We would then have:

    incr = ROTATE_LEFT(incr, 1)
    if (incr > mask):
        incr = incr ^ mp.ma_poly

(A portable way to write ROTATE_LEFT without a native instruction is sketched after the results below.)

The effect is similar to the "old" algorithm: bits are shifted left. Only if the hash happens to have high bits do they appear in the modulus. In the current "faster than random" cases, I assume that high bits in the hash are less likely than low bits, so it is more likely that an entry finds its good place in the dict before bits are rotated in. Hence the "good" cases would be kept.

I did all tests again, now including maximum trip length, and added a "rotate-left" version as well:

D:\crml_doc\platf\py>python dictest.py
N=1000
trips for strings          old=293/9       new=302/7     new2=221/7    rot=278/5
trips for bad integers     old=499500/999  new=13187/31  new2=999/1    rot=16754/31
trips for random integers  old=360/8       new=369/8     new2=358/6    rot=356/7
trips for windows names    old=230/5       new=207/7     new2=200/5    rot=225/5
N=2000
trips for strings          old=1093/11     new=1109/10   new2=786/6    rot=1082/8
trips for bad integers     old=0/0         new=26455/32  new2=1999/1   rot=33524/34
trips for random integers  old=704/7       new=686/8     new2=685/7    rot=693/7
trips for windows names    old=503/8       new=542/9     new2=564/6    rot=529/7
N=3000
trips for strings          old=810/5       new=839/6     new2=609/5    rot=796/5
trips for bad integers     old=0/0         new=38681/36  new2=2999/1   rot=49828/38
trips for random integers  old=708/5       new=723/7     new2=724/5    rot=722/6
trips for windows names    old=712/6       new=711/5     new2=691/5    rot=738/9
N=4000
trips for strings          old=1850/9      new=1843/8    new2=1375/11  rot=1848/10
trips for bad integers     old=0/0         new=52994/39  new2=3999/1   rot=66356/38
trips for random integers  old=1395/9      new=1397/8    new2=1435/9   rot=1394/13
trips for windows names    old=1449/8      new=1434/8    new2=1457/11  rot=1513/9
D:\crml_doc\platf\py>

Concerning trip length, rotate is better than old in most cases. Random integers seem to withstand any of these procedures. For bad integers, rot naturally takes more trips than new, since the path to the bits is longer.

All in all I don't see more than marginal differences between the approaches, and I tend to stick with "new", since it is the cheapest to implement: it does not cost anything, and might even be a little cheaper for some compilers, since it does not reference the mask variable.
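For completeness, a sketch of one portable way to express the one-bit ROTATE_LEFT above without a native rotate; many compilers recognize the equivalent expression on an unsigned 32-bit integer and emit a single rotate instruction:

    def rotate_left_1(x):
        # 32-bit rotate left by one: the old high bit becomes the new low bit
        return ((x << 1) | ((x >> 31) & 1)) & 0xffffffffL

The rotleft branch in dictest.py below spells the same idea out by hand, testing the 0x80000000 bit before shifting.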
I'd say let's do the patch.

ciao - chris

--
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9  9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com

## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle through
##GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3, 8 + 3, 16 + 3, 32 + 5, 64 + 3, 128 + 3, 256 + 29,
    512 + 17, 1024 + 9, 2048 + 5, 4096 + 83, 8192 + 27, 16384 + 43,
    32768 + 3, 65536 + 45, 131072 + 9, 262144 + 39, 524288 + 39,
    1048576 + 9, 2097152 + 5, 4194304 + 3, 8388608 + 33, 16777216 + 27,
    33554432 + 9, 67108864 + 71, 134217728 + 39, 268435456 + 9,
    536870912 + 5, 1073741824 + 83,
    0
]
polys = map(long, polys)

class NULL:
    pass

class Dictionary:
    dummy = "<dummy key>"

    def __init__(mp, newalg=0):
        # newalg: 0 = original rehash, 1 = "new" (division) rehash,
        #         2 = "new" with warmup wheel, 3 = rotate-left variant
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg
        mp.warmup = newalg==2
        mp.rotleft = newalg==3
        mp.trips = 0
        mp.tripmax = 0

    def getTrips(self):
        trips, tripmax = self.trips, self.tripmax
        self.trips = self.tripmax = 0
        return trips, tripmax

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3) # rec slots
        dummy = mp.dummy
        mask = mp.ma_size-1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and
                cmp(ep[me_key], key) == 0) :
                return ep
            freeslot = NULL
###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # the shifting is worth it in the incremental case.

            ## added after posting to python-dev:
            uhash = _hash & 0xffffffffL
            if mp.warmup:
                incr = uhash
                mask2 = 0xffffffffL ^ mask
                while mask2 > mask:
                    if (incr & 1):
                        incr = incr ^ mp.ma_poly
                    incr = incr >> 1
                    mask2 = mask2 >> 1
                # this loop *can* be sped up by tables
                # with precomputed multiple shifts.
                # But I'm not sure if it is worth it at all.
            else:
                incr = uhash ^ (uhash >> 3)
###### TO HERE
        if (not incr):
            incr = mask
        triplen = 0
        while 1:
            mp.trips = mp.trips+1
            triplen = triplen+1
            if triplen > mp.tripmax: mp.tripmax = triplen
            ep = ep0[int((i+incr)&mask)]
            if (ep[me_key] is NULL) :
                if (freeslot is not NULL) :
                    return freeslot
                else:
                    return ep
            if (ep[me_key] == dummy) :
                if (freeslot == NULL):
                    freeslot = ep
            elif (ep[me_key] == key or
                  (ep[me_hash] == _hash and
                   cmp(ep[me_key], key) == 0)) :
                return ep
            # Cycle through GF(2^n)-{0}
###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            elif mp.rotleft:
                if incr & 0x80000000L:
                    incr = (incr << 1) | 1
                else:
                    incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if (incr & 1):
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3) # rec slots
        ep = mp.lookdict(key, _hash)
        if (ep[me_value] is not NULL) :
            old_value = ep[me_value]
            ep[me_value] = value
        else :
            if (ep[me_key] is NULL):
                mp.ma_fill = mp.ma_fill+1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used+1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3) # rec slots
        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if (newsize > minused) :
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1

        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL

        newtable = map(lambda x, y=_nullentry: y[:], range(newsize))

        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0

        for ep in oldtable:
            if (ep[me_value] is not NULL):
                mp.insertdict(ep[me_key], ep[me_hash], ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3) # rec slots
        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
        ## /* if fill >= 2/3 size, double in size */
        if (mp.ma_fill*3 >= mp.ma_size*2) :
            if (mp.dictresize(mp.ma_used*2) != 0):
                if (mp.ma_fill+1 > mp.ma_size):
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions
    def keys(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append(_key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append(_value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append((_key, _value))
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def test(lis, dic):
    for key in lis: dic[key]

def nulltest(lis, dic):
    for key in lis: dic

def string_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    for i in range(n):
        s = str(i) #* 5
        #s = chr(i%256) + chr(i>>8)##
        d1[s] = d2[s] = d3[s] = d4[s] = i
    return d1, d2, d3, d4

def istring_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    for i in range(n):
        s = chr(i%256) + chr(i>>8)
        d1[s] = d2[s] = d3[s] = d4[s] = i
    return d1, d2, d3, d4

def random_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    from whrandom import randint
    import sys
    keys = []
    for i in range(n):
        keys.append(randint(0, sys.maxint-1))
    for i in keys:
        d1[i] = d2[i] = d3[i] = d4[i] = i
    return d1, d2, d3, d4

def badnum_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    shift = 10
    if EXTREME:
        shift = 16
    for i in range(n):
        bad = i << shift        # keys whose low bits are all zero
        d2[bad] = d3[bad] = d4[bad] = i
        if n <= 1000: d1[bad] = i
    return d1, d2, d3, d4

def names_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    import win32con
    keys = win32con.__dict__.keys()
    if len(keys) < n: keys = []
    for s in keys[:n]:
        d1[s] = d2[s] = d3[s] = d4[s] = s
    return d1, d2, d3, d4

def do_test(dict):
    keys = dict.keys()
    dict.getTrips() # reset
    test(keys, dict)
    return "%d/%d" % dict.getTrips()

EXTREME = 1

if __name__ == "__main__":

    for N in (1000, 2000, 3000, 4000):
        sdold, sdnew, sdnew2, sdrot = string_dicts(N)
        #idold, idnew, idnew2, idrot = istring_dicts(N)
        bdold, bdnew, bdnew2, bdrot = badnum_dicts(N)
        rdold, rdnew, rdnew2, rdrot = random_dicts(N)
        ndold, ndnew, ndnew2, ndrot = names_dicts(N)
        fmt = "old=%s new=%s new2=%s rot=%s"
        print "N=%d" % N
        print ("trips for strings          "+fmt) % tuple(
            map(do_test, (sdold, sdnew, sdnew2, sdrot)))
        #print ("trips for bin strings      "+fmt) % tuple(
        #    map(do_test, (idold, idnew, idnew2, idrot)))
        print ("trips for bad integers     "+fmt) % tuple(
            map(do_test, (bdold, bdnew, bdnew2, bdrot)))
        print ("trips for random integers  "+fmt) % tuple(
            map(do_test, (rdold, rdnew, rdnew2, rdrot)))
        print ("trips for windows names    "+fmt) % tuple(
            map(do_test, (ndold, ndnew, ndnew2, ndrot)))

"""
Results with a shift of 10 (EXTREME=0):

D:\crml_doc\platf\py>python dictest.py
timing for strings        old=5.097    new=5.088
timing for bad integers   old=101.540  new=12.610

Results with a shift of 16 (EXTREME=1):

D:\crml_doc\platf\py>python dictest.py
timing for strings        old=5.218    new=5.147
timing for bad integers   old=571.210  new=19.220
"""
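A quick sketch to convince oneself that the "new" (division) rehash step really cycles through all of GF(2^n)-{0}, using the n=4 polynomial 16 + 3 from the table above:

    def cycle(poly, n):
        # apply the division step until the increment repeats
        mask = (1 << n) - 1
        incr = 1
        seen = []
        while incr not in seen:
            seen.append(incr)
            if incr & 1:
                incr = incr ^ poly
            incr = incr >> 1
        return seen

    print len(cycle(16 + 3, 4))     # 15 == 2**4 - 1, every nonzero increment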