
[Christian Tismer]
... For short strings, this prime has bad influence on the low bits, making it perform supoptimally for small dicts. See the new2 algo which funnily corrects for that. The reason is obvious: Just look at the bit pattern of 1000003: '0xf4243'
Without giving proof, this smells like bad bit distribution on small strings to me. You smell it too, right? ...
[Tim]
As is, string hashes are a lot like integer hashes, in that "consecutive" strings
J001 J002 J003 J004 ...
yield hashes very close together in value.
[back to Christian]
A bad generator in that case. I'll look for a better one.
Not necessarily! It's for that same reason "consecutive strings" can have "better than random" behavior today. And consecutive strings-- like consecutive ints --are a common case. Here are the numbers for the synthesized string cases: N=1000 trips for strings old=293 new=302 new2=221 trips for bin strings old=0 new=0 new2=0 N=2000 trips for strings old=1093 new=1109 new2=786 trips for bin strings old=0 new=0 new2=0 N=3000 trips for strings old=810 new=839 new2=609 trips for bin strings old=0 new=0 new2=0 N=4000 trips for strings old=1850 new=1843 new2=1375 trips for bin strings old=0 new=0 new2=0 Here they are again, after doing nothing except changing the "^" to "+" in the string hash, i.e. replacing x = (1000003*x) ^ *p++; by x = (1000003*x) + *p++; N=1000 trips for strings old=140 new=127 new2=108 trips for bin strings old=0 new=0 new2=0 N=2000 trips for strings old=480 new=434 new2=411 trips for bin strings old=0 new=0 new2=0 N=3000 trips for strings old=821 new=857 new2=631 trips for bin strings old=0 new=0 new2=0 N=4000 trips for strings old=1892 new=1852 new2=1476 trips for bin strings old=0 new=0 new2=0 The first two sizes are dramatically better, the last two a wash. If you want to see a real disaster, replace the "+" with "*" <wink>: N=1000 trips for strings old=71429 new=6449 new2=2048 trips for bin strings old=81187 new=41117 new2=41584 N=2000 trips for strings old=26882 new=9300 new2=6103 trips for bin strings old=96018 new=46932 new2=42408 I got tired of waiting at that point ... suspecting-a-better-string-hash-is-hard-to-find-ly y'rs - tim