[Python-Dev] RE: The Dictionary Gem is polished!

Tim Peters tim_one@email.msn.com
Tue, 19 Dec 2000 04:24:18 -0500


[Christian Tismer]
> ...
> For short strings, this prime has bad influence on the low bits,
> making it perform supoptimally for small dicts.
> See the new2 algo which funnily corrects for that.
> The reason is obvious: Just look at the bit pattern
> of 1000003:  '0xf4243'
>
> Without giving proof, this smells like bad bit distribution on small
> strings to me. You smell it too, right?
> ...

[Tim]
> As is, string hashes are a lot like integer hashes, in that
> "consecutive" strings
>
>    J001
>    J002
>    J003
>    J004
>    ...
>
> yield hashes very close together in value.

[back to Christian]
> A bad generator in that case. I'll look for a better one.

Not necessarily!  It's for that same reason "consecutive strings" can have
"better than random" behavior today.  And consecutive strings-- like
consecutive ints --are a common case.

Here are the numbers for the synthesized string cases:

N=1000
trips for strings old=293 new=302 new2=221
trips for bin strings old=0 new=0 new2=0
N=2000
trips for strings old=1093 new=1109 new2=786
trips for bin strings old=0 new=0 new2=0
N=3000
trips for strings old=810 new=839 new2=609
trips for bin strings old=0 new=0 new2=0
N=4000
trips for strings old=1850 new=1843 new2=1375
trips for bin strings old=0 new=0 new2=0

Here they are again, after doing nothing except changing the "^" to "+" in
the string hash, i.e. replacing

		x = (1000003*x) ^ *p++;
by
		x = (1000003*x) + *p++;

N=1000
trips for strings old=140 new=127 new2=108
trips for bin strings old=0 new=0 new2=0
N=2000
trips for strings old=480 new=434 new2=411
trips for bin strings old=0 new=0 new2=0
N=3000
trips for strings old=821 new=857 new2=631
trips for bin strings old=0 new=0 new2=0
N=4000
trips for strings old=1892 new=1852 new2=1476
trips for bin strings old=0 new=0 new2=0

The first two sizes are dramatically better, the last two a wash.  If you
want to see a real disaster, replace the "+" with "*" <wink>:

N=1000
trips for strings old=71429 new=6449 new2=2048
trips for bin strings old=81187 new=41117 new2=41584
N=2000
trips for strings old=26882 new=9300 new2=6103
trips for bin strings old=96018 new=46932 new2=42408

I got tired of waiting at that point ...

suspecting-a-better-string-hash-is-hard-to-find-ly y'rs  - tim