
Sounds good to me! It's a very cheap way to get the high bits into play.
i = (~_hash) & mask
The ~ here seems like pure superstition to me (and the comments in the C code don't justify it at all -- I added a nag of my own about that the last time I checked in dictobject.c -- and see below for a bad consequence of doing ~).
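(A tiny illustration of why the ~ can't, by itself, buy anything: with mask = table_size - 1, (~h) & mask is just mask - (h & mask), a fixed permutation of the slots, so any two hashes that collide without the ~ still collide with it. The numbers here are arbitrary, made up only for the demo:)

    mask = 7  # an 8-slot toy table
    for h in (10, 18, 26):               # all land in slot 2 without the ~ ...
        print(h, h & mask, (~h) & mask)  # ... and all land in slot 5 with it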
# note that we do not mask!
# even the shifting may not be worth it.
incr = _hash ^ (_hash >> 3)
The shifting was just another cheap trick to get *some* influence from the high bits. It's very limited, though. Toss it (it appears to be from the "random operations yield random results" <wink/sigh> matchbook school of design).
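(For concreteness, here is a small Python sketch of the kind of probe sequence under discussion: start slot from the complemented hash, then step by a hash-derived increment. This is a deliberately simplified model, not the real dictobject.c algorithm; in particular it masks the increment and forces it odd, which the quoted code does not, just so the toy sequence is guaranteed to visit every slot of a power-of-two table:)

    def toy_probe_sequence(h, mask, max_probes):
        """First max_probes slots probed for hash h in a table of mask+1 slots."""
        i = (~h) & mask                      # initial slot, as in the snippet above
        incr = ((h ^ (h >> 3)) & mask) | 1   # hash-derived step, forced odd here
        slots = []
        for _ in range(max_probes):
            slots.append(i)
            i = (i + incr) & mask
        return slots

    # Example: the first few probe slots for two hashes in a 16-slot table.
    for h in (0x1234, 0x5678):
        print(hex(h), toy_probe_sequence(h, mask=15, max_probes=5))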
[MAL]
BTW, would changing the hash function on strings from the simple XOR scheme to something a little smarter help improve the performance too (e.g. most strings used in programming never use the 8th bit)?
Don't understand -- the string hash uses multiplication:

    x = (1000003*x) ^ *p++;

in a loop. Replacing "^" there by "+" should yield slightly better results. As is, string hashes are a lot like integer hashes, in that "consecutive" strings

    J001 J002 J003 J004 ...

yield hashes very close together in value. But, because the current dict algorithm uses ~ on the full hash but does not use ~ on the initial increment, (~hash)+incr too often yields the same result for distinct hashes (i.e., there's a systematic (but weak) form of clustering).

Note that Python is doing something very unusual: hashes are *usually* designed to yield an approximation to randomness across all bits. But Python's hashes never achieve that. This drives theoreticians mad (like the fellow who originally came up with the GF idea), but tends to work "better than random" in practice (e.g., a truly random hash function would almost certainly produce many collisions when fed a fat range of consecutive integers that still number less than half the table size; but Python's trivial "identity" integer hash produces no collisions in that common case).
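(To see the "consecutive strings hash close together" effect concretely, here is a rough Python re-implementation of that C loop. The multiply-and-xor step is the one quoted above; the initial seeding and the final xor with the length are from memory of the old string hash and may not match exactly, so treat the absolute values as illustrative only:)

    def old_style_string_hash(s, width=32):
        mask = (1 << width) - 1
        if not s:
            return 0
        x = (ord(s[0]) << 7) & mask                 # seeding: from memory, may differ
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & mask    # the quoted loop body
        return (x ^ len(s)) & mask                  # final mix: from memory, may differ

    # "Consecutive" strings share their prefix, so only the last xor differs:
    for name in ("J001", "J002", "J003", "J004"):
        print(name, old_style_string_hash(name))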
[Christian]
- The bits used from the string hash are not well distributed
- using a "warmup wheel" on the hash to suck all bits in gives the same quality of hashes as random numbers.
See above and be very cautious: none of Python's hash functions produce well-distributed bits, and -- in effect -- that's why Python dicts often perform "better than random" on common data. Even what you've done so far appears to provide marginally worse statistics for Guido's favorite kind of test case ("worse" in two senses: total number of collisions (a measure of amortized lookup cost), and maximum collision chain length (a measure of worst-case lookup cost)):

    d = {}
    for i in range(N):
        d[repr(i)] = i

check-in-one-thing-then-let-it-simmer-ly y'rs - tim
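(For anyone who wants to reproduce that comparison, here is a hedged sketch of a measurement harness for the two statistics Tim names -- total collisions and longest probe chain -- using the simplified probe model from the earlier sketch rather than the real dictobject.c code. Note that hash() for strings is randomized in modern CPython, so absolute numbers will vary from run to run:)

    def collision_stats(keys, hash_func, size):
        """Insert keys into an open-addressed table of `size` (a power of two)
        slots and report (total collisions, longest probe chain)."""
        assert size & (size - 1) == 0 and len(keys) < size
        mask = size - 1
        table = [None] * size
        total = longest = 0
        for key in keys:
            h = hash_func(key)
            i = (~h) & mask                      # initial slot
            incr = ((h ^ (h >> 3)) & mask) | 1   # odd step, as in the toy model
            chain = 0
            while table[i] is not None:          # count probes past occupied slots
                chain += 1
                i = (i + incr) & mask
            table[i] = key
            total += chain
            longest = max(longest, chain)
        return total, longest

    N = 1000
    print(collision_stats([repr(i) for i in range(N)], hash, 2048))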