[David Abrahams]
Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
[Gordon McMillan]
Jean-Claude Wippler and Christian Tismer did some benchmarks against other implementations. IIRC, the only one in the same ballpark was Lua's (which, IIRC, was faster at least under some conditions).
I'd like to see the benchmark. Like Python, Lua uses a power-of-2 table size, but unlike Python uses linked lists for collisions instead of open addressing. This appears to leave it very vulnerable to bad cases (like using [i << 16 for i in range(20000)] as a set of keys -- Python and Lua both grab the last 15 bits of the ints as their hash codes, which means every key maps to the same hash bucket. Looks like Lua would chain them all together. Python breaks the ties quickly via its collision resolution scrambling.). The Lua string hash appears systematically vulnerable: static unsigned long hash_s (const char *s, size_t l) { unsigned long h = l; /* seed */ size_t step = (l>>5)|1; /* if string is too long, don't hash all its chars */ for (; l>=step; l-=step) h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++)); return h; } That hash function would be weak even if it didn't ignore up to 97% of the input characters. OTOH, if it happens not to collide, ignoring up to 97% of the characters eliminates up to 97% of the expense of computing a hash. Etc. Lua's hashes do appear to get a major benefit from lacking a Python feature: user-defined comparisons can (a) raise exceptions, and (b) mutate the hash table *while* you're looking for a key in it. Those cause the Python implementation lots of expensive pain (indeed, the main reason Python has a distinct lookup function for string-keyed dicts is that it doesn't have to choke itself worrying about #a or #b for builtin strings). There's a lovely irony here. Python's dicts are fast because they've been optimized to death. When Lua's dicts are fast, it seems more the case it's because they don't worry much about bad cases. That's *supposed* to be Python's trick <wink>.