[Scott Crosby]
... Have you done benchmarking to prove that string_hash is in fact an important hotspot in the Python interpreter?
It depends on the specific application, of course. The overall speed of dicts is crucial, but in many "namespace" dict apps the strings are interned, implying (among other things) that a hash code is computed only once per string. In those apps the speed of the string hash function isn't so important. Overall, though, I'd welcome a faster string hash, and I agree that Python's isn't particularly zippy.

OTOH, it's just a couple lines of C that run fine on everything from Palm Pilots to Crays, and have created no portability or maintenance headaches (a sketch follows below). Browsing your site, you've got over 100KB of snaky C code to implement hashing functions, some with known bugs, others with cautions about open questions wrt platforms with different endianness and word sizes than the code was initially tested on. Compared to what Python is using now, that's a maintenance nightmare.

Note that Python's hash API doesn't return 32 bits; it returns a hash code of the same size as the native C long. The multiplication gimmick doesn't require any pain to do that.

Other points that arise in practical deployment:

+ Python dicts can be indexed by many kinds of immutable objects. Strings are just one kind of key, and Python has many hash functions.

+ If I understand what you're selling, the hash code of a given string will almost certainly change across program runs. That's a very visible change in semantics, since hash() is a builtin Python function available to user code. Some programs use hash codes to index into persistent (file- or database-based) data structures, and such code would plain break if the hash code of a string changed from one run to the next. I expect the user-visible hash() would have to continue using a predictable function.

+ Some Python apps run for months, and universal hashing doesn't remove the possibility of quadratic-time behavior. If I can poke at a long-running app and observe its behavior, over time I can deduce a set of keys that collide badly for any hashing scheme fixed when the program starts. In that sense I don't believe this gimmick wholly plugs the hole it's trying to plug.
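For concreteness, here's a self-contained sketch of that couple of lines of C. The seed from the first byte, the 1000003 multiplier, and the final xor with the length follow the interpreter's string_hash scheme; the PyStringObject plumbing and the cached-hash shortcut are omitted, and the unsigned arithmetic standing in for signed long wraparound is my own simplification:

    #include <stddef.h>

    /* Sketch of CPython's multiply-and-xor string hash.  Interning and
     * the per-object hash cache are left out; only the core loop is
     * shown. */
    long
    sketch_string_hash(const unsigned char *p, size_t len)
    {
        unsigned long x;
        size_t n = len;

        if (len == 0)
            return 0;
        x = (unsigned long)*p << 7;        /* seed from the first byte */
        while (n-- > 0)
            x = (1000003UL * x) ^ *p++;    /* one multiply, one xor per byte */
        x ^= (unsigned long) len;          /* fold in the length */
        if ((long) x == -1)                /* -1 is reserved as an error code */
            x = (unsigned long) -2;
        return (long) x;
    }

The per-byte work is one multiply and one xor, with no modulo anywhere, and the result is naturally as wide as a native C long on any platform.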
If so, and doing one modulo operation per string is unacceptable,
If it's mod by a prime, probably. Some architectures Python runs on require hundreds of cycles to do an integer mod, and we don't have the resources to construct custom mod-by-an-int shortcut code for dozens of obscure architectures.
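To make the cost concrete, here's a hypothetical keyed hash in the Carter-Wegman spirit; the function name, the parameter a, and the choice of prime are made up for illustration, not taken from Scott's code. The point is the single "% PRIME" per string, which is exactly the integer mod that costs hundreds of cycles on some of the architectures we support:

    #include <stddef.h>

    #define PRIME 2147483647UL   /* 2^31 - 1, picked only for illustration */

    /* Hypothetical keyed hash: the multiplier a is drawn at random when
     * the interpreter starts, and the result is reduced mod a prime. */
    unsigned long
    keyed_hash(const unsigned char *p, size_t len, unsigned long a)
    {
        unsigned long x = 0;

        while (len-- > 0)
            x = x * a + *p++;    /* accumulate with the random multiplier */
        return x % PRIME;        /* the one modulo per string at issue */
    }

For contrast, the current scheme never divides at all: the dict indexes its power-of-two table by masking the hash with (table size - 1).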
then you may wish to consider Jenkins's hash. The Linux kernel people are switching to using a keyed variant of Jenkins's hash. However, Jenkins's hash, AFAIK, has no proof that it is in fact universal. It probably is safe, however.
Nobody writing a Python program *has* to use a dict. That dicts have quadratic-time worst-case behavior isn't hidden, and there's no cure for that short of switching to a data structure with better worst-case bounds. I certainly agree it's something for programmers to be aware of. I still don't see any reason for the core language to care about this, though.
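To see what that worst case looks like, here's a toy open-addressing table, deliberately not CPython's dict (the probing and sizing are naive), fed keys that all hash to the same slot. Doubling n roughly quadruples the total probe count:

    #include <stdio.h>
    #include <string.h>

    #define TABLE_SIZE (1 << 16)

    static long slots[TABLE_SIZE];
    static int used[TABLE_SIZE];
    static long probes;

    /* Insert with linear probing, counting the occupied slots we walk. */
    static void
    insert(long key, unsigned long hash)
    {
        unsigned long i = hash & (TABLE_SIZE - 1);

        while (used[i] && slots[i] != key) {
            i = (i + 1) & (TABLE_SIZE - 1);
            probes++;
        }
        slots[i] = key;
        used[i] = 1;
    }

    int
    main(void)
    {
        int n;
        long k;

        for (n = 1000; n <= 4000; n *= 2) {
            memset(used, 0, sizeof(used));
            probes = 0;
            for (k = 0; k < n; k++)
                insert(k, 0);    /* adversarial: every key hashes alike */
            printf("n=%d  total probes=%ld\n", n, probes);
        }
        return 0;
    }

With well-spread hashes the probe count stays near zero; with the colliding keys it's n*(n-1)/2.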