
[Tim]
... Overall, though, I'd welcome a faster string hash, and I agree that Python's isn't particularly zippy.
[Scott Crosby]
Actually, at least on x86, it is faster than Perl's. On other platforms it may be somewhat slower.
Perl's isn't particularly zippy either. I believe that, given heroic coding effort, a good universal hash designed for speed can get under 1 cycle per byte hashed on a modern Pentium. Python's and Perl's string hashes aren't really in the same ballpark.
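For the curious, the flavor of universal hashing is easy to show even if the speed isn't. Here's a minimal Python sketch of one classic almost-universal scheme (polynomial evaluation at a secret point, in the Carter-Wegman tradition) -- nothing like a tuned implementation such as UHASH, but it shows where the randomness enters:

    import secrets

    P = (1 << 61) - 1          # a Mersenne prime, for cheap reduction
    K = secrets.randbelow(P)   # secret per-process evaluation point

    def poly_hash(data):
        # Evaluate the message's polynomial at the secret point K, mod P.
        # An adversary who never learns K (or any hash values) can't do
        # better than roughly len(data)/P odds of forcing a collision.
        x = 0
        for byte in data:
            x = (x * K + byte) % P
        return x

    print(poly_hash(b"hello"))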
... Yes, I am aware of the problems with the UHASH code. Unfortunately, I am not a hash function designer, that code is not mine, and I only use it as a black box.
I also consider all code, until verified otherwise, to potentially suffer from endianness, alignment, and 32/64-bit issues. Excluding alignment issues (where I'm not sure whether it's acceptable to fail on strange alignments or not), it has passed *my* self-tests on big-endian and 64-bit platforms.
Then who's going to vet the code on a Cray T3 (etc, etc, etc, etc)? This isn't a nag, it cuts to the heart of what a language like Python can do: the x-platform behavior of Python's current string hash is easy to understand, relying only on what the C standard demands. It's doing (only) one evil thing, relying on the unspecified (by C) semantics of what happens when a signed long multiply overflows. Python runs on just about every platform on Earth, and that hasn't been a problem so far. If it becomes a problem, we can change the accumulator to unsigned long, and then C would specify what happens. There ends the exhaustive discussion of all portability questions about our current code <wink>.
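For concreteness, here's a pure-Python sketch of that loop as I understand the C source of this era (the constant 1000003 is the real multiplier; the masking simulates the wraparound that an unsigned long accumulator would make portable):

    def string_hash(s, width=32):
        # Pure-Python model of the C loop: multiplier 1000003, initial
        # left-shift of the first byte, length XOR'ed in at the end.
        # The mask simulates modular wraparound, i.e. exactly what an
        # unsigned long accumulator would guarantee portably.
        mask = (1 << width) - 1
        if not s:
            return 0
        x = (ord(s[0]) << 7) & mask
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & mask
        x = (x ^ len(s)) & mask
        return mask - 1 if x == mask else x   # C maps a hash of -1 to -2

    print(string_hash("abc"))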
...
+ Some Python apps run for months, and universal hashing doesn't remove the possibility of quadratic-time behavior. If I can poke at a long-running app and observe its behavior, over time I can deduce a set of keys that collide.
I argued on linux-kernel with someone else that this was extremely unlikely. It requires the latency of a collision versus a non-collision being noticeable over a noisy network stack and system.
So you have in mind only apps accessed across a network? And then, for example, discounting the possibility that a bitter sysadmin opens an interactive Python shell on the box, prints out a gazillion (i, hash(i)) pairs, and mails them to himself for future mischief?
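The harvesting is a one-liner -- a sketch, using string keys since those are what a keyed string hash is meant to protect:

    # hash() is deterministic for a given build (and key, under UH), so
    # anyone with an interactive shell on the box can dump (input, hash)
    # pairs now and put them to use later against the same process.
    for i in range(1000000):
        s = str(i)
        print(s, hash(s))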
In almost all cases, for short inputs, the cost of a single L2 cache miss far exceeds that of hashing.
If the user is restricted to short inputs, provoking quadratic-time behavior doesn't matter.
A more serious danger is an application that leaks actual hash values.
So ever-more operations become "dangerous". Programmers will never keep this straight, and Python isn't a straitjacket language. I still vote that apps for which this matters use an *appropriate* data structure instead -- Python isn't an application, it's a language for programming applications.
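For example, a comparison-based mapping never cares what its keys hash to. A toy sketch (lookups are O(log n) worst case; inserts here are O(n) from list shifting, so a real app would reach for a balanced tree, but nothing about it depends on hash luck):

    import bisect

    class SortedDict:
        # Keys kept in sorted order; lookup is binary search, so the
        # worst case is O(log n) regardless of how hostile the keys are.
        def __init__(self):
            self._keys = []
            self._values = []

        def __setitem__(self, key, value):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                self._values[i] = value
            else:
                self._keys.insert(i, key)
                self._values.insert(i, value)

        def __getitem__(self, key):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                return self._values[i]
            raise KeyError(key)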
... Agreed, many have realized over the years that hash tables can have quadratic behavior in an adversarial environment.
People from real-time backgrounds are much more paranoid than that, and perhaps the security-conscious could learn a lot from them. For example, you're not going to find a dynamic hash table in a real-time app, because they can't take a chance on bad luck either. To them, "an adversary" is just an unlucky roll of the dice, and they can't tolerate it.
It isn't hidden. Cormen, Leiserson, and Rivest even warn about this in their seminal algorithms textbook in 1991.
Knuth warned about it a lot earlier than that <wink> -- but see his summary of hashing in TAoCP, Vol. 3; it gives strong warnings.
It *is* obvious once thought of, but the reason I was able to ship out so many vulnerability reports yesterday is that few actually *have* thought of that deterministic worst case when writing their programs. I predict this trend will continue.
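The worst case is easy to demonstrate without any hash-function analysis. A toy sketch (real attacks use ordinary strings engineered to collide under the target's hash, not a rigged __hash__):

    import time

    class Collide:
        # Every instance lands in the same hash bucket; equality is by
        # value, so each insert scans all the previously inserted keys.
        __slots__ = ('v',)
        def __init__(self, v):
            self.v = v
        def __hash__(self):
            return 42
        def __eq__(self, other):
            return self.v == other.v

    def build(n):
        start = time.perf_counter()
        d = {}
        for i in range(n):
            d[Collide(i)] = None
        return time.perf_counter() - start

    for n in (1000, 2000, 4000):
        print(n, build(n))   # doubling n roughly quadruples the time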
I appreciate that, and I expect it to continue too. I expect a better solution would be for more languages to offer a choice of containers with different O() behaviors. In C it's hard to make progress because the standard language comes with so little, and so many "portable" C libraries aren't. The C++ world is in better shape, constrained more by the portability of standard C++ itself. There's less excuse for Python or Perl programmers to screw up in these areas, because libraries written in Python and Perl are very portable, and there are a lot of 'em to choose from.
I like hash tables a lot: with UH, their time bounds are randomized but pretty tight, and their constant factors handily beat those of balanced binary trees.
Probably, but have you used a tree implementation into which the same heroic level of analysis and coding effort has been poured? The typical portable-C balanced tree implementation should be viewed as a worst-case bound on how fast balanced trees can actually work. Recently, heroic efforts have been poured into Judy tries, which may be both faster and more memory-efficient than hash tables in many kinds of apps: http://judy.sourceforge.net/ The code for Judy tries makes UHASH look trivial, though. OTOH, provided the Judy code actually works on your box, and there aren't bugs hiding in its thousands of lines of difficult code, relying on a Judy "array" for good worst-case behavior isn't a matter of luck.