[Python-Dev] Algorithmic Complexity Attack on Python

Tim Peters tim.one@comcast.net
Sat, 31 May 2003 16:21:44 -0400


[Tim]
>> ...
>> Overall, though, I'd welcome a faster string hash, and I agree that
>> Python's isn't particularly zippy.

[Scott Crosby]
> Actually, at least on x86, it is faster than Perl's. On other
> platforms, it may be somewhat slower.

Perl's isn't particularly zippy either.  I believe that, given heroic coding
effort, a good universal hash designed for speed can get under 1 cycle per
byte hashed on a modern Pentium.  Python's and Perl's string hashes aren't
really in the same ballpark.

> ...
> Yes, I am aware of the problems with the UHASH code. Unfortunately, I
> am not a hash function designer, that code is not mine, and I only use
> it as a black box.
>
> I also consider all code, until verified otherwise, to potentially
> suffer from endianness, alignment, and 32/64-bit issues. Excluding
> alignment issues (where I'm not sure whether it's OK to fail on
> strange alignments or not), it has passed *my* self-tests on big
> endian and 64 bit.

Then who's going to vet the code on a Cray T3 (etc, etc, etc, etc)?  This
isn't a nag; it cuts to the heart of what a language like Python can do:
the x-platform behavior of Python's current string hash is easy to
understand, relying only on what the C standard demands.  It's doing (only)
one evil thing, relying on the unspecified (by C) semantics of what happens
when a signed long multiply overflows.  Python runs on just about every
platform on Earth, and that hasn't been a problem so far.  If it becomes a
problem, we can change the accumulator to unsigned long, and then C would
specify what happens.  There ends the exhaustive discussion of all
portability questions about our current code <wink>.
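
For concreteness, here's that hash rendered as a Python sketch (a sketch
only, from a reading of CPython 2.x's stringobject.c; the explicit mask
plays the role an unsigned long accumulator would play in the C code):

    def string_hash(s):
        # Sketch of CPython 2.x's string hash.  The 32-bit mask
        # emulates unsigned wraparound, which the C code currently
        # gets from signed-long overflow instead.
        if not s:
            return 0
        x = ord(s[0]) << 7
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & 0xFFFFFFFFL
        x ^= len(s)
        if x == 0xFFFFFFFFL:  # C maps a hash of -1 to -2 (-1 means "error")
            x = 0xFFFFFFFEL
        return x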

> ...
>> + Some Python apps run for months, and universal hashing doesn't
>>   remove the possibility of quadratic-time behavior.  If I can poke
>>   at a long-running app and observe its behavior, over time I can
>>   deduce a

> I argued on linux-kernel with someone else that this was extremely
> unlikely. It requires the latency of a collision/non-collision being
> noticeable over a noisy network stack and system.

So you have in mind only apps accessed across a network?  And then, for
example, discounting the possibility that a bitter sysadmin opens an
interactive Python shell on the box, prints out a gazillion (i, hash(i))
pairs and mails them to himself for future mischief?
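
That kind of mischief is cheap, too.  A minimal sketch (hashing strings
here, since string hashes are what a randomized scheme would scramble):

    # dump (input, hash) pairs for later offline analysis
    fp = open('hashes.txt', 'w')
    for i in xrange(1000000):
        s = str(i)
        fp.write('%s %d\n' % (s, hash(s)))
    fp.close()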

> In almost all cases, for short inputs, the cost of a single L2 cache
> miss far exceeds that of hashing.

If the user is restricted to short inputs, provoking quadratic-time behavior
doesn't matter.
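
When inputs aren't restricted, though, the degenerate case is easy to
demonstrate.  A self-contained sketch (the Collider class is made up for
illustration; forcing equal hashes via __hash__ stands in for an attacker
who has deduced colliding inputs):

    import time

    class Collider:
        # every instance hashes alike, so dict probing degenerates to
        # a linear scan of one chain -- n inserts cost O(n**2) total
        def __init__(self, v):
            self.v = v
        def __hash__(self):
            return 42
        def __eq__(self, other):
            return self.v == other.v

    for n in (1000, 2000, 4000):
        t0 = time.time()
        d = {}
        for i in range(n):
            d[Collider(i)] = None
        print n, time.time() - t0   # roughly quadruples as n doubles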

> A more serious danger is an application that leaks actual hash values.

So ever-more operations become "dangerous".  Programmers will never keep
this straight, and Python isn't a straitjacket language.  I still vote
that apps for which this matters use an *appropriate* data structure
instead -- Python isn't an application, it's a language for programming
applications.
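
For instance, a minimal sketch using the standard library's bsddb wrapper
(the filename is arbitrary):

    import bsddb

    # a dict-like B-tree; lookups cost O(log n) no matter what the keys
    # hash to, so there's no hash worst case to provoke
    db = bsddb.btopen('/tmp/table.db', 'c')   # 'c': create if needed
    db['spam'] = 'eggs'
    print db['spam']
    db.close()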

> ...
> Agreed, many have realized over the years that hash tables can have
> quadratic behavior in an adversarial environment.

People from real-time backgrounds are much more paranoid than that, and
perhaps the security-conscious could learn a lot from them.  For example,
you're not going to find a dynamic hash table in a real-time app, because
they can't take a chance on bad luck either.  To them, "an adversary" is
just an unlucky roll of the dice, and they can't tolerate it.

> It isn't hidden. Cormen, Leiserson, and Rivest even warn about this in
> their seminal algorithms textbook in 1991.

Knuth warned about it a lot earlier than that <wink> -- but see his summary
of hashing in TAoCP, Vol 3; it gives strong warnings.

> It *is* obvious when thought of, but the reason I was able to ship out
> so many vulnerability reports yesterday is that few actually *have*
> thought of that deterministic worst-case when writing their programs. I
> predict this trend will continue.

I appreciate that, and I expect it to continue too.  A better solution
would be for more languages to offer a choice of containers with
different O() behaviors.  In C it's hard to make progress because the
standard language comes with so little, and so many "portable" C libraries
aren't.  The C++ world is in better shape, constrained more by the
portability of standard C++ itself.  There's less excuse for Python or Perl
programmers to screw up in these areas, because libraries written in Python
and Perl are very portable, and there are a lot of 'em to choose from.

> I like hash tables a lot; with UH, their time bounds are randomized
> but pretty tight, and the constant factors of balanced binary trees
> far exceed those of hash tables.

Probably, but have you used a tree implementation into which the same heroic
level of analysis and coding effort has been poured?  The typical portable-C
balanced tree implementation should be viewed as a worst-case bound on how
fast balanced trees can actually work.

Recently, heroic efforts have been poured into Judy tries, which may be both
faster and more memory-efficient than hash tables in many kinds of apps:

    http://judy.sourceforge.net/

The code for Judy tries makes UHASH look trivial, though.  OTOH, provided
the Judy code actually works on your box, and there aren't bugs hiding in
its thousands of lines of difficult code, relying on a Judy "array" for good
worst-case behavior isn't a matter of luck.