[Python-Dev] Algoritmic Complexity Attack on Python

Scott A Crosby scrosby@cs.rice.edu
30 May 2003 18:56:51 -0500

On Fri, 30 May 2003 19:07:38 -0400, Tim Peters <tim.one@comcast.net> writes:

> [Scott Crosby]
> > ...
> > Have you done benchmarking to prove that string_hash is in fact an
> > important hotspot in the python interpreter?
> It depends on the specific application, of course.  The overall speed of
> dicts is crucial, but in many "namespace" dict apps the strings are
> interned, implying (among other things) that a hash code is computed only
> once per string.  In those apps the speed of the string hash function isn't
> so important.  Overall, though, I'd welcome a faster string hash, and I
> agree that Python's isn't particularly zippy.

Actually, at least on x86, it is faster than perl. On other platforms,
it may be somewhat slower.

> OTOH, it's just a couple lines of C that run fine on everything from Palm
> Pilots to Crays, and have created no portability or maintenance headaches.
> Browsing your site, you've got over 100KB of snaky C code to implement
> hashing functions, some with known bugs, others with cautions about open
> questions wrt platforms with different endianness and word sizes than the
> code was initially tested on.  Compared to what Python is using now, that's
> a maintenance nightmare.

Yes, I am aware of the problems with the UHASH code. Unfortunately, I
am not a hash function designer, that code is not mine, and I only use
it as a black box.

I also consider all code, until verified otherwise, to potentially
suffer from endianness, alignment, and 32/64 bit issues. Excluding
alignment issues (which I'm not sure whether to say that its OK to
fail on strange alignments or not) it has passed *my* self-tests on
big endian and 64 bit.

> + Python dicts can be indexed by many kinds of immutable objects.
>   Strings are just one kind of key, and Python has many hash functions.
> + If I understand what you're selling, the hash code of a given string
>   will almost certainly change across program runs.  That's a very
>   visible change in semantics, since hash() is a builtin Python
>   function available to user code.  Some programs use hash codes to
>   index into persistent (file- or database- based) data structures, and
>   such code would plain break if the hash code of a string changed
>   from one run to the next.  I expect the user-visible hash() would have
>   to continue using a predictable function.

The hash has to be keyed upon something. It is possible to store the
key in a file and reuse the same one across all runs. However,
depending on the universal hash function used, leaking pairs of
(input,hash(input)) may allow an attacker to determine the secret key,
and allow attack again. But yeah, preserving these semantics becomes
very messy. The hash-key becomes part of the system state that must be
migrated along with other data that depends on it.

> + Some Python apps run for months, and universal hashing doesn't remove
>   the possibility of quadratic-time behavior.  If I can poke at a
>   long-running app and observe its behavior, over time I can deduce a

I argued on linux-kernel with someone else that this was extremely
unlikely. It requires the latency of a collision/non-collision being
noticable over a noisy system, network stack, and system. In almost
all cases, for short inputs, the cost of a single L2 cache miss far
exceeds that of hashing.

A more serious danger is an application that leaks actual hash values.

> If it's mod by a prime, probably.

I'd benchmark it in practice, microbenchmarking on a sparc says that
it is rather expensive. However, on an X86, the cost of an L2 cache
miss exceeds the cost of hashing a small string. You'd have a better
idea what impact this might have on the total runtime of the system in
the worst case.

> > then you may wish to consider Jenkin's hash. The linux kernel people
> > are switching to using a keyed veriant of Jenkin's hash. However,
> > Jenkin's, AFAIK, has no proofs that it is in fact universal. It,
> > however, probably is safe.
> Nobody writing a Python program *has* to use a dict.  That dicts have
> quadratic-time worst-case behavior isn't hidden, and there's no cure for

Agreed, many have realized over the years that hash tables can have
quadratic behavior in an adversarial environment. It isn't
hidden. Cormen, Lieserson, and Rivest even warn about this in their
seminal algorithms textbook in 1991. It *is* obvious when thought of,
but the reason I was able to ship out so many vulnerability reports
yesterday was because few actually *have* thought of that
deterministic worst-case when writing their programs. I predict this
trend to continue.

I like hash tables a lot, with UH, their time bounds are randomized,
but are pretty tight and the constant factors far exceed those of
balanced binary trees.