[issue13703] Hash collision security issue

Wed Jan 18 23:52:46 CET 2012

STINNER Victor <victor.stinner at haypocalc.com> added the comment:

> Don't you think that the number of corrections you have to apply in order
> to get the tests working again shows how much impact such a change would
> have in real-world applications ?

Let see the diffstat:

 Doc/using/cmdline.rst                       |    7
 Include/pythonrun.h                         |    2
 Include/unicodeobject.h                     |    6
 Lib/json/__init__.py                        |    4
 Lib/os.py                                   |   17 -
 Lib/packaging/create.py                     |    7
 Lib/packaging/tests/test_create.py          |   18 -
 Lib/test/mapping_tests.py                   |    2
 Lib/test/regrtest.py                        |    5
 Lib/test/test_builtin.py                    |    1
 Lib/test/test_dis.py                        |   36 ++-
 Lib/test/test_gdb.py                        |   11 -
 Lib/test/test_inspect.py                    |    1
 Lib/test/test_os.py                         |   35 ++-
 Lib/test/test_set.py                        |   25 ++
 Lib/test/test_unicode.py                    |   39 ++++
 Lib/test/test_urllib.py                     |   16 -
 Lib/test/test_urlparse.py                   |    6
 Lib/tkinter/test/test_ttk/test_functions.py |    2
 Makefile.pre.in                             |    1
 Modules/posixmodule.c                       |  126 ++-----------
 Objects/unicodeobject.c                     |   20 +-
 PCbuild/pythoncore.vcproj                   |    4
 Python/pythonrun.c                          |    3
 Python/random.c                             |  268 ++++++++++++++++++++++++++++
 25 files changed, 488 insertions(+), 174 deletions(-)

Except Lib/packaging/create.py, all other changes are related to the
introduction of the randomized hash function, or fix tests... Even
Lib/packaging/create.py change is related to fixing tests. The test
can be changed differently, but I like the idea of having always the
same output in packaging (e.g. it is more readable for the user if
files are sorted).

I expected to have to do something on multiprocessing, but nope, it
doesn't care of the hash value.

So I expect something similar in applications: no change in the
applications, but a lot of hacks/tricks in tests.

> Perhaps we should start to think about a compromise: make both the
> collision counting and the hash seeding optional and let the user
> decide which option is best.

I don't think that we need two fixes for a single vulnerability (in
the same Python version), one is enough. If we decide to count
collisions, the randomized hash idea can be simply dropped. But we may
use a different fix for Python 3.3 and for stable versions (e.g. count
collisions for stable versions and use randomized hash for 3.3).

> BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
> which needlessly complicates the code and doesn't any additional
> protection against hash value collisions

How does it complicate the code? It adds an extra XOR to hash(str) and
4 or 8 bytes in memory, that's all. It is more difficult to compute
the secret from hash(str) output if there is a prefix *and* a suffix.
If there is only a prefix, knowning a single hash(str) value is just
enough to retrieve directly the secret.
.
> I don't think it affects more than 0.01% of applications/users :)

It would help to try a patched Python on a real world application like
Django to realize how much code is broken (or not) by a randomized
hash function.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13703>
_______________________________________