Hash collision security issue (now public)

In http://mail.python.org/pipermail/python-dev/2012-January/115368.html Stefan Behnel wrote:
They SHOULD NOT represent the same content; comparing two strings currently requires converting them to canonical form, which means the smallest format (of those three) that works.

If a string can be represented in PyUnicode_1BYTE_KIND, then representations using PyUnicode_2BYTE_KIND or PyUnicode_4BYTE_KIND don't count as canonical, won't be created by Python itself, and already compare unequal according to both PyUnicode_RichCompare and stringlib/eq.h (a shortcut used by dicts).

That said, I don't think smallest-format is actually enforced with anything stronger than comments (such as in unicodeobject.h, struct PyASCIIObject) and asserts (mostly calls to _PyUnicode_CheckConsistency). I don't have any insight into how prevalent non-conforming strings will be in practice, or whether supporting their equality will be required as a bugfix.

-jJ
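To make the "smallest format that works" rule concrete, here is a minimal Python sketch. The helper name canonical_kind is hypothetical and not part of CPython; it only models the rule by returning the number of bytes per code point (1, 2 or 4) that the canonical representation would use, following the PEP 393 ranges.

```python
def canonical_kind(text):
    """Return the bytes per code point (1, 2 or 4) of the smallest kind
    that can hold every code point in `text`.

    Hypothetical helper: it models the PEP 393 rule only and does not
    inspect the interpreter's real in-memory representation.
    """
    # An empty string has no code points; treat its maximum as 0.
    max_cp = max(map(ord, text), default=0)
    if max_cp <= 0xFF:
        return 1   # fits PyUnicode_1BYTE_KIND
    if max_cp <= 0xFFFF:
        return 2   # needs PyUnicode_2BYTE_KIND
    return 4       # needs PyUnicode_4BYTE_KIND


print(canonical_kind("abc"), canonical_kind("\u0100"), canonical_kind("\U0001F600"))
```

Under this model, storing "abc" with 2 or 4 bytes per code point would be a non-canonical representation, because 1 byte per code point already suffices.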

On Sun, Jan 8, 2012 at 16:33, Jim Jewett <jimjjewett@gmail.com> wrote:
In http://mail.python.org/pipermail/python-dev/2012-January/115368.html Stefan Behnel wrote:
Can you please configure your mail client to not create new threads like this? As if this topic wasn't already hard enough to follow, it now exists across handfuls of threads with the same title.

Jim Jewett, 08.01.2012 23:33:
That's what I meant. AFAIR, the PEP 393 discussions at some point brought up the suspicion that third-party code may end up generating Unicode strings that do not comply with that "invariant". So internal code shouldn't strictly rely on it when it deals with user-provided data. One example is the "unequal kinds" optimisation in equality comparison, which, if I'm not mistaken, wasn't implemented, due to exactly this reasoning. The same applies to hashing then.

Stefan
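The concern about an "unequal kinds" shortcut can be illustrated with a small model. This is a sketch only: ModelStr, pack, eq_with_kind_shortcut and eq_by_content are hypothetical names, and the byte layout (little-endian, fixed width per code point) is an assumption made for the illustration, not a description of CPython's code.

```python
from typing import NamedTuple


class ModelStr(NamedTuple):
    """Toy stand-in for a PEP 393 string (hypothetical, not a CPython type)."""
    kind: int    # bytes per code point: 1, 2 or 4
    data: bytes  # code points packed little-endian, `kind` bytes each


def pack(text, kind):
    """Store `text` using `kind` bytes per code point, canonical or not."""
    return ModelStr(kind, b"".join(ord(ch).to_bytes(kind, "little") for ch in text))


def code_points(s):
    """Decode the buffer back into a list of integer code points."""
    return [int.from_bytes(s.data[i:i + s.kind], "little")
            for i in range(0, len(s.data), s.kind)]


def eq_with_kind_shortcut(a, b):
    """Equality that trusts the invariant: different kinds mean unequal."""
    if a.kind != b.kind:
        return False
    return a.data == b.data


def eq_by_content(a, b):
    """Equality that ignores the storage width and compares code points."""
    return code_points(a) == code_points(b)


narrow = pack("abc", 1)  # canonical form
wide = pack("abc", 2)    # non-canonical: same content, wider storage

print(eq_with_kind_shortcut(narrow, wide))  # False
print(eq_by_content(narrow, wide))          # True
```

In this model the two buffers also differ byte for byte, so a hash computed over the raw buffer would in general differ as well, which is the hashing side of the same problem.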

If you only use Python, you cannot create a string in a non-canonical form. If you use the C API, you can create a string in a non-canonical form using PyUnicode_New() + PyUnicode_WRITE, or PyUnicode_FromUnicode(NULL, length) (or PyUnicode_FromStringAndSize(NULL, length)) + direct access to the Py_UNICODE* string.

If you create strings in a non-canonical form, it is a bug in your application and Python doesn't help you. But how could Python help you? Expose a function to check your newly created string? There is already _PyUnicode_CheckConsistency(), but it is slow (O(n)) because it checks each character, so it is only used in debug mode.

Victor
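As a rough illustration of why such a check is O(n), the following Python sketch scans every code point of a buffer that claims a given kind. The function check_consistency here is hypothetical and models only the smallest-kind part of the invariant; it is not a port of _PyUnicode_CheckConsistency().

```python
def check_consistency(kind, code_points):
    """Return True if `kind` (1, 2 or 4 bytes per code point) is exactly
    the smallest kind that can hold all of `code_points`.

    Hypothetical model: it has to look at every code point once, which is
    where the O(n) cost comes from.
    """
    limit = 1 << (8 * kind)  # first value that no longer fits in `kind` bytes
    max_cp = 0
    for cp in code_points:
        if cp >= limit:
            return False     # value does not fit in the claimed kind at all
        if cp > max_cp:
            max_cp = cp

    # Derive the smallest kind from the largest code point seen.
    if max_cp <= 0xFF:
        smallest = 1
    elif max_cp <= 0xFFFF:
        smallest = 2
    else:
        smallest = 4
    return kind == smallest


print(check_consistency(1, [0x61, 0x62]))  # True: canonical
print(check_consistency(2, [0x61, 0x62]))  # False: wider than necessary
```

A C extension that fills a string buffer by hand would need a comparable pass over its data to be sure the result is canonical.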

participants (4): Brian Curtin, Jim Jewett, Stefan Behnel, Victor Stinner