[Python-Dev] Hash collision security issue (now public)

Mon Jan 9 09:13:15 CET 2012

Jim Jewett, 08.01.2012 23:33:
> Stefan Behnel wrote:
>> Admittedly, this may require some adaptation for the PEP393 unicode memory
>> layout in order to produce identical hashes for all three representations
>> if they represent the same content.
> 
> They SHOULD NOT represent the same content; comparing two strings
> currently requires converting them to canonical form, which means the
> smallest format (of those three) that works.
> [...]
> That said, I don't think smallest-format is actually enforced with
> anything stronger than comments (such as in unicodeobject.h struct
> PyASCIIObject) and asserts (mostly calling
> _PyUnicode_CheckConsistency).

That's what I meant. AFAIR, the PEP393 discussions at some point brought up
the suspicion that third party code may end up generating Unicode strings
that do not comply with that "invariant". So internal code shouldn't
strictly rely on it when it deals with user provided data. One example is
the "unequal kinds" optimisation in equality comparison, which, if I'm not
mistaken, wasn't implemented, due to exactly this reasoning. The same
applies to hashing then.

Stefan