[Python-Dev] Counting collisions for the win

Mon Jan 23 09:53:06 CET 2012

Hello,

I'd still prefer to see a randomized hash()-function (at least for 3.3).

But to protect against the attacks it would be sufficient to use
randomization for collision resolution in dicts (and sets).

What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached.

The benefits:

* protection against the known attacks
* hash(X) stays stable and the same
* dict order is only changed when there are many collisions
* doctests will not break
* enhanced collision resolution
* RNG doesn't have to be initialized in smaller programs
* nearly no slowdown of most dicts
* second hash-function is only used for keys with higher collision-rate
* lower probability to leak secrets
* possibility to use different secrets for each dict

The drawback:

* need to add a second hash-function
* slower than using one hash-function only, when > 20 collisions
* need to add this to container-types? (if used for py3.3)
* need to expose this to the user? (if used for py3.3)
* works only for datatypes with this new function
* possible to implement without breaking ABI?

The following code is meant for explanation purpose only:

for(perturb = hash; ; perturb >>= 5) {
     i = (i << 2) + i + perturb + 1;

     if((collisions++) == 20) {
         // perturb is already zero after 13 rounds.
         // 20 collisions are rare.

         // you can add && (ma_mask > 256) to make 100% sure
         // that it's not used for smaller dicts.

         if(Py_TYPE(key)->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH) {
             // If type has a randomized hash, use this now for lookup
             i = perturb = PyObject_RandomizedHash(key));
         }
    .....

If I got this right we could add a new function "tp_randomized_hash"
to 3.3 release. But can we also add functions in older releases, without
breaking ABI?

If not, can we implement this somehow using a flag?

FOR OLDER RELEASE < 3.3:

Py_hash_t
PyObject_RandomizedHash(PyVarObject *o) {
     PyTypeObject *tp = Py_TYPE(v);
     if(! (tp->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH))
         return -1;

     global_flags_somewhere->USE_RANDOMIZED_HASH = 1;
     return (*tp->tp_hash)(v);
}

.... and in unicodeobject.c: (and wherever we need randomization)

static Py_hash_t
unicode_hash(PyUnicodeObject *self)
{
     Py_ssize_t len;
     Py_UNICODE *p;
     Py_hash_t x;
     Py_hash_t prefix=0;
     Py_hash_t suffix=0;

     if(global_flags_somewhere->USE_RANDOMIZED_HASH) {
         global_flags_somewhere->USE_RANDOMIZED_HASH = 0;
         initialize_rng_if_not_already_done_and_return_seed(&prefix, 
&suffix);

     ..... (and don't cache in this case) .....

It's ugly, but if I understand this correctly, the GIL will
protect us against race-conditions, right?

Hello, internals experts: Would this work or is there a better way to do
this without breaking the ABI?

Frank