[Python-Dev] Counting collisions w/ no need for a fatal exception

Wed Jan 25 06:24:31 CET 2012

On Sun, Jan 22, 2012 at 10:41 PM, Tim Delaney
<timothy.c.delaney at gmail.com> wrote:
> On 23 January 2012 16:49, Lennart Regebro <regebro at gmail.com> wrote:
>>
>> On Mon, Jan 23, 2012 at 06:02, Paul McMillan <paul at mcmillan.ws> wrote:
>> >> We may use a different salt per dictionary.
>> >
>> > If we're willing to re-hash everything on a per-dictionary basis. That
>> > doesn't seem reasonable given our existing usage.
>>
>> Well, if we get crazy amounts of collisions, re-hashing with a new
>> salt to get rid of those collisions seems quite reasonable to me...
>
>
> Actually, this looks like it has the seed of a solution in it. I haven't
> scrutinised the following beyond "it sounds like it could work" - it could
> well contain nasty flaws.
>
> Assumption: We only get an excessive number of collisions during an attack
> (directly or indirectly).
> Assumption: Introducing a salt into hashes will change those hashes
> sufficiently to mitigate the attack (all discussion of randomising hashes
> makes this assumption).
>
> 1. Keep the current hashing (for all dictionaries) i.e. just using
> hash(key).
>
> 2. Count collisions.
>
> 3. If any key hits X collisions change that dictionary to use a random salt
> for hashes (at least for str and unicode keys). This salt would be
> remembered for the dictionary.
>
> Consequence: The dictionary would need to be rebuilt when an attack was
> detected.
> Consequence: Hash caching would no longer occur for this dictionary, making
> most operations more expensive.
> Consequence: Anything relying on the iteration order of a dictionary which
> has suffered excessive conflicts would fail.

+1

I like this!  The dictionary would still be O(n) but the constant cost
in front of that just went up.  When you are dealing with keys coming
in from outside of the process, those are unlikely to already have any
hash values so the constant cost at insertion time has really not
changed at all because they would need hashing anyways. Their cost at
non-iteration lookup time will be a constant factor greater but I do
not see that as being a problem given that known keys being looked up
in a

This approach also allows for the dictionary hashing mode switch to
occur after a lower number of collisions than the previous
investigations into raising a MemoryError or similar were asking for
(because they wanted to avoid false hard failures).  It prevents that
case from breaking in favor of a brief performance hiccup.

I would *combine* this with a per process/interpreter-instance seed in
3.3 and later for added impact (less need for this code path to ever
be triggered).  For the purposes of backporting as a security fix,
that part would be disabled by default but #1-3 would be enabled by
default.

Question A: Does the dictionary get rebuilt -again- with a new
dict-salt if a large number of collisions occurs after a dict-salt has
already been established?

Question B: Is there a size of dictionary in which we refuse to
rebuild & rehash it because it would simply be too costly?  obviously
if we lack the ram to malloc a new table, when else?  ever?

Suggestion: Would there be any benefit to making the number of
collisions threshold on when to rebuild & rehash a log function of the
dictionary's current size rather than a constant for all dicts?

>
> 4. (Optional) in 3.3, provide a way to get a dictionary with random salt
> (i.e. not wait for attack).

I don't like #4 as a documented public API as I'm not sure how well
that'd play with other VMs (I suppose they could ignore it) but it
would be useful for dict implementation testing purposes and easier
studying of the behavior.  If this is added it should be a method on
the dict such as ._set_hash_salt() or something and for testing
purposes it would be good to allow a dictionary to be queried to see
if they are using their own salt or not (perhaps just
._get_hash_salt() returning non 0 means they are?)

-gps