[Python-Dev] Counting collisions for the win

Glenn Linderman v+python at g.nevcal.com
Mon Jan 23 19:25:43 CET 2012

On 1/23/2012 12:53 AM, Frank Sievertsen wrote:
> What if we use a second (randomized) hash-function in case there
> are many collisions in ONE lookup. This hash-function is used only
> for collision resolution and is not cached. 

So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.

Compared to SafeDict as a programmer tool, fallback-dict has these benefits:

* No need to change program (or library) source to respond to an attack
* Order is preserved until the collision threshold has been reached
* Performance is preserved until the collision threshold has been reached

and costs:

* converting the dict from one hash to the other by rehashing all the keys.

Compared to always randomizing the hash, fallback-dict has these benefits:

* hash (and perfomance) is deterministic: programs running on the same 
data set will have the same performance characteristic, unless the 
collision threshold is reached for that data set.
* lower probability to leak secrets, because each attacked set/dict can 
have its own secret, randomized hash seed
* patch would not need to include RNG initialization during startup, 
lowering the impact on short-running programs.

What is not clear is how much SafeDict degrades performance when it is 
used; non-cached hashes will definitely have an impact.  I'm not sure 
whether an implementation of fallback-dict in C code, would be 
significantly faster than the implementation of SafeDict in Python, to 
know whether comparing the performance of SafeDict and dict would be 
representative of the two stages of fallback-dict performance, but 
certainly the performance cost of SafeDict would be an upper bound on 
the performance cost of fallback-dict, once conversion takes place, but 
would not measure the conversion cost.  The performance of fallback-dict 
does have to be significantly better than the performance of dict with 
collisions to be beneficial, but if the conversion cost is significant, 
triggering conversions could be an attack vector.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20120123/96fa0456/attachment.html>

More information about the Python-Dev mailing list