[Python-Dev] Hashing proposal: change only string-only dicts

Antoine Pitrou solipsis at pitrou.net
Tue Jan 17 22:26:11 CET 2012


On Tue, 17 Jan 2012 21:59:28 +0100
"Martin v. Löwis" <martin at v.loewis.de> wrote:
> I'd like to propose a different approach to seeding the string hashes:
> only do so for dictionaries involving only strings, and leave the
> tp_hash slot of strings unchanged.

I think Python 3 would be better with a clean fix (all hashes
randomized).
Now for Python 2... The problem with this idea is that it only
addresses str dicts. Unicode dicts, and any other dicts, are left
vulnerable. Unicode dicts are quite likely in Web
frameworks/applications and other places which have well-thought text
semantics.

That said, here's a suggestion to squeeze those bits:

> 1. add an additional field to all string objects, to cache the second
>    hash value.
>    a) variant: in 3.3, drop the extra field, and declare that hashes
>    may change across runs

In 2.7, a string object has the following fields:

    long ob_shash;
    int ob_sstate;

Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
could cache a "hash perturbation" computed from the string and the
random bits:

- hash() would use ob_shash
- dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))

This way, you cache almost all computations, adding only a computation
and a couple logical ops when looking up a string in a dict.

Regards

Antoine.




More information about the Python-Dev mailing list