[Python-Dev] Hashing proposal: change only string-only dicts

Thu Jan 19 16:17:18 CET 2012

On Jan 18, 2012 12:55 PM, Martin v. Löwis <martin at v.loewis.de> wrote:
>
> Am 18.01.2012 17:01, schrieb PJ Eby:
> > On Tue, Jan 17, 2012 at 7:58 PM, "Martin v. Löwis" <martin at v.loewis.de
> > <mailto:martin at v.loewis.de>> wrote:
> >
> >     Am 17.01.2012 22:26, schrieb Antoine Pitrou:
> >     > Only 2 bits are used in ob_sstate, meaning 30 are left. These 30
bits
> >     > could cache a "hash perturbation" computed from the string and the
> >     > random bits:
> >     >
> >     > - hash() would use ob_shash
> >     > - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate &
~3))
> >     >
> >     > This way, you cache almost all computations, adding only a
computation
> >     > and a couple logical ops when looking up a string in a dict.
> >
> >     That's a good idea. For Unicode, it might be best to add another
slot
> >     into the object, even though this increases the object size.
> >
> >
> > Wouldn't that break the ABI in 2.x?
>
> I was thinking about adding the field at the end, so I thought it
> shouldn't. However, if somebody inherits from PyUnicodeObject, it still
> might - so my new proposal is to add the extra hash into the str block,
> either at str[-1], or after the terminating 0. This would cause an
> average increase of four bytes of the storage (0 bytes in 50% of the
> cases, 8 bytes because of padding in the other 50%).
>
> What do you think?

So far it sounds like the very best solution of all, as far as backward
compatibility is concerned.  If the extra bits are only used when two
strings have a matching hash value, the only doctests that could be
affected are ones testing for this issue.  ;-)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20120119/181f1ad6/attachment.html>