<br><div class="gmail_quote">On Wed, Jan 18, 2012 at 9:55 AM, &quot;Martin v. Löwis&quot; <span dir="ltr">&lt;<a href="mailto:martin@v.loewis.de" target="_blank">martin@v.loewis.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Am 18.01.2012 17:01, schrieb PJ Eby:<br>

<div>&gt; On Tue, Jan 17, 2012 at 7:58 PM, &quot;Martin v. Löwis&quot; &lt;<a href="mailto:martin@v.loewis.de" target="_blank">martin@v.loewis.de</a><br>

</div><div><div>&gt; &lt;mailto:<a href="mailto:martin@v.loewis.de" target="_blank">martin@v.loewis.de</a>&gt;&gt; wrote:<br>

&gt;<br>

&gt;     Am 17.01.2012 22:26, schrieb Antoine Pitrou:<br>

&gt;     &gt; Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits<br>

&gt;     &gt; could cache a &quot;hash perturbation&quot; computed from the string and the<br>

&gt;     &gt; random bits:<br>

&gt;     &gt;<br>

&gt;     &gt; - hash() would use ob_shash<br>

&gt;     &gt; - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate &amp; ~3))<br>

&gt;     &gt;<br>

&gt;     &gt; This way, you cache almost all computations, adding only a computation<br>

&gt;     &gt; and a couple logical ops when looking up a string in a dict.<br>

&gt;<br>

&gt;     That&#39;s a good idea. For Unicode, it might be best to add another slot<br>

&gt;     into the object, even though this increases the object size.<br>

&gt;<br>

&gt; Wouldn&#39;t that break the ABI in 2.x?<br>

<br>

</div></div>I was thinking about adding the field at the end, so I thought it<br>

shouldn&#39;t. However, if somebody inherits from PyUnicodeObject, it still<br>

might - so my new proposal is to add the extra hash into the str block,<br>

either at str[-1], or after the terminating 0. This would cause an<br>

average increase of four bytes of the storage (0 bytes in 50% of the<br>

cases, 8 bytes because of padding in the other 50%).<br>

<br>

What do you think?<br></blockquote><div><br></div><div>str[-1] is not likely to work if you want to maintain ABI compatibility.  Appending it to the data after the terminating \0 is more likely to be possible, but if there is any possibility that existing compiled extension modules have somehow inlined code to do allocation of the str field even that is questionable (i don&#39;t think there are?).</div>


<div><br></div><div>I&#39;d also be concerned about C API code that uses PyUnicode_Resize(). How do you keep track of if you have filled in these extra bytes at the end in or not?  allocation and resize fill it with a magic value indicating &quot;not filled in&quot; similar to a tp_hash of -1?</div>


<div><br></div><div>Regardless of all of this, I don&#39;t think this fully addresses the overall issue as strings within other hashable data structures like tuples would not be treated this way, only strings directly stored in a dict.  Sure you can continue on and &quot;fix&quot; tuples and such in a similar manner but then what about user defined classes that implement __hash__ based on the return value of hash() on some strings they contain?</div>


<div><br></div><div>I don&#39;t see anything I&#39;d consider a real complete fix unless we also backport the randomized hash code so that people who need a guaranteed fix can enable it and use it.</div><div><br></div><div>


-gps</div></div>