dict changes [was: Ordered storage of keyword arguments]
![](https://secure.gravatar.com/avatar/d91ce240d2445584e295b5406d12df70.jpg?s=120&d=mm&r=g)
On 10/28/10, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Thu, 28 Oct 2010 19:58:59 +0200 spir <denis.spir@gmail.com> wrote:
What does the current implementation use as buckets?
It uses an open addressing strategy. Each dict entry holds three pointer-sized fields: key object, value object, and cached hash value of the key. (set entries have only two fields, since they don't hold a value object)
Has anyone benchmarked not storing the hash value here? For a string dict, that hash should already be available on the string object itself, so it is redundant. Keeping it obviously improves cache locality, but ... it also makes the dict objects 50% larger, and there is a chance that the strings themselves would already be in cache anyhow. And if strings were reliably interned, the comparison check should normally just be a pointer compare -- possibly fast enough that the "different hash" shortcut doesn't buy anything. [caveats about still needing to go to the slower dict implementation for string subclasses] -jJ
![](https://secure.gravatar.com/avatar/60cac87fb9e2b5689242622999656cb0.jpg?s=120&d=mm&r=g)
On Oct 28, 2010, at 11:44 AM, Jim Jewett wrote:
It uses an open addressing strategy. Each dict entry holds three pointer-sized fields: key object, value object, and cached hash value of the key. (set entries have only two fields, since they don't hold a value object)
Has anyone benchmarked not storing the hash value here
That would be a small disaster. Either you call PyObject_Hash() for every probe (adding function call overhead for int and str, and adding tons of work for other types) or you can go directly to Py_RichCompareBool() which is never fast. I haven't timed this for dicts, but I did see major speed boosts in the performance of set-to-set operations when the internally stored hash was used instead of calling PyObject_Hash(). Raymond
![](https://secure.gravatar.com/avatar/db5f70d2f2520ef725839f046bdc32fb.jpg?s=120&d=mm&r=g)
Le jeudi 28 octobre 2010 à 14:44 -0400, Jim Jewett a écrit :
For a string dict, that hash should already be available on the string object itself, so it is redundant. Keeping it obviously improves cache locality, but ... it also makes the dict objects 50% larger, and there is a chance that the strings themselves would already be in cache anyhow. And if strings were reliably interned, the comparison check should normally just be a pointer compare -- possibly fast enough that the "different hash" shortcut doesn't buy anything. [caveats about still needing to go to the slower dict implementation for string subclasses]
I've thought about that. The main annoyance is to be able to switch transparently between the two implementations. But I think it would be interesting to pursue that effort, since indeed dicts with interned keys are the most common case of dicts in the average Python workload. Saving 1/3 of the memory size on these dicts would be worthwhile IMO. (addressing itself would perhaps be a bit simpler, because of multiplying by 8 or 16 instead of multiplying by 12 or 24. But I doubt the difference would be noticeable) Regartds Antoine.
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
Antoine Pitrou wrote:
Le jeudi 28 octobre 2010 à 14:44 -0400, Jim Jewett a écrit :
For a string dict, that hash should already be available on the string object itself, so it is redundant. Keeping it obviously improves cache locality, but ... it also makes the dict objects 50% larger, and there is a chance that the strings themselves would already be in cache anyhow. And if strings were reliably interned, the comparison check should normally just be a pointer compare -- possibly fast enough that the "different hash" shortcut doesn't buy anything. [caveats about still needing to go to the slower dict implementation for string subclasses]
I've thought about that. The main annoyance is to be able to switch transparently between the two implementations. But I think it would be interesting to pursue that effort, since indeed dicts with interned keys are the most common case of dicts in the average Python workload. Saving 1/3 of the memory size on these dicts would be worthwhile IMO.
Are you sure ? In the age of GB RAM, runtime performance appears to be more important than RAM usage. Moving the hash comparison out of the dict would likely cause (cache) locality to no longer trigger.
(addressing itself would perhaps be a bit simpler, because of multiplying by 8 or 16 instead of multiplying by 12 or 24. But I doubt the difference would be noticeable)
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 29 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
![](https://secure.gravatar.com/avatar/db5f70d2f2520ef725839f046bdc32fb.jpg?s=120&d=mm&r=g)
Le vendredi 29 octobre 2010 à 10:51 +0200, M.-A. Lemburg a écrit :
I've thought about that. The main annoyance is to be able to switch transparently between the two implementations. But I think it would be interesting to pursue that effort, since indeed dicts with interned keys are the most common case of dicts in the average Python workload. Saving 1/3 of the memory size on these dicts would be worthwhile IMO.
Are you sure ? In the age of GB RAM, runtime performance appears to be more important than RAM usage. Moving the hash comparison out of the dict would likely cause (cache) locality to no longer trigger.
Good point. It probably depends on the collision rate. Also, a string key dict could be optimized for interned strings, in which case the hash comparison is unnecessary. (knowing whether the key is interned could be stored in e.g. the low-order bit of the key pointer) Regards Antoine.
participants (4)
-
Antoine Pitrou
-
Jim Jewett
-
M.-A. Lemburg
-
Raymond Hettinger