Repeatability of looping over dicts
This post describes work aimed at getting Django to run on Jython: http://zyasoft.com/pythoneering/2008/01/django-on-jython-minding-gap.html One outstanding issue is whether to use Java's ConcurrentHashMap type to underly Jython's dict type. See <http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ConcurrentHashM...>. ConcurrentHashMap scales better in the face of threading because it doesn't lock the whole table when updating it, but iterating over the map can return elements in a different order each time. This would mean that list(dict_var) doesn't return values in the same order as a later call to list(dict_var), even if dict_var hasn't been modified. Why? Under the hood, there are 32 different locks, each guarding a subset of the hash buckets, so if there are multiple threads iterating over the dictionary, they may not go through the buckets in order. <http://www.ibm.com/developerworks/java/library/j-jtp08223/> discusses the implementation, at least in 2003. So, do Python implementations need to guarantee that list(dict_var) == a later result from list(dict_var)? --amk
A.M. Kuchling wrote:
So, do Python implementations need to guarantee that list(dict_var) == a later result from list(dict_var)?
As I just posted to the blog, yes. Look at section 3.8 of the reference manual (Mapping Types), specifically footnote 3: http://docs.python.org/lib/typesmapping.html (3) Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary's history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): "pairs = zip(a.values(), a.keys())". The same relationship holds for the iterkeys() and itervalues() methods: "pairs = zip(a.itervalues(), a.iterkeys())" provides the same value for pairs. Another way to create the same list is "pairs = [(v, k) for (k, v) in a.iteritems()]". Tim Delaney
On Jan 4, 2008, at 3:05 PM, Tim Delaney wrote:
As I just posted to the blog, yes. Look at section 3.8 of the reference manual (Mapping Types), specifically footnote 3:
You type faster than I do. :-) This guarantee has been in place for about a decade, as I recall. -Fred -- Fred Drake <fdrake at acm.org>
A.M. Kuchling wrote:
So, do Python implementations need to guarantee that list(dict_var) == a later result from list(dict_var)?
As I just posted to the blog, yes. Look at section 3.8 of the reference manual (Mapping Types), specifically footnote 3: http://docs.python.org/lib/typesmapping.html (3) Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary's history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): "pairs = zip(a.values(), a.keys())". The same relationship holds for the iterkeys() and itervalues() methods: "pairs = zip(a.itervalues(), a.iterkeys())" provides the same value for pairs. Another way to create the same list is "pairs = [(v, k) for (k, v) in a.iteritems()]". Tim Delaney
On Sat, Jan 05, 2008 at 07:05:55AM +1100, Tim Delaney wrote:
So, do Python implementations need to guarantee that list(dict_var) == a later result from list(dict_var)?
As I just posted to the blog, yes. Look at section 3.8 of the reference manual (Mapping Types), specifically footnote 3:
Ah, thanks! I guess that rules out using ConcurrentHashMap, short of materializing the entire list and sorting it in some way. Oh well. --amk
On Jan 4, 2008 12:05 PM, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond.
I looked over the Java code briefly (http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/Concurr...) and I don't see anything that would cause it to violate this condition. In particular, the locks don't affect the order. It's a complicated class though, so I could have missed it. Do you see a reason for it to change its iteration order spontaneously? If another thread is concurrently modifying the dict, that's an intervening modification, and changing the iteration order is fine. There's the second question of whether using ConcurrentHashMap for python dicts is a good idea. I can't find any mention of thread-safety in the Jython docs, so I assume it follows Java's rules, which are that most variables must be locked in order to be accessed and modified concurrently. Making dict a ConcurrentHashMap would provide a stronger guarantee for some built-in types but not others, which is likely to confuse people. Further, not all synchronized maps can be replaced by ConcurrentHashMap because you may need groups of operations to be atomic, while CHM only provides atomicity across single method calls. I think a new ConcurrentDict class would probably be a better way to integrate ConcurrentHashMap. -- Namasté, Jeffrey Yasskin
On Jan 4, 2008 11:50 AM, A.M. Kuchling <amk@amk.ca> wrote:
This post describes work aimed at getting Django to run on Jython: http://zyasoft.com/pythoneering/2008/01/django-on-jython-minding-gap.html
One outstanding issue is whether to use Java's ConcurrentHashMap type to underly Jython's dict type. See <http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ConcurrentHashM...>.
ConcurrentHashMap scales better in the face of threading because it doesn't lock the whole table when updating it, but iterating over the map can return elements in a different order each time. This would mean that list(dict_var) doesn't return values in the same order as a later call to list(dict_var), even if dict_var hasn't been modified.
Why? Under the hood, there are 32 different locks, each guarding a subset of the hash buckets, so if there are multiple threads iterating over the dictionary, they may not go through the buckets in order. <http://www.ibm.com/developerworks/java/library/j-jtp08223/> discusses the implementation, at least in 2003.
So, do Python implementations need to guarantee that list(dict_var) == a later result from list(dict_var)?
What code would break if we loosened this restriction? I guess defining d.items() as zip(d.keys(), d.values()) would no longer fly, but does anyone actually depend on this? Just like we changed how we think about auto-closing files once Jython came along, I think this is at least worth considering. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Jan 4, 2008, at 5:54 PM, Guido van Rossum wrote:
What code would break if we loosened this restriction? I guess defining d.items() as zip(d.keys(), d.values()) would no longer fly, but does anyone actually depend on this?
I don't know what code would break today; this was initially added to the set of promises made by the dict methods a decode ago, but it was in response to a need in the software I was working on at the time (possibly Grail, for which 2.6/3.0 isn't an issue). That question should probably be addressed to a fairly wide audience (comp.lang.python) since the promise has been there for so long. -Fred -- Fred Drake <fdrake at acm.org>
On Fri, Jan 04, 2008 at 02:54:49PM -0800, Guido van Rossum wrote:
What code would break if we loosened this restriction? I guess defining d.items() as zip(d.keys(), d.values()) would no longer fly, but does anyone actually depend on this? Just like we changed how we
http://www.google.com/codesearch?hl=en&q=+lang:python+zip+keys&start=10&sa=N turns up a few pieces of code that would break: trac-0.10.3/trac/versioncontrol/cache.py Twisted-2.2.0/twisted/pb/test/test_banana.py Mattricks-0.7/Mattricks/MatchPrediction.py FlickrUploadr-1.0.0/src/xmltramp.py Projects trying to stay compatible with Python versions that don't have .items() may be more likely to use this idiom. Some classes may do this to implement .items(); openbabel-2.1.1/scripts/python/pybel.py does this. So some code will break, and I don't see a way to warn about this problem before breaking compatibility. --amk
Guido> What code would break if we loosened this restriction? I don't know how much, but I do know I've relied on this behavior before. (In fact, I've asked about it before.) I guess the counter question to yours would be, "What would be gained by loosening this restriction"? If the answer is, "not much", then I don't see why this is even an idle thought. Skip
On Jan 5, 2008 6:58 AM, <skip@pobox.com> wrote:
Guido> What code would break if we loosened this restriction?
I don't know how much, but I do know I've relied on this behavior before. (In fact, I've asked about it before.) I guess the counter question to yours would be, "What would be gained by loosening this restriction"? If the answer is, "not much", then I don't see why this is even an idle thought.
It sounds like loosening the restriction would allow Jython to use an implementation that is more efficient when used concurrently. That may not be sufficient reason; Jython apps that need a more efficient concurrent dict could import the ConcurrentHashMap class directly, and CPython apps are out of luck anyway. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 1/4/08, A.M. Kuchling <amk@amk.ca> wrote:
This post describes work aimed at getting Django to run on Jython: http://zyasoft.com/pythoneering/2008/01/django-on-jython-minding-gap.html
One outstanding issue is whether to use Java's ConcurrentHashMap type to underly Jython's dict type. See <http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ConcurrentHashM...>.
As a side note, you may also want to consider using a NonBlockingHashMap as implemented by Cliff Click instead of ConcurrentHashMap: http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html http://blogs.azulsystems.com/cliff/2007/03/talking_with_go.html http://video.google.com/videoplay?docid=2139967204534450862 http://sourceforge.net/projects/high-scale-lib I haven't looked to see if it has the same issues repeatability issues but it does scale even better. Anyways, could you not use something with repeatability issues with a wrapper that caches lists of keys to satisfy the repeatability? (don't lock and maintain the list, just regenerate the list on demand when a caller needs it and discard the cached list anytime a key is added or deleted) Also, should we drop this guarantee in py3k to give future implementations more flexibility? -gps
ConcurrentHashMap scales better in the face of threading because it doesn't lock the whole table when updating it, but iterating over the map can return elements in a different order each time. This would mean that list(dict_var) doesn't return values in the same order as a later call to list(dict_var), even if dict_var hasn't been modified.
Why? Under the hood, there are 32 different locks, each guarding a subset of the hash buckets, so if there are multiple threads iterating over the dictionary, they may not go through the buckets in order. <http://www.ibm.com/developerworks/java/library/j-jtp08223/> discusses the implementation, at least in 2003.
So, do Python implementations need to guarantee that list(dict_var) == a later result from list(dict_var)?
--amk
participants (7)
-
A.M. Kuchling -
Fred Drake -
Gregory P. Smith -
Guido van Rossum -
Jeffrey Yasskin -
skip@pobox.com -
Tim Delaney