thread safe dictionary initialisation from mappings: dict(non_dict_mapping)

Hello, I found the following annoying behaviour of dict(non_dict_mapping) and dict.update(non_dict_mapping), if non_dict_mapping implements collections.abc.Mapping but is not an instance of dict. In this case the implementations of dict() and dict.update() use PyDict_Merge(PyObject *a, PyObject *b, int override). The essential part of PyDict_Merge(a,b, override) is # update dict a with the content of mapping b. keys = b.keys() for key in keys: ... a[key] = b.__getitem__(key) This algorithm is susceptible to race conditions, if a second thread modifies the source mapping b between "b.keys()" and b.__getitem__(key): - If the second thread deletes an item from b, PyDict_Merge fails with a KeyError exception. - If the second thread inserts a new value and then modifies an existing value, a contains the modified value but not the new value. Of course the current behaviour is the best you can get with a "minimum mapping interface". My Goal ------- I want to be able to implement a mapping so that "dict(mapping)" works in an "atomic" way. Requirements for a solution --------------------------- - it does not modify the behaviour of existing code - the performance must be similar to the current implementation - simple to implement - sufficiently generic Proposal -------- My idea is to define a new optional method "__update_other__(setter)" for mappings. A minimal implementation of this method looks like: def __update_other__(self, setter): for key in self.keys(): setter(key, self[key]) Now it is possible to extend PyDict_Merge(PyObject *a, PyObject *b, int override) to check, if the source mapping b implements "__update_other__()". If it does, PyDict_Merge simply calls b.__update_other__. Otherwise it falls back to the current implementation using keys() and __getitem__(). Pseudo code for PyDict_Merge(PyObject *a, PyObject *b, int override) if hasattr(b, "__update__other__") and callable(b.__update_other__): if override: b.__update_other__(a.__setitem__) else: b.__update_other__(a.setdefault) return # old code follows keys = b.keys() for key in keys: a[key] = b.__getitem__(key) Example ------- A typical implementation of __update_other__ for a mapping with concurrent access looks like: def __update_other__(self, setter): # perform an appropriate locking with self._lock: for key in self._keys: setter(key, self._values[key]) Note 1: The __update_other__(self, setter) interface hides the nature of the caller. It is also possible to retrieve only the keys or only the values using suitable setter arguments. Note 2: In order to preserve the behaviour of existing code, the method __update_other__() must not be added to existing mapping implementations (collections.abc.Mapping, UserDict, ...). But adding support for __update_other__ to __init__() and update() of existing implementations is OK. Note 3: If we add add a test for __update_other__ to PyDict_Merge in front of the "PyDict_Check(b)" test this would also solve the issues described in http://bugs.python.org/issue1615701. A sub-class of dict could implement __update_other__ to prevent PyDict_Merge from accessing its underlying dict implementation. Note 4: Alternative approach Obviously it is possible to write "dict(non_dict_mapping.items())" (or even "dict(non_dict_mapping.iteritems())", if iteritems() is implemented in a suitable way), but constructing the list of items is an expensive operation. Additionally it is fairly common and pythonic to write "dict(mapping)" to get a copy of mapping. Anselm -- Dipl. Phys. Anselm Kruis science + computing ag Senior Solution Architect Ingolstädter Str. 22 email A.Kruis@science-computing.de 80807 München, Germany phone +49 89 356386 874 fax 737 www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196

On 11/19/2012 12:24 PM, Anselm Kruis wrote:
It is well-known that mutating a collection while iterating over it can lead to unexpected or undesired behavior, including exceptions. This is not limited updating a dict from a non-dict source. The generic answer is Don't Do That.
Of course the current behaviour is the best you can get with a "minimum mapping interface".
To me, if you know that the source in d.update(source) is managed (and mutated) in another thread, the obvious solution (to Not Do That) is to lock the source. This should work for any source and for any similar operation. What am I missing? Instead, you propose to add a specialized, convoluted method that only works for updates of dicts by non_dict_mappings that happen to have a new and very specialized magic method that automatically does the lock. Sorry, I don't see the point. It is not at all a generic solution to a generic problem. -- Terry Jan Reedy

Am 19.11.2012 22:09, schrieb Terry Reedy:
Actually that's not the case here: the implementation of dict does not iterate over the collection while another thread mutates the collection. It iterates over a list of the keys and this list does not change.
It is the automatic locking. For list- and set-like collections it is already possible to implement this kind of automatic locking, because iterating over them returns the complete information. Mappings are special because of their key-value items. If automatic locking of a collection is the right solution to a particular problem, depends on the problem. There are problems, where automatic locking is the best choice. I think, python should support it. (If my particular applications belongs to this class of problems is another question and not relevant here.) Regards Anselm -- Dipl. Phys. Anselm Kruis science + computing ag Senior Solution Architect Ingolstädter Str. 22 email A.Kruis@science-computing.de 80807 München, Germany phone +49 89 356386 874 fax 737 www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196

On Tue, Nov 20, 2012 at 7:33 PM, Anselm Kruis <a.kruis@science-computing.de>wrote:
Whether or not the keys() method makes a copy of the underlying keys is entirely up to the collection - e.g. the Python 3 dict type returns a live view of the underlying dictionary from keys()/values()/items() rather than returning a list copy as it did in Python 2. Building and working with containers in a thread safe manner is inherently challenging, and given that the standard Python containers only make minimal efforts in that direction (relying heavily on the GIL and the way it interacts with components written in C) it's unlikely you're ever going to achieve adequate results without exposing an explicit locking API. For example, you could make your container a context manager, so people could write: with my_threadsafe_mapping: dict_copy = dict(my_threadsafe_mapping) This has the advantage of making it easy to serialise *any* multi-step operation on your container on its internal lock, not just the specific case of copying to a builtin dictionary. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, 20 Nov 2012 10:33:53 +0100 Anselm Kruis <a.kruis@science-computing.de> wrote:
Automatic locking is rarely the solution to any problem, especially with such general-purposes objects as associative containers. In many cases, you have a bit more to protect than simply the dict's contents. Most of Python's types and routines promise to be "safe" in the face of threads in the sense that they won't crash, but they don't promise to do "what you want" since what you want will really depend on the use case. (there are a couple of special cases such as binary files where we try to ensure meaningful thread-safety, because that's what users expect from their experience with system APIs) Regards Antoine.

On Tue, Nov 20, 2012 at 10:30 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
There is an open documentation issue related to this here: http://bugs.python.org/issue15339 --Chris

On 11/19/2012 12:24 PM, Anselm Kruis wrote:
It is well-known that mutating a collection while iterating over it can lead to unexpected or undesired behavior, including exceptions. This is not limited updating a dict from a non-dict source. The generic answer is Don't Do That.
Of course the current behaviour is the best you can get with a "minimum mapping interface".
To me, if you know that the source in d.update(source) is managed (and mutated) in another thread, the obvious solution (to Not Do That) is to lock the source. This should work for any source and for any similar operation. What am I missing? Instead, you propose to add a specialized, convoluted method that only works for updates of dicts by non_dict_mappings that happen to have a new and very specialized magic method that automatically does the lock. Sorry, I don't see the point. It is not at all a generic solution to a generic problem. -- Terry Jan Reedy

Am 19.11.2012 22:09, schrieb Terry Reedy:
Actually that's not the case here: the implementation of dict does not iterate over the collection while another thread mutates the collection. It iterates over a list of the keys and this list does not change.
It is the automatic locking. For list- and set-like collections it is already possible to implement this kind of automatic locking, because iterating over them returns the complete information. Mappings are special because of their key-value items. If automatic locking of a collection is the right solution to a particular problem, depends on the problem. There are problems, where automatic locking is the best choice. I think, python should support it. (If my particular applications belongs to this class of problems is another question and not relevant here.) Regards Anselm -- Dipl. Phys. Anselm Kruis science + computing ag Senior Solution Architect Ingolstädter Str. 22 email A.Kruis@science-computing.de 80807 München, Germany phone +49 89 356386 874 fax 737 www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196

On Tue, Nov 20, 2012 at 7:33 PM, Anselm Kruis <a.kruis@science-computing.de>wrote:
Whether or not the keys() method makes a copy of the underlying keys is entirely up to the collection - e.g. the Python 3 dict type returns a live view of the underlying dictionary from keys()/values()/items() rather than returning a list copy as it did in Python 2. Building and working with containers in a thread safe manner is inherently challenging, and given that the standard Python containers only make minimal efforts in that direction (relying heavily on the GIL and the way it interacts with components written in C) it's unlikely you're ever going to achieve adequate results without exposing an explicit locking API. For example, you could make your container a context manager, so people could write: with my_threadsafe_mapping: dict_copy = dict(my_threadsafe_mapping) This has the advantage of making it easy to serialise *any* multi-step operation on your container on its internal lock, not just the specific case of copying to a builtin dictionary. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, 20 Nov 2012 10:33:53 +0100 Anselm Kruis <a.kruis@science-computing.de> wrote:
Automatic locking is rarely the solution to any problem, especially with such general-purposes objects as associative containers. In many cases, you have a bit more to protect than simply the dict's contents. Most of Python's types and routines promise to be "safe" in the face of threads in the sense that they won't crash, but they don't promise to do "what you want" since what you want will really depend on the use case. (there are a couple of special cases such as binary files where we try to ensure meaningful thread-safety, because that's what users expect from their experience with system APIs) Regards Antoine.

On Tue, Nov 20, 2012 at 10:30 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
There is an open documentation issue related to this here: http://bugs.python.org/issue15339 --Chris
participants (5)
-
Anselm Kruis
-
Antoine Pitrou
-
Chris Jerdonek
-
Nick Coghlan
-
Terry Reedy