[RFC] Removing pure Python implementation of OrderedDict
Hi, all.
Currently, deque and defaultdict have only C implementation.
Python implementations should provide _collections.deque
and _collections.defaultdict.
Like that, how about removing OrderedDict Pure Python implementation
from stdlib and require it to implementation?
## Pros
### Thread safety
AFAIK, there are no thread safety guarantee in OrderedDict.
I don't look carefully, but some methods seems thread unsafe.
I'm considering adding `OrderedDict.lru_get(key, default=None)` which
atomically lookup key and move the key to end if found.
It can be used when functools.lru_cahce can't be used.
(see http://bugs.python.org/issue28193 )
And thread safety is very important for such method.
Anyway, thread safety will improve usability of OrderedDict significantly,
like defaultdict.
### Less maintenance cost of test_ordered_dict.
I'm sending pull request for removing doubly linked list from C OrderedDict,
for faster creatin, iteration, and reduce memory usage by 1/2.
See https://bugs.python.org/issue31265
While implementing it, I noticed that current test_ordered_dict has some
tests about implementation detail without @cpython_only decorator.
For example:
https://github.com/python/cpython/blob/fcd97d44382df520e39de477a5ec99dd89c3f...
While current test expects KeyError, my pull request (and PyPy's OrderedDict)
doesn't raise KeyError because inconsistency between doubly linked list
and dict never happens.
PyPy changed the test:
https://bitbucket.org/pypy/pypy/src/ac3e33369ba0aa14278a6a3e8f2add09590d358c/lib-python/3/test/test_ordered_dict.py?at=py3.6&fileviewer=file-view-default#test_ordered_dict.py-525:542
My pull request has same change:
https://github.com/methane/cpython/pull/9/files#diff-23c28e7fa52967682669d90...
Maintain compatibility with odd behavior of Pure Python implementation
is not constructive job not only for us, but also other Python implementations.
### `import collections` bit faster.
Pure Python implementation has 5 classes (OrderedDict, _Link,
_OrderedDictKeyViews, _OrderedDictItemsView, and _OrderedDictValuesView).
Three of them inheriting from ABC. So it makes importing collections
bit slower.
## Cons
* All Python 3.7 implementations should provide _collections.OrderedDict
PyPy has it already. But I don't know about micropython.
Regards,
INADA Naoki
05.09.17 11:38, INADA Naoki пише:
## Cons
* All Python 3.7 implementations should provide _collections.OrderedDict PyPy has it already. But I don't know about micropython.
Current C implementation of OrderedDict is not safe regarding using mutating dict methods (or dict C API) like dict.__setitem__ or PyDict_SetItem. Using them can cause hangs or segfaults. See issue24726 and issue25410. I hope your implementation will solve these issues, but there may be others. While the C implementation still is not enough mature, we should allow users that encountered one of such issues to use pure Python implementation which is free from hangs and segfaults.
I thought the decision a few years ago was that all modules that have a C library for performance reasons should also have a Python version? Did this decision change at some point? (just curious).
On Tue, Sep 5, 2017 at 9:58 PM, Simon Cross
I thought the decision a few years ago was that all modules that have a C library for performance reasons should also have a Python version? Did this decision change at some point? (just curious).
But in this case, not only for optimize, but also better behavior.
Pure Python version is not thread safe even for `od[k] = v`.
It's very difficult to write thread safe complex container in Pure Python.
Regards,
INADA Naoki
On Tue, Sep 5, 2017 at 5:58 AM, Simon Cross
I thought the decision a few years ago was that all modules that have a C library for performance reasons should also have a Python version? Did this decision change at some point? (just curious).
It was never meant as a hard rule. IIRC the actual rule is more that *if* you have both a C and a Python version they need to supply the same API, *and* the naming should be e.g. _pickle (C) and pickle (Python), *and* the Python version should automatically replace itself with the C version when the latter exists (e.g. by putting an import * at the end with a try/except around it). There are tons of modules that only have a C version and there's no need to change this -- and I'm fine with occasionally a module removing the Python version when the C version is deemed sufficiently stable and compatible. (In some cases a move in the opposite direction may also be reasonable. No two cases are exactly alike.) -- --Guido van Rossum (python.org/~guido)
On 9/5/2017 10:40 AM, Guido van Rossum wrote:
On Tue, Sep 5, 2017 at 5:58 AM, Simon Cross
mailto:hodgestar+pythondev@gmail.com> wrote: I thought the decision a few years ago was that all modules that have a C library for performance reasons should also have a Python version? Did this decision change at some point? (just curious).
It was never meant as a hard rule. IIRC the actual rule is more that *if* you have both a C and a Python version they need to supply the same API, *and* the naming should be e.g. _pickle (C) and pickle (Python), *and* the Python version should automatically replace itself with the C version when the latter exists (e.g. by putting an import * at the end with a try/except around it).
*And* both versions should be tested in the test file ;-). -- Terry Jan Reedy
On Tue, Sep 5, 2017 at 8:48 PM, Serhiy Storchaka
05.09.17 11:38, INADA Naoki пише:
## Cons
* All Python 3.7 implementations should provide _collections.OrderedDict PyPy has it already. But I don't know about micropython.
Current C implementation of OrderedDict is not safe regarding using mutating dict methods (or dict C API) like dict.__setitem__ or PyDict_SetItem. Using them can cause hangs or segfaults. See issue24726 and issue25410. I hope your implementation will solve these issues, but there may be others.
Thanks for sharing these issues. My pull request is still early stage and may have some bugs. But I think it will solve many of issues because most odd behaviors comes from difficult of consistency between linked list and dict.
While the C implementation still is not enough mature, we should allow users that encountered one of such issues to use pure Python implementation which is free from hangs and segfaults.
I see. Maybe, we can move Pure Python OrderedDict to collections._odict module, and collections/__init__.py can be like this: from _collections import OrderedDict # Pure Python implementation can be used when C implementation cause segfault. # from collections._odict import OrderedDict
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com
On Tue, 5 Sep 2017 17:38:06 +0900
INADA Naoki
Like that, how about removing OrderedDict Pure Python implementation from stdlib and require it to implementation?
I don't like this. The C version of OrderedDict is probably very hard to read, while the Python version helps understand the semantics. Also, when devising a new API, it's much easier to first try it out on the Python version.
I'm considering adding `OrderedDict.lru_get(key, default=None)` which atomically lookup key and move the key to end if found.
The Python version needn't have the same atomicity requirements as the C version. Regards Antoine.
On Tue, Sep 5, 2017 at 1:38 AM, INADA Naoki
Like that, how about removing OrderedDict Pure Python implementation from stdlib and require it to implementation?
-1 Like Antoine, I consider the pure Python implementation to be valuable. Furthermore, the pure Python implementation is the reference, so its behavior is idiomatic.
### Thread safety
AFAIK, there are no thread safety guarantee in OrderedDict. I don't look carefully, but some methods seems thread unsafe.
What isn't thread-safe? I know that Raymond has a good understanding of this area. For instance, he was very clear about re-entrancy concerns when I was working on the C implementation. I recommend getting feedback from him on this. FWIW, I don't recall any bugs related to thread-safety in OrderedDict, even though it's been around a while.
[snip]
### Less maintenance cost of test_ordered_dict.
[snip]
I don't find this to be a strong argument. If there are concerns with the reference behavior then those should be addressed rather than used to justify removing the implementation.
### `import collections` bit faster.
[snip]
This is not a strong argument. The difference is not significant enough to warrant removing the reference implementation. So, again, I'm against removing the pure Python implementation of OrderedDict. I highly recommend making sure you have Raymond's cooperation before making changes in the collections module. -eric
On Tue, Sep 5, 2017 at 12:13 PM, Eric Snow
On Tue, Sep 5, 2017 at 1:38 AM, INADA Naoki
wrote: Like that, how about removing OrderedDict Pure Python implementation from stdlib and require it to implementation?
-1
Like Antoine, I consider the pure Python implementation to be valuable. Furthermore, the pure Python implementation is the reference, so its behavior is idiomatic.
-1 as well, I'm in favor of keeping Python implementations of things implemented in C, no matter the order in which they are added.
### Less maintenance cost of test_ordered_dict.
[snip]
I don't find this to be a strong argument. If there are concerns with the reference behavior then those should be addressed rather than used to justify removing the implementation.
I agree here as well. We have plenty of things implemented in both Python and C, and have helpers in test.support to make it (relatively, at least) easy to test both. PEP 399 is also relevant here, I think. -- Zach
First of all, I saw enough -1 so I gave up about removing.
Following reply is just a technical topic.
On Wed, Sep 6, 2017 at 2:13 AM, Eric Snow
Like Antoine, I consider the pure Python implementation to be valuable. Furthermore, the pure Python implementation is the reference, so its behavior is idiomatic.
What is *idiomatic*? For example, in this test case: https://github.com/python/cpython/blob/564a2c68add64ebf2e558a54f5697513b1929... def test_dict_delitem(self): OrderedDict = self.OrderedDict od = OrderedDict() od['spam'] = 1 od['ham'] = 2 dict.__delitem__(od, 'spam') with self.assertRaises(KeyError): repr(od) Since dict.__delitem__ is same to OrderedDict.__delitem__ in PyPy implementation, repr(od) doesn't raise KeyError. >>>> import collections >>>> od = collections.OrderedDict() >>>> od['spam'] = 1 >>>> od['ham'] = 2 >>>> dict.__delitem__(od, 'spam') >>>> repr(od) "OrderedDict([('ham', 2)])" Do you think this behavior is not idiomatic? I feel both are OK and there are no idiomatic behavior. The tested behavior seems just an implementation detail. At least, PyPy modified this test as I wrote in previous mail. It makes sense to me.
### Thread safety
AFAIK, there are no thread safety guarantee in OrderedDict. I don't look carefully, but some methods seems thread unsafe.
What isn't thread-safe? I know that Raymond has a good understanding of this area. For instance, he was very clear about re-entrancy concerns when I was working on the C implementation. I recommend getting feedback from him on this. FWIW, I don't recall any bugs related to thread-safety in OrderedDict, even though it's been around a while.
For example, when one thread do `od[k1] = v1` and another thread do `od[k2] = v2`, result should be equivalent to one of `od[k1] = v1; od[k2] = v2;` or `od[k2] = v2; od[k1] = v1`. And internal data structure should be consistent. But see current Pure Python implementation: def __setitem__(self, key, value, dict_setitem=dict.__setitem__, proxy=_proxy, Link=_Link): 'od.__setitem__(i, y) <==> od[i]=y' # Setting a new item creates a new link at the end of the linked list, # and the inherited dictionary is updated with the new key/value pair. if key not in self: self.__map[key] = link = Link() root = self.__root last = root.prev link.prev, link.next, link.key = last, root, key last.next = link root.prev = proxy(link) dict_setitem(self, key, value) If thread is switched right after `root = self.__root`, either k1 or k2 is disappeared in linked list. More worth, if thread switch is happen right after `last = root.prev`, linked list will be broken. Fixing such issue is not easy job. See http://bugs.python.org/issue14976 for example. On the other hand, thanks to GIL, C implementation (not only mine, but also your's) doesn't have such issue. Sometime, writing reentrant and threadsafe container in C is easier than in Pure Python. Regards, P.S. Since it's over 3 a.m. in Japan, I'll read and reply other mails tomorrow. Sorry for delay.
On Wed, 6 Sep 2017 03:20:43 +0900
INADA Naoki
What is *idiomatic*? For example, in this test case:
https://github.com/python/cpython/blob/564a2c68add64ebf2e558a54f5697513b1929...
def test_dict_delitem(self): OrderedDict = self.OrderedDict od = OrderedDict() od['spam'] = 1 od['ham'] = 2 dict.__delitem__(od, 'spam') with self.assertRaises(KeyError): repr(od)
Since dict.__delitem__ is same to OrderedDict.__delitem__ in PyPy implementation, repr(od) doesn't raise KeyError.
Since this test is testing an implementation detail, it can safely be moved to a test class specific to the pure Python implementation.
For example, when one thread do `od[k1] = v1` and another thread do `od[k2] = v2`, result should be equivalent to one of `od[k1] = v1; od[k2] = v2;` or `od[k2] = v2; od[k1] = v1`. And internal data structure should be consistent.
I agree the pure Python OrderedDict is not thread-safe. But who said it should? And, actually, are you sure the C implementation is? GIL switches can happen at many points. A simple Py_DECREF, or perhaps even a tuple allocation can trigger a GC run that may release the GIL at some point in a finalizer function.
Sometime, writing reentrant and threadsafe container in C is easier than in Pure Python.
The C and Python versions needn't make exactly the same guarantees. Regards Antoine.
For example, when one thread do `od[k1] = v1` and another thread do `od[k2] = v2`, result should be equivalent to one of `od[k1] = v1; od[k2] = v2;` or `od[k2] = v2; od[k1] = v1`. And internal data structure should be consistent.
I agree the pure Python OrderedDict is not thread-safe. But who said it should?
No one. I meant just let's say "it should" from Python 3.7.
And, actually, are you sure the C implementation is? GIL switches can happen at many points. A simple Py_DECREF, or perhaps even a tuple allocation can trigger a GC run that may release the GIL at some point in a finalizer function.
I know such difficulity already. I thought simple (when key is str or int) __setitem__ doesn't break internal state of current C implementation of OrderedDict. And I'll carefully review my code too.
Sometime, writing reentrant and threadsafe container in C is easier than in Pure Python.
The C and Python versions needn't make exactly the same guarantees.
Yes, of course.
But what we can recommend for library developers (including stdlib)?
Currently, dict ordering is implementation detail of CPython and PyPy.
We don't recommend to rely on the behavior.
Like that, should we say "atomic & threadsafe __setitem__ for simple
key is implementation detail of CPython and PyPy. We recommend
using mutex when using OrderedDict from multiple thread."?
That was my point about thread safety.
Regards,
INADA Naoki
On Wed, 6 Sep 2017 11:26:52 +0900
INADA Naoki
Like that, should we say "atomic & threadsafe __setitem__ for simple key is implementation detail of CPython and PyPy. We recommend using mutex when using OrderedDict from multiple thread."?
I think you may be overstating the importance of making OrderedDict thread-safe. It's quite rare to be able to rely on the thread safety of a single structure, since most often your state is more complex than that and you have to use a lock anyway. The statu quo is that only experts rely on the thread-safety of list and dict, and they should be ready to reconsider if some day the guarantees change. Regards Antoine.
On 6 September 2017 at 11:09, Antoine Pitrou
On Wed, 6 Sep 2017 11:26:52 +0900 INADA Naoki
wrote: Like that, should we say "atomic & threadsafe __setitem__ for simple key is implementation detail of CPython and PyPy. We recommend using mutex when using OrderedDict from multiple thread."?
I think you may be overstating the importance of making OrderedDict thread-safe. It's quite rare to be able to rely on the thread safety of a single structure, since most often your state is more complex than that and you have to use a lock anyway.
The statu quo is that only experts rely on the thread-safety of list and dict, and they should be ready to reconsider if some day the guarantees change.
Agreed. I wasn't even aware that list and dict were guaranteed threadsafe (in the language reference). And even if they are, there's going to be a lot of provisos that mean in practice you need to know what you're doing to rely on that. Simple example: mydict[the_value()] += 1 isn't thread safe, no matter how thread safe dictionaries are. I don't have a strong opinion on making OrderedDict "guaranteed thread safe" according to the language definition. But I'm pretty certain that whether we do or not will make very little practical difference to the vast majority of Python users. Paul
OK, I stop worring about thread safety and other implementation
detail behavior on edge cases.
Thanks,
INADA Naoki
On 6 September 2017 at 11:09, Antoine Pitrou
wrote: On Wed, 6 Sep 2017 11:26:52 +0900 INADA Naoki
wrote: Like that, should we say "atomic & threadsafe __setitem__ for simple key is implementation detail of CPython and PyPy. We recommend using mutex when using OrderedDict from multiple thread."?
I think you may be overstating the importance of making OrderedDict thread-safe. It's quite rare to be able to rely on the thread safety of a single structure, since most often your state is more complex than that and you have to use a lock anyway.
The statu quo is that only experts rely on the thread-safety of list and dict, and they should be ready to reconsider if some day the guarantees change.
Agreed. I wasn't even aware that list and dict were guaranteed threadsafe (in the language reference). And even if they are, there's going to be a lot of provisos that mean in practice you need to know what you're doing to rely on that. Simple example:
mydict[the_value()] += 1
isn't thread safe, no matter how thread safe dictionaries are.
I don't have a strong opinion on making OrderedDict "guaranteed thread safe" according to the language definition. But I'm pretty certain that whether we do or not will make very little practical difference to the vast majority of Python users.
Paul _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com
https://www.youtube.com/watch?v=pNe1wWeaHOU&list=PLYI8318YYdkCsZ7dsYV01n6TZhXA6Wf9i&index=1
On Wed, Sep 6, 2017 at 5:49 PM, INADA Naoki
OK, I stop worring about thread safety and other implementation detail behavior on edge cases.
Thanks,
INADA Naoki
On Wed, Sep 6, 2017 at 7:40 PM, Paul Moore
wrote: On 6 September 2017 at 11:09, Antoine Pitrou
wrote: On Wed, 6 Sep 2017 11:26:52 +0900 INADA Naoki
wrote: Like that, should we say "atomic & threadsafe __setitem__ for simple key is implementation detail of CPython and PyPy. We recommend using mutex when using OrderedDict from multiple thread."?
I think you may be overstating the importance of making OrderedDict thread-safe. It's quite rare to be able to rely on the thread safety of a single structure, since most often your state is more complex than that and you have to use a lock anyway.
The statu quo is that only experts rely on the thread-safety of list and dict, and they should be ready to reconsider if some day the guarantees change.
Agreed. I wasn't even aware that list and dict were guaranteed threadsafe (in the language reference). And even if they are, there's going to be a lot of provisos that mean in practice you need to know what you're doing to rely on that. Simple example:
mydict[the_value()] += 1
isn't thread safe, no matter how thread safe dictionaries are.
I don't have a strong opinion on making OrderedDict "guaranteed thread safe" according to the language definition. But I'm pretty certain that whether we do or not will make very little practical difference to the vast majority of Python users.
Paul _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com
Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/agoretoy%40gmail.com
On Wed, Sep 6, 2017 at 3:49 AM, INADA Naoki
OK, I stop worring about thread safety and other implementation detail behavior on edge cases.
That sounds like overreacting. At the risk of stating the obvious: I want the data structure itself to maintain its integrity even under edge cases of multi-threading. However that's different from promising that all (or certain) operations will be atomic in all cases. (Plus, for dicts and sets and other data structures that compare items, you can't have atomicity if those comparisons call back into Python -- so it's extra important that even when that happens the data structure's *integrity* is still maintained.) IMO, in edge cases, it's okay to not do an operation, do it twice, get the wrong answer, or raise an exception, as long as the data structure's internal constraints are still satisfied (e.g. no dangling pointers, no inconsistent indexes, that sort of stuff.) -- --Guido van Rossum (python.org/~guido)
On 9/5/2017 2:20 PM, INADA Naoki wrote:
On Wed, Sep 6, 2017 at 2:13 AM, Eric Snow
wrote:
Like Antoine, I consider the pure Python implementation to be valuable. Furthermore, the pure Python implementation is the reference, so its behavior is idiomatic.
To this native English speaker, 'behavior is idiomatic' is not idiomatic ;-). The word 'idiomatic' refers to mode of expression, not mechanical behavior. Perhaps what Eric is saying is that the behavior of the reference implementation is by definition correct. Yet we have learned over the years to be more careful about what cpython behaviors constitute 'python' and which are incidental implementations. 'Improving the test suite' sometimes mean segregating the two.
What is *idiomatic*? Expression that seems natural to a native speaker of a language, dialect, or subculture.
The question you address is which behaviors of a Python OrderedDict are part of the definition and which are implementation-specific, and whether the tests need improvement. -- Terry Jan Reedy
participants (10)
-
alex goretoy
-
Antoine Pitrou
-
Eric Snow
-
Guido van Rossum
-
INADA Naoki
-
Paul Moore
-
Serhiy Storchaka
-
Simon Cross
-
Terry Reedy
-
Zachary Ware