Putting `blist` into collections module
Hi everybody, In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/ The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module." I realize that way more than a few months have passed, but I'd still like to give my input. I really wish that `blist` would be made available in the `collections` module. I'm working on an open-source project right now that needs to use it and I'm really reluctant to include `blist` as a dependency, given that it would basically mean my package wouldn't be pip-installable on Windows machines. Thanks, Ram.
On Sat, Sep 20, 2014 at 08:18:54AM -0700, Ram Rachum wrote:
Hi everybody,
In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/
The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module."
I have not used blist, but I have no objection to it becoming a collections type if the author agrees.
I realize that way more than a few months have passed, but I'd still like to give my input. I really wish that `blist` would be made available in the `collections` module. I'm working on an open-source project right now that needs to use it and I'm really reluctant to include `blist` as a dependency, given that it would basically mean my package wouldn't be pip-installable on Windows machines.
Does it need to be a dependency though? You could make it optional. Assuming that blist has the same API as a list, you could do: try: from blist import blist except ImportError: blist = list which makes blist an optimization, if and when it is available. -- Steven
On Sun, Sep 21, 2014 at 02:37:39PM +1000, Steven D'Aprano wrote:
In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/
The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module."
I have not used blist, but I have no objection to it becoming a collections type if the author agrees.
It seems unsettling that we'd consider adding another special use collection to the stdlib, when more widely applicable generalizations of it are missing. In blist's case, it can (mostly) be trivially reimplemented using an ordered map, which the standard library lacks. The remainder of blist (IIRC) are some fancy slicing and merging methods that exploit the underlying structure. Even after reviewing the original PEP, the presence of OrderedDict (and particularly under that moniker) feels wrong. Since its addition, in every case I've encountered it in commercial code, the use has been superfluous, diabolically miscomprehended, or used as a hacky stand-in for some cleaner, simpler approach. Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
which makes blist an optimization, if and when it is available.
blist has functionality that would require significant boilerplate to replicate in normal code, so it's definitely not just useful as an optimization. David
So write a PEP for sorted dict and tree. I rarely seem to need ordereddict myself, but it makes a clean LRU cache. On Saturday, September 20, 2014, David Wilson <dw+python-ideas@hmmz.org> wrote:
On Sun, Sep 21, 2014 at 02:37:39PM +1000, Steven D'Aprano wrote:
In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/
The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module."
I have not used blist, but I have no objection to it becoming a collections type if the author agrees.
It seems unsettling that we'd consider adding another special use collection to the stdlib, when more widely applicable generalizations of it are missing. In blist's case, it can (mostly) be trivially reimplemented using an ordered map, which the standard library lacks. The remainder of blist (IIRC) are some fancy slicing and merging methods that exploit the underlying structure.
Even after reviewing the original PEP, the presence of OrderedDict (and particularly under that moniker) feels wrong. Since its addition, in every case I've encountered it in commercial code, the use has been superfluous, diabolically miscomprehended, or used as a hacky stand-in for some cleaner, simpler approach.
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
which makes blist an optimization, if and when it is available.
blist has functionality that would require significant boilerplate to replicate in normal code, so it's definitely not just useful as an optimization.
David _______________________________________________ Python-ideas mailing list Python-ideas@python.org <javascript:;> https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (on iPad)
On 2014-09-21 07:36 +02:00, Guido van Rossum wrote:
So write a PEP for sorted dict and tree.
but let us all read it - and the implementation samples - exactly before even thinking about rejecing it ;-) as in the rejected PEP3128 I read (in the fifth paragraph of http://legacy.python.org/dev/peps/pep-3128/#use-case-trade-offs not counting the list): """ The performance for the LIFO use case could be improved to O(n) time, by caching a pointer to the right-most leaf within the root node. For lists that do not change size, the common case of sequential access could also be improved to O(n) time via caching in the root node. [...] """ which - not really being in the topic - I would rather expect to read: """ The performance for the LIFO use case could be improved to O(1) time, by caching a pointer to the right-most leaf within the root node. For lists that do not change size, the common case of sequential access could also be improved to O(1) time via caching in the root node. [...] """ At least this is what I would consider an enhancement over O(log n) and also expect from a single cached pointer implementation and the like. All the best, Stefan.
I rarely seem to need ordereddict myself, but it makes a clean LRU cache.
On Saturday, September 20, 2014, David Wilson <dw+python-ideas@hmmz.org <mailto:dw%2Bpython-ideas@hmmz.org>> wrote:
On Sun, Sep 21, 2014 at 02:37:39PM +1000, Steven D'Aprano wrote:
> > In 2007, this PEP was created that suggested integrating Daniel Stutzbach's > > blist into Python: http://legacy.python.org/dev/peps/pep-3128/ > > > > The PEP was rejected, but Raymond Hettinger made a note that "after a few > > months, I intend to poll comp.lang.python for BList success stories. If > > they exist, then I have no problem with inclusion in the collections > > module." > > I have not used blist, but I have no objection to it becoming a > collections type if the author agrees.
It seems unsettling that we'd consider adding another special use collection to the stdlib, when more widely applicable generalizations of it are missing. In blist's case, it can (mostly) be trivially reimplemented using an ordered map, which the standard library lacks. The remainder of blist (IIRC) are some fancy slicing and merging methods that exploit the underlying structure.
Even after reviewing the original PEP, the presence of OrderedDict (and particularly under that moniker) feels wrong. Since its addition, in every case I've encountered it in commercial code, the use has been superfluous, diabolically miscomprehended, or used as a hacky stand-in for some cleaner, simpler approach.
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
> which makes blist an optimization, if and when it is available.
blist has functionality that would require significant boilerplate to replicate in normal code, so it's definitely not just useful as an optimization.
David _______________________________________________ Python-ideas mailing list Python-ideas@python.org <javascript:;> https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (on iPad)
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On Sun, 21 Sep 2014 11:21:15 +0200 Stefan Drees <stefan@drees.name> wrote:
On 2014-09-21 07:36 +02:00, Guido van Rossum wrote:
So write a PEP for sorted dict and tree.
but let us all read it - and the implementation samples - exactly before even thinking about rejecing it ;-) as in the rejected PEP3128 I read (in the fifth paragraph of http://legacy.python.org/dev/peps/pep-3128/#use-case-trade-offs not counting the list):
""" The performance for the LIFO use case could be improved to O(n) time, by caching a pointer to the right-most leaf within the root node. For lists that do not change size, the common case of sequential access could also be improved to O(n) time via caching in the root node. [...] """
which - not really being in the topic - I would rather expect to read: """ The performance for the LIFO use case could be improved to O(1) time, by caching a pointer to the right-most leaf within the root node. For lists that do not change size, the common case of sequential access could also be improved to O(1) time via caching in the root node. [...] """
At least this is what I would consider an enhancement over O(log n) and also expect from a single cached pointer implementation and the like.
I suspect O(n) means when n doing LIFO iterations, so O(1) amortized. Also, the LIFO (i.e. stack) use case was important for the prospect of replacing the list type with blist. If blist is merely a new container, the LIFO use case is perfectly satisfied by the list type already. By the way, one should remember the PEP was written years ago. I don't know how much the blist type has changed (the author doesn't seem to provide a changelog, unfortunately), but according to the following benchmarks there doesn't seem to be any remaining performance drawback compared to list: http://stutzbachenterprises.com/performance-blist Regards Antoine.
On Sun, Sep 21, 2014 at 2:54 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
By the way, one should remember the PEP was written years ago. I don't know how much the blist type has changed
From memory, below are the major changes that were made after the PEP was written.
Functionality: - Added sorteddict, sortedlist, sortedset, weaksortedlist, weaksortedset types. - Added a btuple type (works like a regular tuple but slices use copy-on-write). Performance: - Added a cache to find the leaf nodes in one step for code that's dominated by __getitem__ operations. - When sorting, switch to a O(n) radix sort if all of the keys are C integers or floats. - Lots of fine-tuning. (the author doesn't seem to provide a changelog, unfortunately), but The changelog is at: https://github.com/DanielStutzbach/blist/commits/master according to the following
benchmarks there doesn't seem to be any remaining performance drawback compared to list:
Microbenchmarks should always be taken with a grain of salt. :-) I was hoping http://speed.python.org/ create an opportunity to make some macrobenchmark comparisons, but the project stalled. -- Daniel Stutzbach
On 21 September 2014 15:18, David Wilson <dw+python-ideas@hmmz.org> wrote:
Even after reviewing the original PEP, the presence of OrderedDict (and particularly under that moniker) feels wrong. Since its addition, in every case I've encountered it in commercial code, the use has been superfluous, diabolically miscomprehended, or used as a hacky stand-in for some cleaner, simpler approach.
The main intended use case for OrderedDict was preserving insertion order, such as when executing code, parsing a JSON file, or maintaining an LRU cache. For many cases involving a *sorted* key, just sorting when necessary is often easier than preserving sort order on every update (not necessarily *faster*, but often fast enough that the extra dependency isn't justified). That's the background any proposal needs to compete against: OrderedDict covers preserving insertion order, while actually sorting the keys or items when the sorted order is needed covers the sorting case. The needle to be threaded to get a "sorted container" into the standard library is clearly explaining *in terms a relatively junior programmer can understand* when you would use it over dict or OrderedDict. In particular, it likely needs to explain the trade-offs between maintaining sort order on insert, and using an unordered container that is sorted for display or serialisation. That justification needs to be in the PEP to justify the initial inclusion, but also in the eventual docs to help folks know when their current problem is the "it" in There Should Be One Obvious Way To Do It for the new type. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, 21 Sep 2014 15:50:32 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 21 September 2014 15:18, David Wilson <dw+python-ideas@hmmz.org> wrote:
Even after reviewing the original PEP, the presence of OrderedDict (and particularly under that moniker) feels wrong. Since its addition, in every case I've encountered it in commercial code, the use has been superfluous, diabolically miscomprehended, or used as a hacky stand-in for some cleaner, simpler approach.
The main intended use case for OrderedDict was preserving insertion order, such as when executing code, parsing a JSON file, or maintaining an LRU cache.
For many cases involving a *sorted* key, just sorting when necessary is often easier than preserving sort order on every update (not necessarily *faster*, but often fast enough that the extra dependency isn't justified).
Except when precisely you need to preserve sort order after every update ;-) Which is at least O(n) using .sort(), but O(log n) using an appropriate structure. That said, I agree with your basic point that OrderedDict is quite useful in itself, and not a poor man's replacement for a binary tree. (what's more, the O(1) lookup behaviour makes it superior to a tree for the many cases where lookup performance is dominant) Regards Antoine.
On Sun, 21 Sep 2014 05:18:38 +0000 David Wilson <dw+python-ideas@hmmz.org> wrote:
On Sun, Sep 21, 2014 at 02:37:39PM +1000, Steven D'Aprano wrote:
In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/
The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module."
I have not used blist, but I have no objection to it becoming a collections type if the author agrees.
It seems unsettling that we'd consider adding another special use collection to the stdlib, when more widely applicable generalizations of it are missing. In blist's case, it can (mostly) be trivially reimplemented using an ordered map, which the standard library lacks.
But can it be *efficiently* reimplemented using an ordered map? blist uses chunks of 128 pointers, which makes it almost as memory efficient as a standard list. A traditional ordered map (I suppose you mean some kind of balanced tree) would generally show much more overhead, if only because it has to store the keys explicitly. And also, you now have to do costly key comparisons every time you do a lookup ("costly" because they go through the object layer and its indirections, as opposed to simple integer arithmetic).
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
This sounds like a false dilemma. We could have a collections.blist *and* a collections.Tree. Regards Antoine.
Long time lurker, first time poster. I think there may be multiple discussions happening here so I wanted to highlight a competing module. I think blist.blist is an excellent data type with a lot of value. But the performance graphs can be a bit misleading if you think they apply to the sortedlist, sorteddict, and sortedset types. In those scenarios, I do not believe blist is the best choice. The SortedContainers module (https://pypi.python.org/pypi/sortedcontainers) provides SortedList, SortedDict, and SortedSet data types. It is implemented in pure-Python, has 100% coverage and hours of stress testing. The API implemented is very close to blist's and a lot of effort has been put into documentation (http://www.grantjenks.com/docs/sortedcontainers/). Furthermore, the data types provided are often faster than their blist counterparts. Extensive performance comparisons against other implementations of sorted list, sorted dict, and sorted set types are documented ( http://www.grantjenks.com/docs/sortedcontainers/performance.html) along with a comparison of runtimes and load-factors (similar to blist, sortedcontainers uses a modified B-tree but with a tunable node size.) For SortedList, an analysis of use cases on Github has been made. These describe five use cases with performance analysis ( http://www.grantjenks.com/docs/sortedcontainers/performance-workload.html) Disclaimer: I am the author of the Python SortedContainers project. Feedback welcome. Grant Jenks On Sun, Sep 21, 2014 at 3:04 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 21 Sep 2014 05:18:38 +0000 David Wilson <dw+python-ideas@hmmz.org> wrote:
On Sun, Sep 21, 2014 at 02:37:39PM +1000, Steven D'Aprano wrote:
In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/
The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module."
I have not used blist, but I have no objection to it becoming a collections type if the author agrees.
It seems unsettling that we'd consider adding another special use collection to the stdlib, when more widely applicable generalizations of it are missing. In blist's case, it can (mostly) be trivially reimplemented using an ordered map, which the standard library lacks.
But can it be *efficiently* reimplemented using an ordered map? blist uses chunks of 128 pointers, which makes it almost as memory efficient as a standard list. A traditional ordered map (I suppose you mean some kind of balanced tree) would generally show much more overhead, if only because it has to store the keys explicitly. And also, you now have to do costly key comparisons every time you do a lookup ("costly" because they go through the object layer and its indirections, as opposed to simple integer arithmetic).
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
This sounds like a false dilemma. We could have a collections.blist *and* a collections.Tree.
Regards
Antoine.
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On Sep 21, 2014, at 17:30, Grant Jenks <grant.jenks@gmail.com> wrote:
Long time lurker, first time poster. I think there may be multiple discussions happening here so I wanted to highlight a competing module.
I think blist.blist is an excellent data type with a lot of value. But the performance graphs can be a bit misleading if you think they apply to the sortedlist, sorteddict, and sortedset types. In those scenarios, I do not believe blist is the best choice.
The SortedContainers module (https://pypi.python.org/pypi/sortedcontainers) provides SortedList, SortedDict, and SortedSet data types. It is implemented in pure-Python, has 100% coverage and hours of stress testing. The API implemented is very close to blist's and a lot of effort has been put into documentation (http://www.grantjenks.com/docs/sortedcontainers/). Furthermore, the data types provided are often faster than their blist counterparts.
Honestly, this is exactly why I think, despite what I've said in the past, maybe we don't need SortedDict, etc. in the stdlib. The API is something nontrivial, but arguably with a single right answer (and half the implementations I've seen on PyPI get either SortedSet or SortedList wrong), so that makes perfect sense to belong in the stdlib, in collections.abc. But the implementation, there are multiple right answers: B-tree, binary tree, skip list, array that sorts after every insert or before every lookup after an insert, same plus hash... Which one is "best" depends entirely on your data and how you intend to use them. The fact most other languages out there go with red-black trees, but it's hard to find a Python application where those actually perform best, seems like further argument against picking a best solution. Also, making people glance at the docs and see that they have to "pip install SortedCollections" seems like just enough hurdle to slow down the kind of people who have no idea why they're using it beyond "I saw some Java code that used a sorted map", without being a serious roadblock for anyone who actually does need it. So, unless Nick's stdlib++ idea is shot down, I think the best thing to do would be: Add the ABCs, then link to a few different libraries on PyPI--blist, SortedCollections, bintrees, the one I forget the name of that wraps up the sort-an-OrderedDict-on-the-fly recipe that the docs already link to, while encouraging the authors of those packages to implement the APIs and to provide wheels if they have any extension modules.
On Sun, Sep 21, 2014 at 6:32 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
The API is something nontrivial, but arguably with a single right answer (and half the implementations I've seen on PyPI get either SortedSet or SortedList wrong), so that makes perfect sense to belong in the stdlib, in collections.abc.
+1 It would give the dozen or so implementations an API under which to unify and it would make performance comparisons easier for end-users if they could just drop-in an alternative. Also, I think when Andrew refers to SortedCollections, he means SortedContainers. Unless that was a kind of placeholder name. Grant
On 22/09/2014 02:32, Andrew Barnert wrote:
On Sep 21, 2014, at 17:30, Grant Jenks <grant.jenks@gmail.com <mailto:grant.jenks@gmail.com>> wrote:
Long time lurker, first time poster. I think there may be multiple discussions happening here so I wanted to highlight a competing module.
I think blist.blist is an excellent data type with a lot of value. But the performance graphs can be a bit misleading if you think they apply to the sortedlist, sorteddict, and sortedset types. In those scenarios, I do not believe blist is the best choice.
The SortedContainers module (https://pypi.python.org/pypi/sortedcontainers) provides SortedList, SortedDict, and SortedSet data types. It is implemented in pure-Python, has 100% coverage and hours of stress testing. The API implemented is very close to blist's and a lot of effort has been put into documentation (http://www.grantjenks.com/docs/sortedcontainers/). Furthermore, the data types provided are often faster than their blist counterparts.
Honestly, this is exactly why I think, despite what I've said in the past, maybe we don't need SortedDict, etc. in the stdlib.
The API is something nontrivial, but arguably with a single right answer (and half the implementations I've seen on PyPI get either SortedSet or SortedList wrong), so that makes perfect sense to belong in the stdlib, in collections.abc.
But the implementation, there are multiple right answers: B-tree, binary tree, skip list, array that sorts after every insert or before every lookup after an insert, same plus hash... Which one is "best" depends entirely on your data and how you intend to use them. The fact most other languages out there go with red-black trees, but it's hard to find a Python application where those actually perform best, seems like further argument against picking a best solution.
Also, making people glance at the docs and see that they have to "pip install SortedCollections" seems like just enough hurdle to slow down the kind of people who have no idea why they're using it beyond "I saw some Java code that used a sorted map", without being a serious roadblock for anyone who actually does need it.
So, unless Nick's stdlib++ idea is shot down, I think the best thing to do would be: Add the ABCs, then link to a few different libraries on PyPI--blist, SortedCollections, bintrees, the one I forget the name of that wraps up the sort-an-OrderedDict-on-the-fly recipe that the docs already link to, while encouraging the authors of those packages to implement the APIs and to provide wheels if they have any extension modules.
There's all sorts of interesting things discussed here http://kmike.ru/python-data-structures/ so if you didn't know about it you do now :) -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On 22 September 2014 01:30, Grant Jenks <grant.jenks@gmail.com> wrote:
I think blist.blist is an excellent data type with a lot of value. But the performance graphs can be a bit misleading if you think they apply to the sortedlist, sorteddict, and sortedset types. In those scenarios, I do not believe blist is the best choice.
The SortedContainers module (https://pypi.python.org/pypi/sortedcontainers) provides SortedList, SortedDict, and SortedSet data types.
Disclaimer: I am the author of the Python SortedContainers project. Feedback welcome.
How about positive feedback? I think your module is ridiculously awesome. However, despite keeping these in the back of my mind for times I might want to use them, I so rarely seem to. With the overhead of a few small Python files being so low, and the need for these collections being so little, I honestly can't recommend this for the standard library. And if I can't recommend this, I definitely can't recommend the blist alternatives I view as typically worse. I think there is merit in considering a blist.blist as the standard list type; the asymptotics are pretty awesome. We'd be part of a language that's trying something I don't think any other language has dared. But with the amount of one-directional iteration, it seems like the risks are high. I would also hesitate to do anything that would damage PyPy's claim to fame, seeing as it's faster than C++ in some benchmarks¹ and that is the kind of crown you want to keep. But having a second list type would mostly prevent the it from ever showing up because people just don't test with multiple list types. Hooking into speed.pypy.org and getting positive results would really invigorate the debate. ¹ https://gist.github.com/Veedrac/d25148faf20669589993 This is actually a somewhat real-life scenario; I was helping someone yesterday on their thesis where they were optimising some Python code which had exactly the sort of characteristics that this benchmark shows off. Getting close to the speed of C++ by just changing interpreters is a much better option than rewriting in C++ and missing all the optimisations you have to do for it to run fast!
On Sun, Sep 21, 2014, at 01:18, David Wilson wrote:
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
As I understand it, the problem with a tree, SortedDict/SortedSet, or in general any collection that relies on comparison relationships (heap, etc), is: unlike hashing (where Hashable implies an immutable hash relationship), there is no way to detect whether an object implements an immutable well-defined ordering. And that, unlike a mutable hashed object (which can AIUI only lose itself), a mutable sorted object (or a badly-behaved one like NaN) can cause other objects in the set to be inserted in the wrong place or not found.
On Mon, Sep 22, 2014 at 11:48 AM, <random832@fastmail.us> wrote:
On Sun, Sep 21, 2014, at 01:18, David Wilson wrote:
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
As I understand it, the problem with a tree, SortedDict/SortedSet, or in general any collection that relies on comparison relationships (heap, etc), is: unlike hashing (where Hashable implies an immutable hash relationship), there is no way to detect whether an object implements an immutable well-defined ordering. And that, unlike a mutable hashed object (which can AIUI only lose itself), a mutable sorted object (or a badly-behaved one like NaN) can cause other objects in the set to be inserted in the wrong place or not found.
A good point, too often overlooked. We could introduce a convention requiring __hash__() even though it is not used by the implementation. After all, if the object is immutable when it comes to a well-defined ordering, it is immutable when it comes to equality. I don't really want to think about classes that are immutable when it comes to == but not when it comes to <; that seems terrible. -- --Guido van Rossum (python.org/~guido)
On Mon, Sep 22, 2014 at 11:54:03AM -0700, Guido van Rossum wrote:
On Mon, Sep 22, 2014 at 11:48 AM, <random832@fastmail.us> wrote:
As I understand it, the problem with a tree, SortedDict/SortedSet, or in general any collection that relies on comparison relationships (heap, etc), is: unlike hashing (where Hashable implies an immutable hash relationship), there is no way to detect whether an object implements an immutable well-defined ordering. And that, unlike a mutable hashed object (which can AIUI only lose itself), a mutable sorted object (or a badly-behaved one like NaN) can cause other objects in the set to be inserted in the wrong place or not found.
A good point, too often overlooked.
Another concern is structures that rely on comparison could become performance hazards as they are mixed with user subclasses that perform nontrivial work in their __lt__ methods, and suchlike. That's a potentially undesirable trait for a built-in type. Revisiting my previous concern for blist, I realize it can't be separated from any hypothetical Tree or SortedDict or whatever else -- it's not possible to avoid every new addition to the stdlib from becoming an "attractive nuisance". A SortedDict would be as open to misuse as OrderedDict is today, and there would always be calls for a blist literal, etc. My only thought would be to make the interface to such classes sufficiently weird as to ward the unwary away, i.e. not to support the dict interface. Grant's reply (accidentally?) went furthest in convincing me there is sufficient variety in the structures available that fulfill this need, that I'm probably unqualified to comment on which (if any) are appropriate for inclusion in the stdlib. David
On Mon, Sep 22, 2014 at 12:19 PM, David Wilson <dw+python-ideas@hmmz.org> wrote:
Another concern is structures that rely on comparison could become performance hazards as they are mixed with user subclasses that perform nontrivial work in their __lt__ methods, and suchlike. That's a potentially undesirable trait for a built-in type.
I don't get this objection. If a user-defined subclass has slow comparisons then the container containing it becomes slow. You can't blame that on the (fast) base class nor on the container type. The same applies to ==: if I write a str subclass that overrides __eq__ with a slower version than dict lookups become slower. The solution is simple: don't do that. -- --Guido van Rossum (python.org/~guido)
On Sep 22, 2014, at 11:54, Guido van Rossum <guido@python.org> wrote:
On Mon, Sep 22, 2014 at 11:48 AM, <random832@fastmail.us> wrote:
On Sun, Sep 21, 2014, at 01:18, David Wilson wrote:
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
As I understand it, the problem with a tree, SortedDict/SortedSet, or in general any collection that relies on comparison relationships (heap, etc), is: unlike hashing (where Hashable implies an immutable hash relationship), there is no way to detect whether an object implements an immutable well-defined ordering. And that, unlike a mutable hashed object (which can AIUI only lose itself), a mutable sorted object (or a badly-behaved one like NaN) can cause other objects in the set to be inserted in the wrong place or not found.
A good point, too often overlooked.
We could introduce a convention requiring __hash__() even though it is not used by the implementation. After all, if the object is immutable when it comes to a well-defined ordering, it is immutable when it comes to equality. I don't really want to think about classes that are immutable when it comes to == but not when it comes to <; that seems terrible.
I went through this and other API issues in http://stupidpythonideas.blogspot.com/2013/07/sorted-collections-in-stdlib.h... In addition to the mutability issue, there's also nothing stopping you from adding elements that don't have even a total weak order. You will sometimes get an exception, but often get away with it even though you shouldn't. Most of the existing implementations seem to take a "consenting users" approach: they document that only immutable objects with a total (strong) order are supported, but don't attempt to check or enforce that, and I didn't find a single bug report or StackOverflow question or rant blog complaining about that. Obviously the bar would be raised a bit for a collection in, or even recommended by, the stdlib, but I think this might still be acceptable. (Notice that even statically typed languages like C++ and Swift only validate that the objects are less-than-comparable to the same type, not that they're comparable to all possible values of that type.)
I think the consenting users model is fine. Again, it's not so different for __eq__ and __hash__: you can define a hash on a mutable object and get into a big mess. (However, there are limits to the mess you can get yourself into: the dict implementation shouldn't crash or read or write memory locations it shouldn't.) On Mon, Sep 22, 2014 at 3:20 PM, Andrew Barnert < abarnert@yahoo.com.dmarc.invalid> wrote:
On Sep 22, 2014, at 11:54, Guido van Rossum <guido@python.org> wrote:
On Mon, Sep 22, 2014 at 11:48 AM, <random832@fastmail.us> wrote:
On Sun, Sep 21, 2014, at 01:18, David Wilson wrote:
Coming from this perspective, I'd prefer that further additions were limited to clean and far better understood structures. In this light, could we perhaps instead discuss the merits of a collections.Tree, collections.SortedDict or similar?
As I understand it, the problem with a tree, SortedDict/SortedSet, or in general any collection that relies on comparison relationships (heap, etc), is: unlike hashing (where Hashable implies an immutable hash relationship), there is no way to detect whether an object implements an immutable well-defined ordering. And that, unlike a mutable hashed object (which can AIUI only lose itself), a mutable sorted object (or a badly-behaved one like NaN) can cause other objects in the set to be inserted in the wrong place or not found.
A good point, too often overlooked.
We could introduce a convention requiring __hash__() even though it is not used by the implementation. After all, if the object is immutable when it comes to a well-defined ordering, it is immutable when it comes to equality. I don't really want to think about classes that are immutable when it comes to == but not when it comes to <; that seems terrible.
I went through this and other API issues in
http://stupidpythonideas.blogspot.com/2013/07/sorted-collections-in-stdlib.h...
In addition to the mutability issue, there's also nothing stopping you from adding elements that don't have even a total weak order. You will sometimes get an exception, but often get away with it even though you shouldn't.
Most of the existing implementations seem to take a "consenting users" approach: they document that only immutable objects with a total (strong) order are supported, but don't attempt to check or enforce that, and I didn't find a single bug report or StackOverflow question or rant blog complaining about that.
Obviously the bar would be raised a bit for a collection in, or even recommended by, the stdlib, but I think this might still be acceptable. (Notice that even statically typed languages like C++ and Swift only validate that the objects are less-than-comparable to the same type, not that they're comparable to all possible values of that type.)
-- --Guido van Rossum (python.org/~guido)
On Sat, Sep 20, 2014 at 9:37 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I have not used blist, but I have no objection to it becoming a collections type if the author agrees.
While I would be delighted to see blist in the collections package, I worry that it would create a burden on other implementations of Python. I did make a prototype of blist in Python before I wrote the C implementation, but the performance of the Python version was abysmal in practice.
which makes blist an optimization, if and when it is available.
If someone uses blist, it usually means they're relying on the better asymptotic performance for certain operations, and the code will be unacceptably slow when using list. -- Daniel Stutzbach
On 20/09/2014 16:18, Ram Rachum wrote:
Hi everybody,
In 2007, this PEP was created that suggested integrating Daniel Stutzbach's blist into Python: http://legacy.python.org/dev/peps/pep-3128/
The PEP was rejected, but Raymond Hettinger made a note that "after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module."
I realize that way more than a few months have passed, but I'd still like to give my input. I really wish that `blist` would be made available in the `collections` module. I'm working on an open-source project right now that needs to use it and I'm really reluctant to include `blist` as a dependency, given that it would basically me an my package wouldn't be pip-installable on Windows machines.
Thanks, Ram.
I think we need one PEP that says "let's include everything from pypi in the stdlib". Pros - saves writing and reiewing lots of PEPs. Cons - maintainance of Python is made fractionally more difficult. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
participants (13)
-
Andrew Barnert
-
Antoine Pitrou
-
Daniel Stutzbach
-
David Wilson
-
Grant Jenks
-
Guido van Rossum
-
Joshua Landau
-
Mark Lawrence
-
Nick Coghlan
-
Ram Rachum
-
random832@fastmail.us
-
Stefan Drees
-
Steven D'Aprano