Adding a built-in data structure with binary search tree semantics
In other languages, there are built-in data structures with binary search tree semantics, as an example, std::map in C++.
By binary search tree semantics we mean the following:
* We can insert elements into the structure in time faster than O(n), i.e. O(logn) for std::map
* We can iterate over all of the elements of the structure in sorted order in linear time, i.e. O(n) for std::map
Is there currently any data structure in Python with binary search tree semantics?
To illustrate the need, can the following program be written elegantly and efficiently with built-in Python data structures?
Imagine that you are a fund manager at a bank. Bank accounts are added to your fund very frequently. A bank account has an ID and a value. A smaller ID means that the bank account was created earlier than one with a larger ID. Each time that an account is added to your fund, you would like to know the ID number of the 1000th oldest account, for your records.
Here are implementations in Python 3 and C++, for comparison:
```python3
from random import randint
dist = lambda: randint(0, 2147483647)
def main():
my_map = {}
# fills a map with 1000000 random mappings of type (int, int)
for i in range(0, 1000000):
my_map[dist()] = dist()
# prints out the 1000th smallest key and its value
target_key = nth_smallest_key(1000, my_map)
print("({}: {})".format(target_key, my_map[target_key]))
# writes a new random mapping to the map
# then prints out the 1000th smallest key and its value if that key
# has changed
# 100000 times
for i in range(100000):
my_map[dist()] = dist()
test_key = nth_smallest_key(1000, my_map)
if target_key != test_key:
target_key = test_key
print("({}: {})".format(target_key, my_map[target_key]))
# print an indicator every 100 iterations for comparison
if i % 100 == 0:
print("iteration: {}".format(i))
def nth_smallest_key(n, m):
return sorted(m.keys())[n]
if __name__ == "__main__":
main()
```
```c++
#include <cstdio>
#include <map>
#include <random>
using namespace std;
int nth_smallest_key(int n, map
On Sat, 14 Mar 2020 20:35:08 -0000 1njection1@gmail.com wrote:
In other languages, there are built-in data structures with binary search tree semantics, as an example, std::map in C++.
By binary search tree semantics we mean the following:
* We can insert elements into the structure in time faster than O(n), i.e. O(logn) for std::map
* We can iterate over all of the elements of the structure in sorted order in linear time, i.e. O(n) for std::map
Is there currently any data structure in Python with binary search tree semantics?
To illustrate the need, can the following program be written elegantly and efficiently with built-in Python data structures?
Imagine that you are a fund manager at a bank. Bank accounts are added to your fund very frequently. A bank account has an ID and a value. A smaller ID means that the bank account was created earlier than one with a larger ID. Each time that an account is added to your fund, you would like to know the ID number of the 1000th oldest account, for your records.
Have you considered Python's heapq module? It fails to meet your second semantic (iterating in linear time), but should work pretty well for your stated use case (finding the 1000th oldest element in a heap). Also, if you're talking about bank accounts, ISTM that the persistent, secure storage mechanisms would swamp the time for finding the 1000th oldest account. Or are all of your account data only stored in memory? Dan -- “Atoms are not things.” – Werner Heisenberg Dan Sommers, http://www.tombstonezero.net/dan
In other languages, there are built-in data structures with binary search tree semantics, as an example, std::map in C++.
not being a C++ guy, I had no idea that std::map was a tree -- I always figured it was a hashtable, but you learn something new every day. As far as I know, there is no built in data structure that is ready to go, but some tools are there. Before I go further, the obvious think to do is see if there's a third party package that does what you want -- there are a lot of hits for "binary tree" on PyPi -- no idea if any of them are robust, mature, maintained, and meet your needs, but do go look.
* We can insert elements into the structure in time faster than O(n), i.e. O(logn) for std::map
* We can iterate over all of the elements of the structure in sorted order in linear time, i.e. O(n) for std::map
Is there currently any data structure in Python with binary search tree semantics?
To be pedantic, I don't think you are looking for binary search semantics, but rather, binary tree performance :-) This version is taking advantage of the fact that dicts now preserve order. So what it does is keep the keys in order, but inserting items in the right place. And it used the stdlib bisect module to work with the keys efficiently. In theory, it should have: O(logN) insert O(logN) retrieval on the nth key, value, item O(1) retrieval of a particular key However - The constant for insertion are huge: it has to rebuild the dict on every insertion - The constant for retrieval by order is kinda big -- it has to make a list out of the keys. And this implementation has all sorts of possible optimizations that I haven't bother with yet. A big one is that it might make sense to keep the sorted list of the keys around, as you can't use bisect() on the dict.keys() iterator. Which makes me think that we should dump the dict altogether, and simply use two lists: on of keys, and one of values. keep the keys list sorted, and you can use bisect to insert and retrieve and there you go. I've tried this code on your example, and it appears to work, but it is painfully slow. However, your code mingles creation and retrieval, and generating random numbers, etc, so it's a bit hard to tease out where the performance hit is. I would say that you want to think about what performance you care about: if you are going to be adding things much less often that retrieving them, and you want good "get this specific key" performance, then this approach (optimized) might be OK. Note that this code has a few tests in it -- if you run it with pytest: pytest sorted_dict.py the tests will run. They are not the least bit comprehensive, but enough to play around with optimizations and no you haven't broken anything too badly. Hmm -- maybe I'll go try a version simply with lists now... Also, if you're talking about bank accounts, ISTM that the persistent,
secure storage mechanisms would swamp the time for finding the 1000th oldest account. Or are all of your account data only stored in memory?
Yes, this example would probably be in a database, but I'm going to assume it's a motivating example, and the general case of wanting a data structure that's efficient for this kind of data is reasonable. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Sun, Mar 15, 2020 at 12:28 PM Alex Hall
How about http://www.grantjenks.com/docs/sortedcontainers/sorteddict.html ?
For my part, I didn't look for that, as I was having fun playing around with the idea. But yes, that looks like it would fit the bill. And at a glance, they were smarter than me :-) -- no need to keep the underlying dict in sorted order, just find the key in the list, then use the regular O(1) dict access. But since I'm having fun, enclosed is an (incomplete) implementation of a a SortedMap. This one keeps a list of keys and values in sorted order, then can access things in O(logN). Insert is O(N), as it's using a regular old list internally. But that is all in C, and not too bad in practice. With the OP's example, it's slower than the basic dict method for creating, but much faster for finding items. (not formally timed) But back the python_ideas topic: It might be good to have a set of real, tree-based data structures -- they are the best way to go for some use cases, and are non-trivial to implement well. (and yes, starting out at a third party package, maybe one that's already out there, would be the way to go) -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Mon, Mar 16, 2020 at 7:14 AM Christopher Barker
But back the python_ideas topic:
It might be good to have a set of real, tree-based data structures -- they are the best way to go for some use cases, and are non-trivial to implement well.
(and yes, starting out at a third party package, maybe one that's already out there, would be the way to go)
Recommendation: If anything is added to the standard library, it should be named according to its use-cases, not its implementation. Python has a "dictionary", not a "hashtable". You might think *today* that a binary tree is the only way you'd ever want this to be implemented, but in twenty years' time, maybe there'll be a different data structure with equivalent semantics and better performance. Don't do what SourceMod did, and bake the name "trie" into its API... https://sm.alliedmods.net/new-api/adt_trie/CreateTrie
The word "Trie" in this API is historical. As of SourceMod 1.6, tries have been internally replaced with hash tables, which have O(1) insertion time instead of O(n).
:) ChrisA
On Sun, Mar 15, 2020 at 1:20 PM Chris Angelico
Recommendation: If anything is added to the standard library, it should be named according to its use-cases, not its implementation.
Agreed. Which is why I named my toy SortedMap :-) However, what if we made another, e.g. Mapping, with the only difference being different performance characteristics? Not sure what the appropriate name would be. -CHB Python has a "dictionary", not a "hashtable". You might think *today*
that a binary tree is the only way you'd ever want this to be implemented, but in twenty years' time, maybe there'll be a different data structure with equivalent semantics and better performance.
Don't do what SourceMod did, and bake the name "trie" into its API...
https://sm.alliedmods.net/new-api/adt_trie/CreateTrie
The word "Trie" in this API is historical. As of SourceMod 1.6, tries have been internally replaced with hash tables, which have O(1) insertion time instead of O(n).
:)
ChrisA _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/6UA2KV... Code of Conduct: http://python.org/psf/codeofconduct/
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 2020-03-15 5:19 p.m., Chris Angelico wrote:
On Mon, Mar 16, 2020 at 7:14 AM Christopher Barker
wrote: But back the python_ideas topic:
It might be good to have a set of real, tree-based data structures -- they are the best way to go for some use cases, and are non-trivial to implement well.
(and yes, starting out at a third party package, maybe one that's already out there, would be the way to go)
Recommendation: If anything is added to the standard library, it should be named according to its use-cases, not its implementation. Python has a "dictionary", not a "hashtable". You might think *today* that a binary tree is the only way you'd ever want this to be implemented, but in twenty years' time, maybe there'll be a different data structure with equivalent semantics and better performance.
Don't do what SourceMod did, and bake the name "trie" into its API...
https://sm.alliedmods.net/new-api/adt_trie/CreateTrie
The word "Trie" in this API is historical. As of SourceMod 1.6, tries have been internally replaced with hash tables, which have O(1) insertion time instead of O(n).
RangedDict? (range-restricted views on binary trees are super efficient.)
:)
ChrisA _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/6UA2KV... Code of Conduct: http://python.org/psf/codeofconduct/
On Sun, 15 Mar 2020 at 20:27, Alex Hall
How about http://www.grantjenks.com/docs/sortedcontainers/sorteddict.html ?
There's also Sorted Containers, that seems to be a mature project: http://www.grantjenks.com/docs/sortedcontainers/index.html
On Mar 15, 2020, at 06:25, 1njection1@gmail.com wrote:
In other languages, there are built-in data structures with binary search tree semantics, as an example, std::map in C++.
There have been multiple past discussions on this, which you should search for and read, but I can summarize what I remember. Personally, I think it would probably be worth adding the SortedDict and maybe SortedSet types from Grant Jenks’ Sorted Containers library as collections.sorteddict, and matching ABCs to collections.abc, and linking to any popular alternatives in the docs. If not, then the collections docs should at least describe what sorted containers are for, and link to Sorted Containers and any other popular third-party alternatives, and the ABCs should be added anyway. Arguments below.
By binary search tree semantics we mean the following:
* We can insert elements into the structure in time faster than O(n), i.e. O(logn) for std::map * We can iterate over all of the elements of the structure in sorted order in linear time, i.e. O(n) for std::map
Do you really want to require “binary”? Binary trees are really inefficient in space, cache locality, and allocation costs. And memmove is really fast. So in most uses at small sizes it’s better to array-bisect for search, even if it means you have to shift for insert. And for large sizes, the same thing is true at every level, not just the leaves. So you really want something with branching somewhere between 16 and 8192, not 2. Again, this is just “most uses”, but it really is “most”. If you make it wide enough, you also may want to special-case the two lowest levels, because then you’re dealing with chunks on the order of a cache line and on the order of a VM page, and those are special enough to deserve special treatment. At which point it may even be worth simplifying things to, say, three fixed levels (which means huge collections do approach linear time—but if “huge” means “much larger than a billion elements”, and most people don’t have enough RAM for much more than a billion elements of a few dozen bytes each…). And the same applies to building a skiplist or any other logarithmic search structure. If you look at the documentation for blist and SortedContainers, they both put actual numbers to these points. Another useful thing about search semantics is that you can find all the elements between a min and max in logarithmic time. That is, if you want M contiguous elements out of N, you can do it in O(logN + M) time, instead of doing M searches for O(logN * M) time. But what should the interface for that be? And there are a few other API questions—for example, your own example would clearly benefit (both in readability and in performance) from having a way to jump to the Nth element without having to iterate the first N, but what should that look like? And that raises the larger question of whether there should be a SearchMap or SortedMap or similar ABC so that different alternatives can all provide the same thing. Anyway, here’s the history as I remember it: Back in the days when recipes on ActiveState were more important than libraries on PyPI, the docs linked to Raymond Hettinger’s simple sortedlist (which is just a list that sorts at construction, resorts on each insert, and bisects for search—great if you never need to modify it after building it, but otherwise not so much), and that linked to a bunch of other recipes. People built new tree libraries for years, and most of them got abandoned, even if there was some uptake. A lot of major projects versioned in and forked one library or another instead of relying on anything being kept up to date. Proposals to add something to the stdlib always failed quickly based on the argument that if nobody can even maintain a popular third-party lib to work through three versions of Python… One library, blist (a wide hybrid b-tree-like design) got somewhat popular, and its author proposed it to be added to the stdlib, offering to maintain it. But he wanted to also replace list and str with ropes built on the same foundation, which nobody else wanted. Also, nobody stepped up to built a pure-Python implementation of the same design so that Jython could provide the same types with similar performance characteristics. So, it failed. A few years later, in one of the periodic discussions of adding sorted containers to the stdlib, there was some consensus that maybe Python could just add an ABC so that various third-party libraries could converge on a single API (largely based on blist’s). There’d then be more chance of one of them becoming a category killer: even if it’s only ideal for 90% of use cases instead of 100% (it’s not impossible to come up with a case where a red-black tree wins if you really want to, but I think the real sticking point is people who want to wrap up a Java container in Jython, etc.), if the other designs are all available and work as drop-in replacement libraries, that’s fine. When that fizzled out too, Grant Jenks went off and built a new library, Sorted Containers, that implemented the consensus API (largely based on blist’s). After a few years of widespread use in the field, Grant revised the API to be “more Python-3-ish”, and l believe it’s been pretty stable since. become the most popular one. Some alternatives have changed their APIs to match, but even more have just deprecated their libraries in favor of using Sorted Containers. And I don’t think there are any new libraries that don’t either have an incomplete API or copy the Sorted Containers one. So, I think Grant’s Sorted Containers is a good-enough category killer to be worth adding today, even though it wasn’t in 2014. If not, at least the API is stable enough to be worth adding as an ABC. I believe he’s kept the comparisons section of his docs reasonably up to date as well, so if people are concerned that it’s only the right choice 90% of the time, the docs to link to third-party options for the other 10% should be easy to write up. But of course Grant Jenks is the person who really knows whether Sorted Containers, or at least its API, is “done” enough to move into the stdlib to die. If not, then I think the collections docs should at least link to Sorted Containers and any other relevant libraries.
On Sun, Mar 15, 2020 at 9:41 PM Andrew Barnert via Python-ideas
Do you really want to require “binary”?
I don't think so; they never talked about binary trees, only "binary search tree semantics." It could alternately be called autobalanced tree semantics or something.
Sorted Containers, that implemented the consensus API (largely based on blist’s).
One feature blist has that sortedcontainers seems to lack is O(1) copies. It would be nice to have that in a standard library. But maybe I shouldn't press for it as it might impose a significant constant-factor overhead on other operations. blist unfortunately seems to be at least slightly bitrotted at this point: iterating over a sortedset fails because of a method that assumes that StopIteration will escape from a generator. pyrsistent has O(1) copying and is maintained but it's terribly slow and the interface is nonstandard (Haskellish instead of Pythonic).
On Mar 16, 2020, at 00:13, Ben Rudiak-Gould
wrote: On Sun, Mar 15, 2020 at 9:41 PM Andrew Barnert via Python-ideas
wrote: Do you really want to require “binary”?
I don't think so; they never talked about binary trees, only "binary search tree semantics." It could alternately be called autobalanced tree semantics or something.
Why not just leave “tree” out of it? All logarithmic search containers are technically equivalent to some kind of tree, but try actually describing a skiplist’s operations in terms of how the equivalent tree is rebalanced. It’s a lot easier to just show that the skiplist’s operations are logarithmic in its own terms.
Sorted Containers, that implemented the consensus API (largely based on blist’s).
One feature blist has that sortedcontainers seems to lack is O(1) copies. It would be nice to have that in a standard library.
You don’t get O(1) copies for list, or dict, so why would you expect them for sorteddict? People who want data-sharing vectors go grab numpy instead of using the builtin list type. Even if the extra indirection overhead turns out not to be an issue, just from the added complexity (to every possible implementation) it seems like it would be a bad idea to make that a requirement.
pyrsistent has O(1) copying and is maintained but it's terribly slow and the interface is nonstandard (Haskellish instead of Pythonic).
It’s also neither sorted nor mutable, which means it’s pretty far from what you’re looking for here. (By the way, Python already has a similar HAMT implementation, and it may be exposed for user code in 3.9; see PEP 603.)
On Mon, Mar 16, 2020 at 12:57 AM Andrew Barnert
Even if the extra indirection overhead turns out not to be an issue, just from the added complexity (to every possible implementation) it seems like it would be a bad idea to make that a requirement.
The only change needed to support O(1) copy is making modified copies of nodes instead of mutating them in place. The algorithms are otherwise the same. That said, it's possible that blist does something that's faster but more complicated. I haven't looked.
pyrsistent has O(1) copying and is maintained but it's terribly slow and the interface is nonstandard (Haskellish instead of Pythonic).
It’s also neither sorted nor mutable, which means it’s pretty far from what you’re looking for here.
The distinction between mutability and immutability is an illusion. You can convert a blist-style data structure into a pyrsistent-style data structure with a wrapper that takes an O(1) copy of its internal blist object, mutates it, wraps it, and returns it. You can convert a pyrsistent-style data structure into a blist-style data structure with a wrapper that maintains an internal pyrsistent object and overwrites it with the modified copy after each operation. The complexity of the operations is unaffected and the underlying implementations can be identical. The key to this equivalence is the O(1) copying of the mutable version. Pyrsistent could switch to a blist backend but not to a sortedcontainers backend unless sortedcontainers added that operation. You're right that blist and pyrsistent don't have the same functionality as they stand. Pyrsistent's PMap and PSet are sorted but don't allow duplicates. The equivalent (modulo interface translation) of blist.sortedlist would be pyrsistent.PMultiSet.
Note that the OP used the word "semantics", but primarily meant "performance characteristics": * We can insert elements into the structure in time faster than O(n), i.e.
O(logn) for std::map * We can iterate over all of the elements of the structure in sorted order in linear time, i.e. O(n) for std::map
Though there is a little bit of that could be implicitly considered
semantics: "iterate over all of the elements of the structure in sorted
order".
We can solve that with a Mapping that guarantees sorted order, like the
SortedContainers do.
However, the OP also had this little function in the CPP version.
int nth_smallest_key(int n, map
On Mar 16, 2020, at 09:53, Christopher Barker
Note that the OP used the word "semantics", but primarily meant "performance characteristics":
IIRC, the C++ specification includes guaranteed performance characteristics as part of its library semantics, so an implementation whose std::map::find takes more than log N time isn’t conforming to C++ semantics. And I don’t think that’s too unreasonable. For example, Alex Martelli’s sorted dict based on a list that calls sort after each mutation is a sorted dict if you ignore performance characteristics, but I don’t think anyone would consider a general-purpose sorteddict type that has O(NlogN) rather than O(logN) insert cost usable. (For special cases where you rarely or never mutate after construction, on the other hand, it’s very usable, which is why it’s been a popular recipe for decades…)
Meanwhile, one really important bit of non-performance-related semantics that wasn’t mentioned: dict (like C++ std::unordered_map) works with any hashable keys; a sorteddict (like C++ std::map) doesn’t care about hashability but requires ordered keys.
In C++, that’s a type issue: std::map can only be instantiated with a key type that provides a strict weak ordering and an equivalence relation, which means that map
However, the OP also had this little function in the CPP version.
int nth_smallest_key(int n, map
& m) { map ::iterator curr = m.begin(); for (int i = 0; i < n; i++) { curr = next(curr); } return curr->first; } Which is taking advantage of the fact that the map iterates in sorted order.
But if you have something in sorted order, then iterating over it to get the nth object is a lot less efficient that it could be. If this is a common use-case (is it??) then it would be nice to provide an API for it more directly.
Which, it turns out, SortedDict does, in fact, do:
Yes, and it also includes things like key-slicing (“get the values for all keys a<=key
On Mon, Mar 16, 2020 at 11:36 AM Andrew Barnert
Note that the OP used the word "semantics", but primarily meant "performance characteristics":
IIRC, the C++ specification includes guaranteed performance characteristics as part of its library semantics,
Now we are getting into the semantics of the word "semantics" -- beware the RecursionError!
so an implementation whose std::map::find takes more than log N time isn’t conforming to C++ semantics.
I would argue that it's not conforming to the specification, even if it does conform to the semantics :-)
And I don’t think that’s too unreasonable.
Nor do I -- a specification should include something about performance characteristics, and we certainly wouldn't want to put anything in the standard library that didn't have reasonable performance for the intended use cases. After all, you CAN just use the existing ones and call sorted() all over the place :-) Anyway, my only point was that we DO want to consider both performance and semantics here -- a Sorted Dict (or SortedList, or ...) has different use cases, and thus should have a different semantics (API). I note that the SortedList, for instance, is not quite a MutableSequence, as it doesn't make sense to support .append() and .extend() for that. Not sure if there is anything about a SortedDict that makes it not a MutableMapping, though. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Mar 16, 2020, at 11:52, Christopher Barker
On Mon, Mar 16, 2020 at 11:36 AM Andrew Barnert
wrote: Anyway, my only point was that we DO want to consider both performance and semantics here -- a Sorted Dict (or SortedList, or ...) has different use cases, and thus should have a different semantics (API).
I note that the SortedList, for instance, is not quite a MutableSequence, as it doesn't make sense to support .append() and .extend() for that.
Yes, everyone who designs a sorted container library runs into the fact that a SortedList is a Sequence, and mutable, but not even close to a MutableSequence—it’s not just append and extend; you can’t even do __setitem__. In fact, it’s a lot closer to a MutableSet (assuming you use the obvious names add and update), but it’s not that either, because it’s a multiset rather than a set. I believe SortedContainers 2.x’s SortedList inherits MutableSequence anyway, and raises NotImplementedError on __setitem__, append, etc. Which is a weird compromise, sometimes confusing but sometimes handy. I think the best solution is to just not have a SortedList. C++, Java, etc. don’t provide anything like that, only sorted dicts, sets, and multisets. (C# does have a SortedList, but it doesn’t act like a list, it acts like an ItemsView, in Python terms.) And it’s rare that you need something that’s sorted but also directly indexable; when you do, the answer in those languages is to write a simple wrapper around a multiset. For an all-encompassing third-party library, maybe it’s worth trying to do more, but for the stdlib, just leaving out sorted lists and multisets and maybe even sets seems like the easiest answer. 90% of the use cases for SortedContainers and its competitors are dicts. Every thread suggesting to add something similar over the decades, the use cases are all dicts. We can just add dicts instead of trying to cover everything anyone might want. And we can, and should, link to SortedContainers in the docs for people who want more and are willing to read what the compromises are.
Not sure if there is anything about a SortedDict that makes it not a MutableMapping, though.
There’s no problem there. IIRC, the original implementation of SortedContainers inherited from collections.abc.MutableMapping, and the new implementation just inherits dict (because it’s not a tree of items, it’s a tree of keys on top of a plain old dict). Which also means it requires keys that are _both_ hashable and weak-ordered…
On Mon, Mar 16, 2020 at 2:22 PM Andrew Barnert
Yes, everyone who designs a sorted container library runs into the fact that a SortedList is a Sequence, and mutable, but not even close to a MutableSequence—it’s not just append and extend; you can’t even do __setitem__. In fact, it’s a lot closer to a MutableSet (assuming you use the obvious names add and update), but it’s not that either, because it’s a multiset rather than a set.
I believe SortedContainers 2.x’s SortedList inherits MutableSequence anyway, and raises NotImplementedError on __setitem__, append, etc. Which is a weird compromise, sometimes confusing but sometimes handy.
There's a gitHUb issue about that -- not sure how it will be resolved. Personally a bigger fan of EAFTP Duck Typing than ABCs and isinstance(), but if you are going to inherit form an ABC, it should be correct :-) Otherwise it really kinda kills the point of any kind of type checking (dynamic or static), no?
I think the best solution is to just not have a SortedList.
Interesting -- I wonder if the SortedContainers folks have any idea how often it's used. 90% of the use cases for SortedContainers and its competitors are dicts.
No idea how to know for sure, but it's a lot easier ta add one, and wait to see if anyone clammers for more, than to add a bunch that are rarely used., and have to maintain or deprecate them. So it seems we have a pretty simple proposal here: Add SortedDict, much like implemented in the SortedContainers package, to the standard library's collections module. Now we just need someone to champion it. Oddly, I don't think the OP has responded at all to this thread :-) -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Mon, 16 Mar 2020 at 22:22, Andrew Barnert via Python-ideas
I think the best solution is to just not have a SortedList. C++, Java, etc. don’t provide anything like that
Guava has TreeMultiset: https://guava.dev/releases/snapshot/api/docs/com/google/common/collect/TreeM... The problem is: is there a use case? I mean, how much time you need to sort a list, add an element and re-sort it?
On Mar 19, 2020, at 11:29, Marco Sulla
On Mon, 16 Mar 2020 at 22:22, Andrew Barnert via Python-ideas
wrote: I think the best solution is to just not have a SortedList. C++, Java, etc. don’t provide anything like that
Guava has TreeMultiset: https://guava.dev/releases/snapshot/api/docs/com/google/common/collect/TreeM...
Sure. And C++ has a sorted multiset too. But std::multiset’s API is similar to set, and of course to the hash-based unordered_multiset; it’s not similar to list or vector (beyond the basic iterability). And that’s the point: a “sorted list” isn’t really like a list, and most languages don’t try to force it. A “sorted multiset” is a more sensible thing, and really is like an (unordered) multiset, and provides the behavior people want from a flat sorted collection that allows duplicates—but a language that doesn’t even bother with unordered multisets* like Python surely doesn’t need to add a sorted multiset just for completeness.
The problem is: is there a use case? I mean, how much time you need to sort a list, add an element and re-sort it?
And even if you do, how often do you need to do list-y things like slice it from the 10th to 19th indices or concatenate two of them, much less whatever it should mean to replace the value at a specific index? The SortedList idea is a mirage. At first glance it seems like an obviously useful parallel to SortedDict and SortedSet in the same way list parallels dict and set, but as soon as you try to think through how it should act and what you’d ever do with it, you realize that’s an illusion, and it’s actually a clunky design with no use case. * Yes, there is collections.Counter. But that doesn’t even provide an analog to the MutableSet API—there’s no add, etc. It’s a dict whose values are counts, which can be used (among other things) to simulate a multiset. If people want a collections.SortedCounter built on top of sorteddict instead of dict, that should be trivial to add.
On Thu, Mar 19, 2020 at 07:28:56PM +0100, Marco Sulla wrote:
On Mon, 16 Mar 2020 at 22:22, Andrew Barnert via Python-ideas
wrote: I think the best solution is to just not have a SortedList. C++, Java, etc. don’t provide anything like that
Guava has TreeMultiset: https://guava.dev/releases/snapshot/api/docs/com/google/common/collect/TreeM...
The problem is: is there a use case? I mean, how much time you need to sort a list, add an element and re-sort it?
Depends on whether you are adding and re-sorting *one* element, or if you have to process a hundred million elements, re-sorting after each one. -- Steven
On Fri, 20 Mar 2020 at 05:06, Steven D'Aprano
On Thu, Mar 19, 2020 at 07:28:56PM +0100, Marco Sulla wrote:
The problem is: is there a use case? I mean, how much time you need to sort a list, add an element and re-sort it? Depends on whether you are adding and re-sorting *one* element, or if you have to process a hundred million elements, re-sorting after each one.
Yes of course. But does it happen in the real life? Is there a concrete use-case?
On 16 Mar 2020, at 14:43, Ben Rudiak-Gould
wrote: On Mon, Mar 16, 2020 at 12:57 AM Andrew Barnert
wrote: Even if the extra indirection overhead turns out not to be an issue, just from the added complexity (to every possible implementation) it seems like it would be a bad idea to make that a requirement.
The only change needed to support O(1) copy is making modified copies of nodes instead of mutating them in place. The algorithms are otherwise the same.
So when I insert into the copy how do you avoid all the nodes being modified? Barry
That said, it's possible that blist does something that's faster but more complicated. I haven't looked.
pyrsistent has O(1) copying and is maintained but it's terribly slow and the interface is nonstandard (Haskellish instead of Pythonic).
It’s also neither sorted nor mutable, which means it’s pretty far from what you’re looking for here.
The distinction between mutability and immutability is an illusion. You can convert a blist-style data structure into a pyrsistent-style data structure with a wrapper that takes an O(1) copy of its internal blist object, mutates it, wraps it, and returns it. You can convert a pyrsistent-style data structure into a blist-style data structure with a wrapper that maintains an internal pyrsistent object and overwrites it with the modified copy after each operation. The complexity of the operations is unaffected and the underlying implementations can be identical.
The key to this equivalence is the O(1) copying of the mutable version. Pyrsistent could switch to a blist backend but not to a sortedcontainers backend unless sortedcontainers added that operation.
You're right that blist and pyrsistent don't have the same functionality as they stand. Pyrsistent's PMap and PSet are sorted but don't allow duplicates. The equivalent (modulo interface translation) of blist.sortedlist would be pyrsistent.PMultiSet. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FYK4IJ... Code of Conduct: http://python.org/psf/codeofconduct/
On Mar 16, 2020, at 10:39, Barry
On 16 Mar 2020, at 14:43, Ben Rudiak-Gould
wrote: On Mon, Mar 16, 2020 at 12:57 AM Andrew Barnert
wrote: Even if the extra indirection overhead turns out not to be an issue, just from the added complexity (to every possible implementation) it seems like it would be a bad idea to make that a requirement. The only change needed to support O(1) copy is making modified copies of nodes instead of mutating them in place. The algorithms are otherwise the same.
So when I insert into the copy how do you avoid all the nodes being modified?
You copy the node being inserted and all logN nodes on the path from there back to the root. Which doesn’t affect the algorithmic complexity, since insert is already amortized log N anyway. In practice this deep copy-on-write can be substantially more expensive, but depending on your usage patterns there are also cases where it can be a lot cheaper. (Most extremely, imagine you make a ton of unnecessary copies and never mutate any of them; obviously the CoW wins…) I don’t know why blist chose to go one way and Sorted Containers the other, but knowing the authors involved, I suspect they both thought about it and had good arguments for it with benchmarks and everything, so I don’t think it’s worth trying to re-argue from first principles without someone looking first. At any rate, I still stand by the point that nobody is going to expect O(1) copies for sorteddict when list, dict, etc. don’t do it.
participants (11)
-
1njection1@gmail.com
-
Alex Hall
-
Andrew Barnert
-
Barry
-
Ben Rudiak-Gould
-
Chris Angelico
-
Christopher Barker
-
Dan Sommers
-
Marco Sulla
-
Soni L.
-
Steven D'Aprano