Having Sorted Containers in stdlib?

Hi All, This is a modest proposal to consider having sorted containers (http://www.grantjenks.com/docs/sortedcontainers/ <http://www.grantjenks.com/docs/sortedcontainers/>) in standard library. I know that usually adding stuff to standard library requires some strong arguments, so I will try my best to give my reasons here: 1) Some mainstream language support them out of box: C++ for example have set/map which are sorted by the key order, and Java has TreeMap which is internally a Red-black tree. I understand languages might target different audiences, but I think Python’s standard library is quite extensive compared to peers. Consider we even have a sqlite driver in the stdlib, I do not think it is outrageous to have sorted containers. 2) These containers are not really easy to implement correctly, and usually is out of the scope of day-to-day projects. Especially considering we have a large audience of non-hardcore programmers in Python community. They may have the need to use these structures, but they do not necessarily have the skill/knowledge to implement it. 3) Granted, people can just pip install this library, but that is one extra step and less fraction is better for user experience. 4) These structures are very useful in competitive programming, I know at least in Leetcode this library is installed for Python. 5) The said library is of high implementation quality. I might be stupid here as I am not the maintainer of this library and I might be not even in a position to submit it to Python as part of stdlib, but here are some of my personal thoughts and would love to hear your opinion! Thanks! Bob

On Tue, 9 Nov 2021 at 16:01, Bob Fang <boyufang@bytedance.com> wrote:
As of 3.7. dicts are sorted[1], but I'm unsure if the order can be overridden. Indeed:
Boys = {'Tim': 18,'Charlie':12,'Robert':25}
Girls = {'Tiffany':22}
print(cmp(Girls, Boys))
Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'cmp' is not defined
Girls.__gt__(Boys)
NotImplemented -- H -- OpenPGP: https://hasan.d8u.us/openpgp.asc <https://t.mailpgn.com/l/?u=b8770bd0-a9ef-411d-8c87-1636e6d9b77a&fl=https%3A%2F%2Ft.mailpgn.com%2Fl%2F%3Fu%3D5a3f5883-7b16-4b11-80ad-be68170836a1%26amp%3Bfl%3Dhttps%253A%252F%252Fhasan.d8u.us%252Fopenpgp.asc> If you wish to request my time, please do so using *bit.ly/hd1AppointmentRequest <https://t.mailpgn.com/l/?u=8467f9a6-914d-4d6d-8c00-dd503e667146&fl=https%3A%2F%2Ft.mailpgn.com%2Fl%2F%3Fu%3D20c36f1d-4104-40d6-8c5e-b8bee61caaa4%26amp%3Bfl%3Dhttp%253A%252F%252Fbit.ly%252Fhd1AppointmentRequest>* . Si vous voudrais faire connnaisance, allez a *bit.ly/hd1AppointmentRequest <https://t.mailpgn.com/l/?u=be22c143-79b8-4a2b-b7f9-cb7a95a8e7e4&fl=https%3A%2F%2Ft.mailpgn.com%2Fl%2F%3Fu%3D143cb038-fda7-4317-aef6-37d17da9dfde%26amp%3Bfl%3Dhttp%253A%252F%252Fbit.ly%252Fhd1AppointmentRequest>* . <https://t.mailpgn.com/l/?u=b065018c-6b50-41c8-9881-508f9f2f30fa&fl=https%3A%2F%2Ft.mailpgn.com%2Fl%2F%3Fu%3D9c5843eb-43ed-470a-83fc-1ca514c43810%26amp%3Bfl%3Dhttps%253A%252F%252Fsks-keyservers.net%252Fpks%252Flookup%253Fop%253Dget%2526amp%253Bsearch%253D0xFEBAD7FFD041BBA1>Sent from my mobile device Envoye de mon portable 1. https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/ <https://t.mailpgn.com/l/?u=e7d71732-691a-421f-b43a-a9cdb3c2b9d0&fl=https%3A%2F%2Fsoftwaremaniacs.org%2Fblog%2F2020%2F02%2F05%2Fdicts-ordered%2F>

On Tue, 9 Nov 2021, 17:05 Bob Fang, <boyufang@bytedance.com> wrote:
But that’s in insertion order, not in key order right? I think we need data structure that are key-ordered.
According to the tests, it seems to be key-ordered. It also appears that the ordering cannot be changed (yet?). -- H

On Tue, Nov 09, 2021 at 04:23:50PM -0800, Hasan Diwan wrote:
As of 3.7. dicts are sorted[1], but I'm unsure if the order can be overridden.
Dicts are not sorted, they are kept in insertion order. >>> d = {3: 'a', 4: 'b', 1: 'c', 2: 'd'} >>> d {3: 'a', 4: 'b', 1: 'c', 2: 'd'} See the docs: https://docs.python.org/3/library/stdtypes.html#dict although you have to scroll almost all the way to then end, just before the dict view objects, to see it documented. Sorting dicts has been discussed on the Python-Ideas mailing list, it is too hard and expensive to justify for the limited use-cases for it. If you want to sort a dict, you are best to sort the dict's keys, then create a new dict. Or possibly use a dedicated Sorted Mapping type like a red-black tree or similar. Hasan wrote:
The 'cmp' built-in was removed in Python 3.0. In any case, dicts don't support order comparisons. -- Steve

On Tue, Nov 9, 2021 at 9:00 PM Steven D'Aprano <steve@pearwood.info> wrote:
There are several implementations of key-order sorted dicts available in Python, and more than a few of them are well tested. I am the maintainer of one of them: https://pypi.org/project/treap/ which I ported from Java. I also did a performance comparison among several of them a few years back. Red-black trees are popular, perhaps especially among Java developers, but I've never understood why. They aren't the fastest. Among traditional tree-based key-sorted dicts, treaps are frequently fastest. However, SortedDict is not uncommonly faster than treaps - I believe this is because SortedDict is very good at maximizing locality of reference. Traditional trees are almost always going to do an entire cache line hit for every node in large trees, even if those nodes are "next to each other" when sorted/traversed. SortedDicts put keys that are nearly equal, in nearly the same part of memory - so multiple values can be retrieved with a single cache line hit. Sorting a dict's keys inside a loop tends to give O(n^2) algorithms, and sometimes even O(n^2 * logn). This is not good. A traditional tree should give O(nlogn) algorithms under similar circumstances, and although my gut is telling me SortedDict is similar the presentation linked below suggests a different (though still better than sorting in a loop) runtime for SortedDict. It's true that key-sorted dicts are not all that common. Their most common use is probably for implementing finite caches - evictions can benefit from ordered keys. However, we already have functools.lru_cache. Here's my most recent performance comparison: https://stromberg.dnsalias.org/~strombrg/sorted-dictionary-comparison/Datast... Here's a PyCon 2016 presentation about SortedContainers, that includes SortedDict: https://www.youtube.com/watch?v=7z2Ki44Vs4E I think it would make sense to include at least one key-sorted dict in CPython, and feel that SortedDict would not be a bad way to go. Neither would treap.

Maybe a stupid question: What are use cases for sorted dicts? I don’t think I’ve ever needed one. Also, I can’t quite tell from the discussion If a “sorted dict” implements something new, or is an internal data structure that gives better performance for particular use cases. I.e. is a sorted dict a Mapping? -CHB On Tue, Nov 9, 2021 at 9:45 PM Dan Stromberg <drsalists@gmail.com> wrote:
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Wed, Nov 10, 2021 at 5:03 PM Christopher Barker <pythonchb@gmail.com> wrote:
Me neither, tbh.
Also, I can’t quite tell from the discussion If a “sorted dict” implements something new, or is an internal data structure that gives better performance for particular use cases. I.e. is a sorted dict a Mapping?
Nothing's technically new. You could make an inefficient sorted dict like this: class SortedDict(dict): def __iter__(self): return iter(sorted(self.keys())) So in that sense, yes, it's better performance for the use-case of needing the keys to be in order. It's not something I have ever really needed - on the rare occasions when I need something like that, it's usually no harder to maintain a separate sorted list or something. IMO this is a great tool to have on PyPI. Someone can start the naive way, run into major performance problems, and then go "huh, maybe someone else has already solved this problem". Doesn't need to be in the stdlib for that. ChrisA

On Wed, Nov 10, 2021 at 05:11:33PM +1100, Chris Angelico wrote:
You would need more than that. You would want to ensure that the dict's repr shows it is sorted order. You would need popitem to pop items in the appropriate order. There may be other APIs that sorted dicts provide than merely sorting the keys on iteration. You don't actually need to implement this to see how repeatedly sorting the keys would give terrible performance for anything above small sets of data. Here's Java's standard SortedMap: https://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html That should give you some idea of the interface a sorted dict would likely provide.
You may have missed that this thread has already discussed two mature, good quality sorted dict implementations that have been on PyPI for years: https://pypi.org/project/sortedcontainers/ https://pypi.org/project/treap-python/ Here are a few more: https://pypi.org/project/ruamel.ordereddict/ https://github.com/surenkov/PySortedDict https://pypi.org/project/sorteddict/ -- Steve

On Wed, Nov 10, 2021 at 5:45 PM Steven D'Aprano <steve@pearwood.info> wrote:
Sure. That was a massively oversimplified one, to point out that the question of whether this is "truly new" is fairly meaningless. A thing can be extremely useful without actually being *new* per se.
And yes. That is exactly the problem.
I know it's on PyPI, in multiple forms. I'm saying that it's a great tool to keep there, not in the stdlib. ChrisA

On Tue, Nov 09, 2021 at 10:01:35PM -0800, Christopher Barker wrote:
Good question :-)
All dicts are mappings :-) I would expect any kind of sorted dict to support the full mutable mapping interface. Some mappings are not necessarily implemented as hash tables, as dict is. Some variety of tree is a common choice. I expect that a sorted dict (however it is implemented) would be a mapping that preserves sorted order on insertions and deletions, rather than insertion order. (Or some arbitrary order.) I haven't used one myself, but I would expect an API where you create a Sorted Dict (of whatever library or implementation you prefer), set an optional key function and direction (ascending or descending), then as you insert keys, the mapping keeps the keys in sort order. Note that this is different from *sorting* a dict, which (if it were supported) has to be done explicitly. Between sorts, the dict would be capable of getting out of sorted order. The idea of a *sorted* dict is that the sort order is an invariant, rather than something that can come and go as you insert and delete items. -- Steve

On Wed, 2021-11-10 at 17:16 +1100, Steven D'Aprano wrote:
It could be handy for deterministic iteration of its values, for example to allow serialized values to be easily compared, or to generate and verify a signatures. It's edge case material; I wouldn't think it's strong enough use case for inclusion in stdlib.

On Tue, Nov 9, 2021 at 11:05 PM Paul Bryan <pbryan@anode.ca> wrote:
On Tue, Nov 09, 2021 at 10:01:35PM -0800, Christopher Barker wrote:
What are use cases for sorted dicts?
Good question :-) It could be handy for deterministic iteration of its values, for example to
allow serialized values to be easily compared, or to generate and verify a signatures.
Yup -- I've done that. But for that, it's fine to sort when you are doing the comparing / serialization. What I haven't been able to imagine is a use case for keeping a Mapping sorted constantly as you insert / remove items. For that you'd need a use case where you were making sorted queries of various sorts. I noticed, for instance, that the Java implementation posted earlier had methods like "all the items after this one" (can't remember how that was spelled). I'm sure there are use cases for this, I'm probably lacking imagination :-) -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

[Christopher Barker <pythonchb@gmail.com>]
For example, for some mappings with totally ordered keys, it can be useful to ask for the value associated with a key that's not actually there, because "close to the key is good enough". A SortedDict supports several ways of doing that efficiently, depending on what - exactly - an app means by "good enough". For example, it's easy to find _the_ closest key. Or the k closest keys. Or the two single keys <= and >=. Or ... Pick one; take some form of average of their values; use a domain-specific interpolation function on their keys & values; ... To make it concrete, suppose you're running a COVID-19 app where reports of new cases are coming in around the clock. One SortedDict may be used to map latitude to the number of confirmed cases reported. If a researcher asks for the number at a specific latitude, but no exact match is found, fine, you can still efficiently show them the results across the smallest known interval containing the latitude of interest. Or if they want the results broken into buckets of latitudes in 15-degree (N-degree - any N they ask for) intervals, also easy: enumerating all entries in a given key range also goes fast. Or ... A sorted list supports all sorts of membership queries efficiently. Under the covers, a SortedDict _is_ a SortedList, plus a regular Python dict to associate values with the list's elements.

A quick Google for "treap python github" yielded https://github.com/TheAlgorithms/Python/blob/master/data_structures/binary_t... . On Tue, 9 Nov 2021 at 21:49, Dan Stromberg <drsalists@gmail.com> wrote:
-- OpenPGP: https://hasan.d8u.us/openpgp.asc If you wish to request my time, please do so using *bit.ly/hd1AppointmentRequest <http://bit.ly/hd1AppointmentRequest>*. Si vous voudrais faire connnaisance, allez a *bit.ly/hd1AppointmentRequest <http://bit.ly/hd1AppointmentRequest>*. <https://sks-keyservers.net/pks/lookup?op=get&search=0xFEBAD7FFD041BBA1>Sent from my mobile device Envoye de mon portable

Hi Bob and welcome, Before we could even consider adding the sortedcontainers library to the standard library, we would need to hear from the maintainer(s) of the library that they agree to the move and would be able to continue maintaining the library under our release schedule and backwards compatibility guarantees. Otherwise, you would need to find a core developer willing to re-implement the containers and maintain them. I don't want to discourage you, but even if the maintainer is willing, we night not decide to add it. Every new feature, class and function adds to the weight of learning Python, and the cost of maintenance. We must balance that against the benefit, and only add features where the benefits are greater than the costs. Our decision making is usually very conservative, because we have strong requirements for backwards-compatibility. Once we add something to the stdlib, we can't easily change our mind and remove it again. So we follow the Zen of Python: >>> import this The Zen of Python, by Tim Peters [...] Now is better than never. Although never is often better than *right* now. Have you looked at the Python PEPs? If you are serious about pushing this proposal, your first step should be to read PEP 1 and then browse through the collection of successful and unsuccessful PEPs: https://www.python.org/dev/peps/pep-0001/ https://www.python.org/dev/peps/ -- Steve

Hi Steve and all, Thanks!
Totally agree. And as others mentioned in this thread, I agree the use cases for Sorted Containers are weak to make it worthwhile to be added to stdlib. I was more thinking about the completeness of the stdlib when I was proposing, but I do agree there is a cost/benefit analysis need to be done and it is great to see all the responses. As other pointed out this is mainly useful to implement some kind of cache, but since we already have lru_cache a lower level sorted dict may not be needed. Thanks for all the responses. Best, Bob

10.11.21 01:47, Bob Fang пише:
From my C++ and Java experience, hashtable-based containers are much more useful than tree-based containers. They are more compact and fast. In most cases the only reason of using sorted container is supporting deterministic iteration order, but often it is enough to sort data only for output. It is easy to do with the sorted() builtin in Python. We need some dict implementatin in Python, it is essential part of the language used to implement modules, classes and objects. And hashtable-based implementation is good for this. There is no such need of tree-based implementation. It is not needed in Python core, and is not needed for most users. So it is good to keep it in third-party libraries. Maintaining it in the stdlib has a cost and the benefit/cost value would be low.

Serhiy Storchaka wrote:
The concern is not constant speed factor but asymptotic complexity. Inserting or deleting in a tree-based sorted collection is O(log n), but lists are O(n): the bisect is O(log n) but moving trailing list elements is O(n). It's memmove fast but dominates as n gets large. There are many algorithms that are hard to implement as fast as they can be without a tree. Something like rolling median would be a good test case, e.g. "for a list N containing numbers, produce a list M of length len(N) - 1 where M[i] == statistics.median(N[:i+1])". I think the best you can do with the standard library and without writing a balanced binary tree from scratch is O(n²) (if you can beat this I'd like to know how!). With a tree-based sorted list it's O(n log n). Dicts and sets cannot maintain an arbitrary order, except that OrderedDict.move_to_end() very occasionally turns out to be all you need (and even then, I find it tricky to reason about and apply correctly in an algorithm: it requires you to maintain an invariant, where SortedList maintains the sorted invariant for you).
The situation where this bites my firm the most is in interviews. This is exactly the situation where we're asking candidates to demonstrate their CS knowledge and produce an algorithm that has low asymptotic complexity. We use various services including HackerRank, which provide a sandboxed Python environment that cannot use PyPI. There are quite a few services in this kind of Educational+Challenge space including some that execute Python in the browser using Brython etc. My team believes that candidates with a Java background find the challenge easier than candidates with a Python background, because they just use NavigableSet and move on. I find that kind of embarrassing. It requires us to apologise for Python's deficiency and fudge how we score the question, which makes it hard to know we're providing a level playing field. (I would argue that we should not ask those kinds of questions or not use these sandboxed interview environments but institutional friction makes that magnitude of change very difficult.) Putting something in the standard library means that it will, in due course, appear in HackerRank and other services we use, and we can expect candidates to be familiar with it and incorporate it in their solutions.

IMO if someone is motivated to get a new container type in Python, a PEP is required. A PEP has been written for the new removeprefix() and removesuffix() methods which are way simpler ;-) I expect many questions on corner cases for a sorted container type. Having a reference implementation, even written in Python, would be helpful. Victor On Wed, Nov 10, 2021 at 1:00 AM Bob Fang <boyufang@bytedance.com> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

[Bob Fang <boyufang@bytedance.com>]
This is a modest proposal to consider having sorted containers (http://www.grantjenks.com/docs/sortedcontainers/) in standard library.
+1 from me, but if and only if Grant Jenks (its author) wants that too. It's first-rate code in all respects, including that it's a fine example _of_ Python programming (it's not written in C - in Python). People often say "well, put a thing on PyPI first, and see whether people really want it!". SortedContainers has been on PyPI for years straight now, and still gets hundreds of thousands of downloads from there every day: https://pypistats.org/packages/sortedcontainers So it's already widely adopted in the community. What else would it need to prove? If it still doesn't qualify, "well, put a thing on PyPI first" is just obstructionism ;-)

I agree with Tim. Subject, of course, to the same caveat Tim mentions: does the creator want this? I haven't used the library much, but it's obviously top quality, and adding pure-Python code is less burden than C implementations. On Wed, Nov 10, 2021, 10:19 PM Tim Peters <tim.peters@gmail.com> wrote:

On Thu, 11 Nov 2021 at 11:51, Antoine Pitrou <antoine@python.org> wrote:
I agree as well. Is anyone interested enough to ask the library author if he supports doing this? That seems to be the main unanswered question here. But if anyone wants to argue the "the stdlib should be shrinking, not growing" position, I suggest they do so *before* someone reaches out to the module author. No point in us making the suggestion and then being forced to withdraw it. Paul

On Thu, 11 Nov 2021 12:39:50 +0000 Bob Fang <bob.fang.london@gmail.com> wrote:
Decisions are not made through voting, so that would be informative, but little more :-) I think it's better that you contact the author upfront, asking for a provisional agreement about stdlib inclusion. If he's not interested, then no need to pursue it further. If he's interested, then you can start working on a more formal proposal (for example a PEP). Regards Antoine.

I'll join Christopher Barker and Chris Angelico in voicing scepticism -- personally, I feel like I'm yet to be persuaded that the use case is strong enough. sortedcontainers is a wonderful package, and I would completely support a prominent link to the library in the Python documentation. But the use case seems too niche and specific, in my opinion, for the library to be added to the standard library. I see the standard library as a collection of basic building blocks that can be used to create a multitude of other things. sortedcontainers feel like a specialised solution to a small collection of problems, rather than a general solution to a large collection of problems. PyPI feels like the right place for this library. Best, Alex On Thu, Nov 11, 2021 at 12:33 PM Paul Moore <p.f.moore@gmail.com> wrote:

Earlier in the thread, we were pointed to multiple implementations. Is this particular one clearly the “best”[*]? If so, then sure. -CHB [*] best meaning “most appropriate for the stdlib”. A couple folks have already pointed to the quality of the code. But my understanding is that different algorithms are more or less appropriate for different use cases. So is this one fairly “universal”? On Thu, Nov 11, 2021 at 4:31 AM Paul Moore <p.f.moore@gmail.com> wrote:
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

[Christopher Barker <pythonchb@gmail.com>]
As noted earlier, SortedContainers is downloaded from PyPI hundreds of thousands of times every day, and that's been so for a long time.. Best I can tell, no similar package is in the same universe as that. The author identified over a dozen potential competitors here[1], but most didn't look like serious projects to him. He ran extensive timing tests against the rest, reported there too. HIs package usually wins, and is at worst competitive. It usually wins on memory burden too. Some of these things are people intrigued by "a data structure" they read about and implement for fun and self-education. That sometimes shines through in data-structure-specific APIs. While the SortedContainers extensive docs explain how things are implemented (if you look in the right spots for that), the API is remarkably agnostic (& useful)). For example, this is a method of SortedList: irange(minimum=None, maximum=None, inclusive=(True, True), reverse=False) which returns an iterator over a contiguous sorted range of values (neither, either, or both of whose endpoints may be included, depending on what you pass for `inclusive` This "makes sense" regardless of how the structure may be implemented. Of course SortedSet and SortedDict support `irange()` too. [1] http://www.grantjenks.com/docs/sortedcontainers/performance.html

On Thu, Nov 11, 2021 at 4:30 AM Paul Moore <p.f.moore@gmail.com> wrote:
A couple 2 cents and IMHO. ----- The tldr is not so much "don't put it in the standard lib", but that the stdlib would be better having particular implementations instead of a one-size-fits-all e.g. SortedList. Should the stdlib have e.g. SortedList? Probably not, because the use cases of such data types are too niche to a one-size-fits-all implementation, and there are too many implementations with too many of their own settings. Should it have e.g. BTree, RedBlackTree, SortedArrayList, etc? Probably so, because those are somewhat fundamental data structures and implementing, testing, and verifying them is very much non-trivial. While niche, having them at your immediate disposal is very powerful. ----- Last year, for fun, after wishing there was a SortedSet in the standard lib, I ended up implementing a Red-Black Tree and BTree based sorted dictionary/set[1]. After then trying to use them for my use case[2], I found that, in order to fully and truly exploit their benefits, the basic Sequence/Collection/Set/Dict APIs didn't really suffice. I needed APIs that would let me, e.g. binary search to a particular spot and then iterate, or give me a range between two points, etc. Such apis tend to be pretty specialized to exploit their under data structure (which isn't to say an abstract API can't be created, see Java's NavigableMap, but that level of abstraction deserves careful consideration). In the end, it was a fun exercise, but in practice a dictionary and sorted() got me 90% of the way there and sufficed. Optimizing that last 10% wasn't worth the effort. Anyways, I came to two particular, IMHO, conclusions: 1. Sorted collections are very niche. I *could* have used one, but probably would have spent as much time fiddling with them as using a dict/sorted(). 2. If you're in a niche use case, especially for performance, then abstractions aren't really helpful. Using e.g. a BTree and knowing the particulars of the implementation are going to be more helpful than having an abstract SortedList and working out how to use its special APIs for particular actions. This basically echos Christopher Baker's sentiment of "whats the use case" and "why doesn't a sorted() call at the end suffice" to some extent. IME, Python's sort algorithm is just really good, so it's hard to beat simply calling sorted() when you're done processing in both runtime performance and effort-spent-optimizing. I struggle to think of or remember a particular case where dropping in a sorted collection would be a clear and obvious win. Maybe a more memory constrained case where sorted() creating a new list is too much? IDK. [1] As an aside, RB didn't perform nearly as well as sortedcontainers for the majority of cases, but the naive BTree was about on par. IIRC, sortedcollections is basically a specialized BTree. [2] A wish that came about because I had sorted sets of exact/prefix strings of 1,000 to 10,000 elements and needed them to interact in various ways and wanted to preserve their sortedness. ----- re: "Is this particular one clearly the “best”[*]?" Performance wise, it's probably about the best insofar as a BTree of depth=2 and high fanout (10,000 or something? I forget what sortedcontainers defaults to) is. In my limited experience, it's docs and claims about its performance were pretty accurate[3]. API wise, it has its share of idiosyncrasies like anything else. e.g. SortedList claims to implement MutableSequence, but actually raises errors when calling e.g. append(), which is a bit misleading / wrong to some extent (to be fair, given MutableSequence's api contract, it's not like there's particularly clear/right/good choice one can make here). (disclaimer: i've minimal experience with sortedcontainers; when I looked at it, it was really only for comparing my own toy BTree/RBT behavior/performance with a known implementation) [3] The exception being a claim that its (supposedly) better cache locality is a significant factor. I think the underlying BTree algorithm is the dominant factor. But someone who actually knows how to measure cache locality would be a better judge of that than me, who doesn't know how.

On Thu, 11 Nov 2021 11:01:32 -0800 Richard Levasseur <richardlev@gmail.com> wrote:
So the fact that sorted collections did not help for one single use case leads you to conclude that "sorted collections are very niche"? I don't find this very convincing. Unfortunately, PyPI doesn't seem to offer a way to query the reverse dependencies of a package, otherwise we could know how many packages depend on the "sortedcontainers" package. [...]
You don't have "an abstract SortedList", you have a concrete SortedList with a well-defined implementation with stable performance characteristics. So, again, this seems to be a red herring.
This basically echos Christopher Baker's sentiment of "whats the use case" and "why doesn't a sorted() call at the end suffice" to some extent.
The use case is obviously when you have to *maintain* sorting at all times. Yes, there are such situations. Some rather high-profile Python packages, such as dask/distributed, rely on this for rather elaborate algorithmic work (dask/distributed is a distributed task scheduler, you may also think of it as a replacement for Spark). I won't go over your performance considerations, but I'll just mention that the "sortedcontainers" documentation has not one but *several* sections exploring performance characteristics in several dimensions. It is extremely rare for Python packages to provide such a detailed insight on the matter (even the built-in `dict` is much less thoroughly characterised). http://www.grantjenks.com/docs/sortedcontainers/#user-guide Regards Antoine.

Just want to mention that we do have this nice website to look at, although I am not sure how up to date it is: https://www.wheelodex.org/projects/sortedcontainers/rdepends/?page=1

On Thu, 11 Nov 2021 20:58:29 +0000 Bob Fang <bob.fang.london@gmail.com> wrote:
Just want to mention that we do have this nice website to look at, although I am not sure how up to date it is:
https://www.wheelodex.org/projects/sortedcontainers/rdepends/?page=1
Wow, thank you, that is very nice! Best regards Antoine.

On Thu, Nov 11, 2021 at 11:22 AM Antoine Pitrou <antoine@python.org> wrote:
It was just an example demonstrating that, despite all the apparent "things need to be kept sorted"-ness of the problem, the strict requirement that data be fully sorted at every step of the code wasn't actually present. That sorting at the end is quite sufficient. It just had the mere appearance of being a requirement. This example is not unlike the exchange Paul Bryan and Christopher Baker had upthread where Paul gave an example, and Christopher pointed out sorting at the end was sufficient. And no, that I didn't choose[1] to use a sorted collection for that particular case didn't make me conclude they were very niche. Reflecting on my experience coding and reviewing code over, eh, 20 years? Or w/e. lead me to realize that very rarely have I seen, needed, or recommended using a sorted collection. In my experience, the order of elements in a collection mattering is the exception, not the rule. Hence, I conclude they're niche. [1] As my example described, a sorted collection would have been able to realize some optimal performance characteristic, but eking out that last bit wasn't worth the effort.
Querying my employer's code base, the number of hits for sortedcontainers isn't particularly impressive. It's not so low to be dismissed, but it's also not so high as to be anywhere near unequivacable. Forgive me for not sharing the actual numbers; I'm just not really sure on what sort of code base statistics and numbers I'm permitted to share under what circumstances etc.
My point is that there are many ways to implement a collection that maintains its elements in sorted order that has well defined performance. One can pick one and accept the pros/cons of the particular underlying algorithm. But it seems misleading to call that *the* SortedList implementation. Hence why I say "abstract SortedList". To emphasize this point a bit, let's presume for the moment Python chose to have a data type named SortedList, and it was, effectively, the sortedcontainers implementation. What does one do with the "load factor" setting that sortedcontainers.SortedList has? This arg is, essentially, a BTree implementation detail leaking out. IIRC, it controls the fan out of the underlying BTree (because one size doesn't always fit all), which is one of the biggest determiners of the overall performance. If such a parameter is exposed, then it's no longer really a generic "sorted list", it's a "sorted list as implemented using a btree". So might as well call it e.g. BTreeList or some such instead. If such a parameter *isn't* exposed, then a one-size-fits-all value must be chosen[2] -- I'm skeptical that's really tenable because, if a sorted collection is needed, it's probably for performance reasons, and you need knobs like that to better fit the algorithm to your data and access patterns. And I guess I have to state upfront that my claim of "it's probably for performance reasons" is just what my experience leads me to believe, so it'll just have to be taken with the dosage of salt that best fits, too. [2] As an aside, that a list's growth factor isn't controllable works against things like sortedcontainers. Which isn't to say `list` should expose it, but that if, e.g. ArrayList(growth_factor) was a thing, then a more optimal implementation of e.g. BTree would be possible.
Are you able to elaborate wrt their needs to use a sorted collection? Something that would be very compelling (to me, at least) is demonstrating the utility of sorted collections where their performance characteristics isn't the only/sole/primary/dominant motivator for using one. Performance is important, but it's also kind of fuzzy and easy to argue about without a clear answer/conclusion. I hope we can all agree that there do exist situations where a sorted collection is, at the least, helpful. That they aren't inherently useless. I think the main thing being questioned is when/if they are, in effect, *necessary throughout*. That when this is true: * The data must, at all times, without exception, be and stay sorted from its inception to its conclusion Then this is also true: * The case is not niche Does "dask/distributed" meet that criteria? If so, why? I have no experience, knowledge, or opinion about "dask/distributed". Maybe it *is* a compelling example, or maybe not. IDK.
Yes, his performance analysis really is top notch.

sortedcontainers looks great! I very much appreciate it if such well-done library have type hints (embedded or at least in form of types-sortedcontainers pypi package). From my experience of multidict library support, it is not a super-hard challenge but something that needs attention and time resource from the library maintainers or third-party volunteers. On Thu, Nov 11, 2021 at 9:06 PM Richard Levasseur <richardlev@gmail.com> wrote:
-- Thanks, Andrew Svetlov

On Thu, Nov 11, 2021 at 11:01:32AM -0800, Richard Levasseur wrote:
By that reasoning, we shouldn't have dicts, for exactly the same reason: for anyone wanting an associative array, there are so many implementation variants to choose from: - hash table with linear probing - hash table with chaining - AVL tree - red-black tree - judy array - splay tree - treap - scapegoat tree and many more, and most of them can be tuned. If Python didn't already have dict, your argument against SortedList would equally apply to it: they are "fundamental data structures and implementing, testing, and verifying them is very much non-trivial". So if your argument is correct, that would imply that standardizing on one single dict implementation, one that isn't even tunable by the caller, was a mistake. We should have provided a dozen different hash tables, trees and treaps. But... we didn't, and I don't think that Python is a worse language because we only have one associative array implementation in the stdlib. Whenever I need an associative array, I don't lose sleep over whether I could get 2% better performance for hits, at a cost of 17% worse performance for misses, by swapping over to some other implementation. I just reach for dict, knowing that it will almost always be good enough.
I believe that sortedcontainers.SortedSet provides that functionality via the irange() method. http://www.grantjenks.com/docs/sortedcontainers/sortedlist.html#sortedcontai... -- Steve

I think Richard's point was two-fold: People usually don't want or need this kind of thing /except/ when they have some very specific performance requirements, in which case they probably want to also be very specific about the kind of container they are using rather than using an abstract "sorted container". It is true that people sensitive to performance may really care about the way dict is implemented, but there are a great many uses for associative arrays in general. I knew about sortedcontainers and I also don't remember ever seeing a situation where I needed one or recommended its use. Maybe it would be useful in leetcode-style interview questions (which may by itself be a good reason to include it in the standard library, since a lot of those interviewers will let you use `collections.deque` or `collections.OrderedDict` without implementing any sort of linked listed, but they won't let you import a module that provides a trie or something)? I'm fairly neutral on this proposal. I do think the standard library is a good place for these sorts of fundamental data structures (even abstract ones), but I do agree with Richard that they are pretty niche. In almost any situation where you'd want / need something like this, I feel like adding a PyPI dependency would not be a big deal. One thing that I do think might be valuable is if the plan were to re-implement these as C extensions. Maintaining and distributing C extensions on PyPI is in many ways more difficult than bundling them directly into CPython. That said, doing so adds maintenance burden and implementation cost (and it's a different proposition than merely adopting an existing library), so I'm probably -0.0 on the whole thing — the absence of these types of containers in the standard library is not an obvious lack, and getting them from PyPI in the situations where you actually need them doesn't seem like a big deal, particularly when it's a pure Python impementation. On 11/11/21 21:43, Steven D'Aprano wrote:

On Fri, Nov 12, 2021 at 10:07:13AM -0500, Paul Ganssle wrote:
I knew about sortedcontainers and I also don't remember ever seeing a situation where I needed one or recommended its use.
We have a very odd situation where apparently sortedcontainers is one of the most well-known, popular, most heavily downloaded libraries on PyPI. According to here: https://hugovk.github.io/top-pypi-packages/ sortedcontainers is the 290th most popular package on PyPI, ahead of such luminaries as pylint, black, selenium, mypy, django and nose. And yet, nobody(?) admits to either using it or knowing what it could be used for. How very curious :-/ -- Steve

On Sat, Nov 13, 2021 at 3:20 AM Steven D'Aprano <steve@pearwood.info> wrote:
I think that proves the value of download counts: very little. The highest on the list is a thing called botocore, which I've never heard of. What is it? It's a dependency of a number of Amazon web services. My guess is that it's a dependency of popular tools - or maybe of automatically-installed tools, even - while not itself being well known. Actually, quite a few of the most popular packages on that list are Amazon-related, so quite possibly they're all being installed as a set in response to one single *actual* dependency. (Download counts DO most likely indicate that something is heavily used, though, so if people are trying to plan out a benchmark suite, then that might be more useful. But I'm not really surprised that a heavily-downloaded package isn't itself well known.) ChrisA

On 12.11.2021 17:10, Steven D'Aprano wrote:
Those download stats can be misleading. Packages are often pulled in as a dependency of other packages which are popular or used a lot in CI/CD setups. Perhaps there's a reverse dependency graph we could use to find out why the package is downloaded this often. I remember having seen a project which does this, but have lost the URL. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Nov 12 2021)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

I believe this is the URL we are looking for: https://www.wheelodex.org/projects/sortedcontainers/rdepends/?page=1 <https://www.wheelodex.org/projects/sortedcontainers/rdepends/?page=1> Best, Bob

On 12.11.2021 17:46, Bob Fang wrote:
Yes, thanks, that was the one. The listing doesn't seem to solve the puzzle either, though. There are a few high profile projects using the package, but those are not high up on the download ranking, so don't explain the high download counts of the package. BTW: This reverse dependency ranking on the website gives a good idea of how important a package is for the Python package eco system: https://www.wheelodex.org/rdepends-leaders/ (sort of like the Page rank for Python packages) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Nov 12 2021)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

I started writing up a SortedDict use case I have, but it's very elaborate and I expect it would just end with endless pointless argument about other approaches I _could_ take. But I already know all those ;-) So let's look at something conceptually "dead easy" instead: priority queues. They're a basic building block in algorithms from many fields. In the distributed Python, `heapq` is the only thing that comes close to being reasonably efficient and scalable for that. However: - It only supports min-heaps. It's not that max-heaps aren't also useful (indeed, _under the covers_ CPython also implements max-heaps for its own internal use). It's more that the original API simply doesn't generalize cleanly. - So there's a body of word-of-mouth "tricks" for making a min-heap "act like" a max-heap. For integer or float priorities, it suffices to use the negation of "the real" priority. For other kinds of values, it gets trickier. In the end, you can wrap each item in a class that implements `x.__lt__(y)` by doing `y.value < x.value`. But few people appear to be aware of that, while nobody wants to endure the bother ;-) - It's often the case that "the priority" needs to be computed from an item's value. `heapq` has no direct support for that. Instead there's another body of word-of-mouth tricks akin to the old decorate-sort-undecorate approach sorting used. Rather than store the values in the heap, you store `(priority, value)` tuples. But then various unpleasant things can happen if two priorities are tied: in that case tuple comparison falls back to comparing the values. But, e.g., value comparison may be very expensive, or the values may not support __lt__ at all. So instead you store `(priority, unique_integer, value)` triples. And hope that you don't really need a max-heap, lest it get even more obscure. Using SortedContainers, none of those are issues. It's all straightforward and obvious. If you have a list L always sorted by priority, extracting from a "min queue" just requires L.pop(0), or from a "max queue" L.pop(-1) (or L.pop() - -1 is the default, same as for the Python list.pop()). No tricks are needed. Fine too if sometimes you want to extract the min, while at other times the max. Or, e.g., peek at the 5 highest-priority values and pick the one that best fits resources available at the time. In cases of computed priorities, the SortedKeyList constructor allows specifying a function to be used to compute the keys used to determine the list's ordering. Again straightforward and obvious. from sortedcontainers import SortedKeyList L = SortedKeyList([(i, j) for i in range(1, 5) for j in range(1, 5)], key=lambda t: t[0]/t[1])
That said, in most of my code I end up _removing_ uses of SortedContainers, in favor of faster ways of getting the job done. The package isn't the last resort for me, it's the first. I can write code in straightforward ways that scale well from the start. If I need more speed later, fine, I can puzzle out faster ways to get the task done. If you're proud of that you _start_ by plotting ways to minimize the number of sorts you do, you're coding in C++, not Python ;-) So I suggest people are staring at the wrong end when asking for use cases that can't be done without the package. Those are necessarily non-trivial. Having a sorted collection is "more than good enough" for many tasks. As above, e.g., there's no reason to imagine that Grant Jenks had priority queues in mind at all, yet the flavors of sorted lists he implemented are very easy to use for that task.

Yeah, a datapoint I didn't see mentioned is searching on public github repos for import sortedcontainers (which includes from sortedcontainers import). Obviously it's just one datapoint but it shows a very small count compared to other packages mentioned when looking at many of the stats available: https://github.com/search?q=import+sortedcontainers+language%3APython&type=code https://github.com/search?q=import+pylint+language%3APython&type=code https://github.com/search?q=import+black+language%3APython&type=code https://github.com/search?q=import+mypy+language%3APython&type=code https://github.com/search?q=import+django+language%3APython&type=code Damian On Fri, Nov 12, 2021 at 11:39 AM Marc-Andre Lemburg <mal@egenix.com> wrote:

And yet, nobody(?) admits to either using it or knowing what it could be used for. How very curious :-/
trio (which IMHO is a somewhat high profile uses it): https://github.com/python-trio/trio/blob/master/trio/_core/_run.py#L27 <https://github.com/python-trio/trio/blob/master/trio/_core/_run.py#L27> And so if I am not mistaken any project that depends on trio will automatically need to have sortedcontainers as its dependency? I am not saying this alone will explain why sortedcontainers have large number of downloads, but with one or two projects like this it will push the number up maybe? Best, Bob

And yet, nobody(?) admits to either using it or knowing what it could be used for. How very curious :-/
trio (which IMHO is a somewhat high profile uses it): https://github.com/python-trio/trio/blob/master/trio/_core/_run.py#L27 <https://github.com/python-trio/trio/blob/master/trio/_core/_run.py#L27> And so if I am not mistaken any project that depends on trio will automatically need to have sortedcontainers as its dependency? I am not saying this alone will explain why sortedcontainers have large number of downloads, but with one or two projects like this it will push the number up maybe? Best, Bob
participants (22)
-
Alex Waygood
-
Andrew Svetlov
-
Antoine Pitrou
-
Bob Fang
-
Bob Fang
-
Chris Angelico
-
Christopher Barker
-
Damian Shaw
-
Dan Stromberg
-
Daniel Pope
-
David Mertz, Ph.D.
-
Eric V. Smith
-
Hasan Diwan
-
Marc-Andre Lemburg
-
Paul Bryan
-
Paul Ganssle
-
Paul Moore
-
Richard Levasseur
-
Serhiy Storchaka
-
Steven D'Aprano
-
Tim Peters
-
Victor Stinner