
There are quite a few important algorithms which require lists to be sorted. For example, the bisect module, and for statistics median and other quantiles. Sorting a list is potentially expensive: while Timsort is very efficient, it is still ultimately an O(N log N) algorithm which limits how efficient it can be. Checking whether a list is sorted is O(N). What if we could check that lists were sorted in constant time? Proposal: let's give lists a dunder flag, __issorted__, that tracks whether the list is definitely sorted or not: - Empty lists, or lists with a single item, are created with __issorted__ = True; lists with two or more items are created with the flag set to False. - Appending or inserting items sets the flag to False. - Deleting or popping items doesn't change the flag. - Reversing the list doesn't change the flag. - Sorting it sets the flag to True. (The sort method should NOT assume the list is sorted just because the flag is set.) Functions that require the list to be sorted can use the flag as a quick check: if not alist.__issorted__: alist.sort() ... The flag will be writable, so that functions such as those in bisect can mark that they have kept the sorted invariant: bisect.insort(alist, x) assert alist.__issorted__ Being writable, the flag is advisory, not a guarantee, and "consenting adults" applies. People can misuse the flag: alist = [1, 4, 2, 0, 5] alist.__issorted__ = True but then they have nobody to blame but themselves if they shoot themselves in the foot. That's no worse than the situation we have now, were you might pass an unsorted list to bisect. The flag doesn't guarantee that the list is sorted the way you want (e.g. biggest to smallest, by some key, etc) only that it has been sorted. Its up to the user to ensure they sort it the right way: # Don't do this and expect it to work! alist.sort(key=random.random) bisect.insort(alist, 1) If you really want to be sure about the state of the list, you have to make a copy and sort it. But that's no different from the situation right now. But for those willing to assume "consenting adults", you might trust the flag and avoid sorting. Downsides: - Every list grows an extra attribute; however, given that lists are already quite big data structures and are often over-allocated, I don't think this will matter much. - insert(), append(), extend(), __setitem__() will be a tiny bit slower due to the need to set the flag. Thoughts? -- Steven

On Sun, Apr 7, 2019 at 7:37 PM Steven D'Aprano <steve@pearwood.info> wrote:
But this flag doesn't affect those modules, right? 'bisect' already requires the user to ensure that the list is sorted appropriately, and this bit:
...seems to mean that the 'statistics' module can't use this flag either. It doesn't seem very likely to me that the savings from this flag could outweigh the extra overhead it introduces, just because list operations are *so* common in Python. If you want to push this forward, the thing I'd most like to see is some kind of measurements to demonstrate that average programs will benefit. -n -- Nathaniel J. Smith -- https://vorpus.org

On Sun, Apr 07, 2019 at 08:26:24PM -0700, Nathaniel Smith wrote:
Naturally the bisect and statistics modules (or any other that requires sorting) won't change to inspect this flag by magic, the code will require modification. Possibly the maintainer of bisect may decide that its not worth the change. But for the statistics module, I would certainly change the implementation of median() to look something vaguely like this: # was data = sorted(data) # may be expensive if data is large # may become if not (isinstance(data, list) and data.__issorted__): data = sorted(data) statistics is soon to grow a quantiles() function, but the thing with quantiles is often you want to get a bunch of them: # This only makes sense if data is a sequence (list) # not an iterator. quartiles = quantiles(data, n=4) quintiles = quantiles(data, n=5) deciles = quantiles(data, n=10) percentiles = quantiles(data, n=100) That's four calls to sorted(). The caller could drop that down to one: data.sort() quartiles = ... etc Now before anyone else mentions it, we could give the function a "dont_sort" argument, or "already_sorted" if you prefer, but I dislike that kind of constant-bool parameter and would prefer to avoid it.
"Consenting adults" applies. If you want to pass an unsorted list to the functions, but pretend that its sorted, then on your own head be it. There's no real difference between these two hypothetical scenarios: data = [1, 4, 2, 0, 5, 3] garbage = median(data, already_sorted=True) versus: data = [1, 4, 2, 0, 5, 3] data.__issorted__ = True garbage = median(data) I'm perfectly comfortable with allowing the caller to lie if they want. Its their own foot they're shooting. (I wouldn't be so blasé about this if it were a function written in C that could segfault if the list wasn't sorted.)
I'm not sure that the average program uses sort *at all*, so a better set of questions are: - how much will this hurt the average program? (my gut feeling is "very little", but we'd need benchmarks to know) - are there any other use-cases for sorted data that could benefit? - how much will they benefit? Let's say, for the sake of the argument that this proposal makes the average program 0.01% slower, but helps sorting-heavy programs be 2% faster when dealing with large lists, then I think that might be a win. (Remember, this is about large lists. For small enough lists, the cost of sorting them in minor.) I can't benchmark the cost of setting the flag accurately (it would want to be done in C, doing it in pure Python using a subclass is very expensive). But for what its worth, on my old and slow computer, it takes 13.5 seconds to sort in place this list of 50 million integers: L = list(range(10000000))*5 and 15 seconds to make a sorted copy. Once sorted, subsequent sorts are faster, but still time consuming, about 2.5 seconds for an in-place sort. So on my machine, this will save about 2-3 seconds per sort that I can avoid doing. More if the list is copied. (I know it is totally unfair to benchmark the benefit without also benchmarking the cost. Deal with it :-) -- Steven

On Mon, 8 Apr 2019 at 10:10, Steven D'Aprano <steve@pearwood.info> wrote:
So just to be clear, this would be a change in behaviour - for a list that is currently sorted on a key that is *not* "numerically ascending", the code will no longer re-sort, and so will give the wrong answer? (Maybe the code is able to deal with either ascending or descending orders, I don't know about that, but even so, that just makes the failure rarer, not non-existent). I'm not saying that this is forbidden, just want to be clear if that's what you mean (because my difficulty with the proposed attribute is that it seems unreliable enough that I can't imagine a case where I'd feel comfortable using it myself). Paul PS I thought timsort was highly efficient given already sorted data? Whether it's OK to rely on that I don't know, but did you take that into account?

On Mon, Apr 08, 2019 at 10:34:19AM +0100, Paul Moore wrote:
Correct. A thought comes to mind... Currently the two variant forms of median (median_low and _high) can work with non-numeric data: # don't take this example too seriously py> statistics.median_low(['low', 'high', 'high', 'low', 'very low']) 'low' so perhaps there is a use-case for non-numeric sorts.
Fair enough, but does it help you feel a bit better about the feature if we called it a "sort hint", and emphasized that it should only be used in the same sort of APIs where you might allow the caller to pass already_sorted=True to skip a redundant sort step?
Yes, timsort is very efficient with already sorted data, and I have :-) On my computer, to sort 50 million integers in place takes about 13.5 seconds; to sort it the next time takes 2.7 seconds. (That's effectively just galloping along the list, looking for out-of-place items and not finding any.) To anyone tempted to say "Get a better computer", if I had a better computer I'd be using a list of 50 billion integers and it would still take 2.7 seconds :-) -- Steven

On Mon, 8 Apr 2019 at 11:27, Steven D'Aprano <steve@pearwood.info> wrote:
Not really, because already_sorted=True is an explicit claim by the user, whereas the (internally maintained) dunder attribute is the interpreter maintaining a (not guaranteed reliable but pessimistic to be as safe as possible) view on whether the data is sorted. But I see elsewhere in the thread that you're now more inclined towards an explicit already_sorted flag, so I won't labour the point. Paul

On Mon, Apr 8, 2019, 02:09 Steven D'Aprano <steve@pearwood.info> wrote:
Right, by "doesn't affect" I meant "cannot get any benefit, even if their code is modified". Possibly the maintainer of bisect may decide that its not worth the
If only we had some kind of API that could compute multiple quantiles at the same time...
An already_sorted=True argument would be an explicit opt in, and consenting adults would apply. But your message was very explicit that __issorted__ can be set implicitly, though. For example, this would give garbage results: # implicitly sets the sorted flag data.sort() # preserves the flag, because hey it's sorted by *some* key data.reverse() statistics.median(data) You can't use this in statistics.median because it would break compatibility. Also, isn't the whole point of 'statistics' to be the simple, reliable module for folks who aren't that worried about speed? This seems like a massive footgun.
(I wouldn't be so blasé about this if it were a function written in C that could segfault if the list wasn't sorted.)
Silently giving the wrong answer is way worse than a segfault.
Obviously these are made up numbers, but if they were real then for it to be a net win you would still need at least 1 in 200 programs to be "sorting heavy" in a way that could benefit from this flag, and I don't believe that's true. -n

On Mon, Apr 08, 2019 at 02:55:54AM -0700, Nathaniel Smith wrote:
Right, by "doesn't affect" I meant "cannot get any benefit, even if their code is modified".
Ah, sorry I misunderstood you.
You mean something like this? quartiles, quintiles, deciles, percentiles = quantiles(data, n=(4, 5, 10, 100)) Yuck :-) I'd rather allow an already_sorted=True parameter :-)
It would certainly be putting more responsibility on the caller to ensure the sorted flag was correct.
You can't use this in statistics.median because it would break compatibility.
I would argue differently.
Perhaps. Perhaps not as massive as it seems. The expected use-case would be something like this: data = get_data() a = median(data) data.sort() b = median(data) # Like a, but faster. To be a problem, the caller would need to do something like: data = get_data() a = median(data) data.sort(reversed=True) # or some weird key function b = median(data) # Garbage result. I can't see people shooting themselves in the foot in this way by accident very often. But fair enough, it is a risk.
It depends on what you're worried about, and who gets the blame for the wrong answer. As I understand it, it is the position of the core-devs that *any* seg fault in the interpreter or stdlib is a serious bug that must be fixed (possibly excepting ctypes); but if Python code returns garbage when you give it garbage input, that may or may not be considered a bug. In this case, passing a list with the flag set when it is not actually sorted correctly would be a "Garbage In, Garbage Out" error, just as if they had explicitly passed a already_sorted=True argument. But I take your earlier point that the argument version is explicit and opt-in, while the flag is implicit and may not be opt-in. Given all the points already raised, I think that an explicit SortedList might be more appropriate. Thanks everyone for all the feedback. -- Steven

On 4/8/19 6:25 AM, Steven D'Aprano wrote:
On Mon, Apr 08, 2019 at 02:55:54AM -0700, Nathaniel Smith wrote:
I'd rather have a separate function. I can't think of a good name for it right now, but it'd be just like quantiles, except that you'd pass it a sorted list. Conceptually, it'd fit in like this: def quantiles(data, n): if isinstance(data, list): # or some other appropriate test sorted_data = sorted(data) else: sorted_data = data quantiles_of_sorted_data(sorted_data, n) def quantiles_of_sorted_data(sorted_data, n): ...actual quantiles functionality here... Properly constructed programs that would have called quantiles repeatedly can call quantiles_of_sorted_data repeatedly instead. The ability to shoot a foot remains unchanged.

On Mon, Apr 8, 2019, 6:26 AM Steven D'Aprano <steve@pearwood.info> wrote:
Given all the points already raised, I think that an explicit SortedList might be more appropriate.
This one looks cool. I've read about it, but haven't used it: http://www.grantjenks.com/docs/sortedcontainers/ I think a "sort hint" should be a read-only attribute of some collections, and statistics should look for the presence of that. You might need to get third parties to follow that convention, but you're no worse than now if they don't. Burdening list with a vague suggestion about an invariant it doesn't maintain is a bad idea.

On 08Apr2019 12:32, Steven D'Aprano <steve@pearwood.info> wrote:
__setitem__ concerns me, along with other modification methods: what about subclasses(*)? Every existing subclass which overrides __setitem__ now needs to grow code to maintain __issorted__ if they do not themselves call list.__setitem__. Also, should this imply an issorted() builtin to consult an instance's __issorted__ dunder flag? Should such a builtin return False for instances without an __issorted__ flag? I'm thinking yes since the flag is intended to mean known-to-be-sorted. Cheers, Cameron Simpson <cs@cskk.id.au> * I _know_ subclassing builtins is discouraged, but it is supported and can be done if one is conservative.

On Mon, Apr 08, 2019 at 01:31:14PM +1000, Cameron Simpson wrote:
Well, yeah. That's what happens if you subclass: you are responsible for maintaining the invariant if you don't call the superclasses. Maybe there's a cunning way to avoid this, but that will make the implementation more complex and probably tips this proposal from "maybe worth thinking about" to "not a chance". But I'd be inclined to just pass the buck to the subclass: if you want to maintain the invariant, then you have to maintain it, or call the appropriate super methods. Now that you've reminded me of subclasses, I would make one other change to the specs: all lists, including empty ones, are created with the flag set to False. That way subclasses which *don't* maintain the invariant will always be flagged as unsorted (unless the caller explicitly sets the flag themselves).
I don't think this proposal is worth adding a builtin function. Not unless somebody thinks of some more very compelling use-cases. Perhaps an inspect.sort_hint() function. -- Steven

On 4/7/2019 10:32 PM, Steven D'Aprano wrote:
Does the CPython C-coded list structure have a unused bit that could be used for this? (I realized that other implementations might have a different answer.) Dunder names are not intended for directly use in code. If __issorted__ is a property, it could instead by .is_sorted or a new .is_sorted method, where is_sorted(bool) sets the property. -- Terry Jan Reedy

On 08Apr2019 00:17, Terry Reedy <tjreedy@udel.edu> wrote:
Dunders are ok to use in implementation code (class internal). I agree it's not nice to access from outside the class. I was imagining a builtin that calls somethings .__issorted__ method. But Steven's suggesting a flat dunder attribute rather than a callable method. If this is a publicly queriable value, is there any need to have a dunder name at all? Why not just give lists a public is_sorted attribute? I'm also not convinced the cost to every insert/append is worth the (subjectively to me) highly infrequent use. I imagine one could increase the utility of the flag by implementing insert/append with a bit of logic like: if self.__issorted__: check-previous/next elements to see if sortedness is preserved so that a list constructed in sorted order may keep the flag. However, it seems to me that such a list would accrue an O(n) cost with all of those O(1) checks over the whole construction, and so not be cheaper than just checking for sortedness aonce at the end before use. Conjecture: anything that requires a sorted list but _accepts_ an unsorted list (eg outer code which uses bisect or median) needs to check for sortedness and sort if not already sorted. So it feels to me like we've got an minimum O(n) cost regardless in most circumstances, including constructing a list in sorted order from scratch. And therefore the additional complexity added to insert/append/__setitem__ probably outweigh the value of valuing an O(1) check at the end; especially since while True is supposed to imply sortedness, False doesn't imply out-of-order, just that we need to check when sortedness is required. Cheers, Cameron Simpson <cs@cskk.id.au>

On Mon, 8 Apr 2019 at 06:19, Cameron Simpson <cs@cskk.id.au> wrote:
Morning all, I think a better abstraction for a sorted list is a new class, which implements the Sequence protocol (and hence can be used in a lot of existing list contexts), but only exposed mutation methods that can guarantee that sorted order can be maintained (and hence is _not_ a MutableSequence). AFAICT that means adding an `insert` method on top of the standard read-only methods of a list and can be implemented easily using the `bisect` module. I think this is a better option, as it allows you to rely on that sorted order, rather than it being a convention. Thanks, Alex

On Mon, Apr 08, 2019 at 07:44:41AM +0100, Alex Chamberlain wrote:
Perhaps that's a better idea.
(and hence is _not_ a MutableSequence).
Right, but it can still be mutable, so long as the mutation methods can maintain the invariant. That means: - the SortedList needs to know the sort direction; - and the key used for sorting; - no slice or item assignment; - insertions are okay, since the SortedList can put them in the correct place; - but not append; - deletions are okay, since they won't change the sort invariant (at least not for items with a total order). -- Steven

How do you handle a list of objects that after insertion and being sorted change their sort key and thus make the list unsorted. From the point of view of the list nothing changed, but its not sorted now. Think if a list of file stat info objects that are sorted by size. As the files are written to the size changes. I can loop over the objects and tell them to update the stats. Now the __is_sorted__ property is wrong. Barry

On Mon, 8 Apr 2019 at 22:07, Barry Scott <barry@barrys-emacs.org> wrote:
Evening Barry, I think you're right to recognise that as a risk, but I consider it similar to the risk of someone defining `__hash__` on a mutable object, putting it in a `set` then mutating it. We'd have to make the contract of `SortedList` clear, but I think it would be very difficult to stop the determined user from breaking the contract if they so wished. To be clear, that contract would be that although storing mutable objects is supported, once in the SortedList, a mutable object must not be changed in such a way as to change its key. Alex

On Mon, Apr 08, 2019 at 03:18:53PM +1000, Cameron Simpson wrote:
I don't mind that.
That's not practical unless the list remembers what sort of sort order it is supposed to have: - sorted smallest to biggest; - or biggest to smallest; - using what key. That might be appropriate for a dedicated SortedList class, but not for generic lists. But a SortedList class probably shouldn't support operations which break the sorted invariant.
Well, yes, somebody has to sort the list at least once, otherwise it won't be sorted :-) Currently bisect simply trusts that the list is sorted and makes no effort to even check. The API basically says: Pass a sorted list, or you'll get garbage. With this check in place, it is *possible* to change the API to: Pass a sorted list, or I'll sort it for you; if you lie, you'll get garbage. (Whether the maintainer of bisect thinks this is a good API change, I don't know.) The median (and soon, quantiles) API says: I'm going to sort the list for you, whether you need it or not, just in case you do. It could become: I'm going to sort the list for you, if necessary. If you lie about it already being sorted, you'll get garbage. -- Steven

On Mon, 8 Apr 2019 at 10:34, Steven D'Aprano <steve@pearwood.info> wrote:
Hmm, I didn't see that mentioned in the docs. It makes a difference to my comment about behaviour change, I think.
But regardless, this would be the same potential behaviour change I mentioned. (In the actual docs, it should probably be explicit about the sort order, as someone who passes a list sorted in the wrong way might not *think* they were lying, just that they hadn't sorted the list... Still not convinced this is safe enough to be worth it ;-) Paul

On Mon, Apr 8, 2019, 5:46 AM Paul Moore <p.f.moore@gmail.com> wrote:
Still not convinced this is safe enough to be worth it ;-)
I'm convinced it's NOT safe enough to be worth it. On the other hand, a sortedlist subclass that maintained its invariant (probably remembering a key) sounds cool. I think there are one or more good ones on PyPI. Statistics should simply learn to recognize external always-sorted data structures.

Basically, I like this idea and of course theoretically it has use cases. But in the proposed form, it feels strongly curtailed. In practice, what does it mean to be sorted? [('a', 1),('b', 2),('bb', 4),('aaa', 3)] # Is this list sorted? [('a', 1),('aaa', 3),('b', 2),('bb', 4)] # or this? [('a', 1),('b', 2),('aaa', 3),('bb', 4)] # or that? The right answer is that they are all sorted by some means. So, if you offer this __is_sorted__ attribute only for a very special case 1d list of numbers - this makes no sense. (Just re-read the recent thread, why .join is the string method and not the *list *method). On the other hand If you offer *some sort of a general protocol* storing a sort key or some other useful information, then this is awesome! But is this achievable in practice? with kind regards, -gdg пн, 8 апр. 2019 г. в 05:38, Steven D'Aprano <steve@pearwood.info>:

looks for me that means that all subclass of list has to maintain the __isorted__ invariant, looks like not a backward compatible modification and quite problematic in case of the invariant broken. So if I'm correct the design is brittle. Moreover, that solve only the strict subset of sort case where there is no key function, in ascendant order, .... If one want to make a subclass of list having this properties, that don't look hard to do and far more safer. (of course the problem is that there is two similar list classes and the best one is not builtin) Le 08/04/2019 à 04:32, Steven D'Aprano a écrit :

It surely would be helpful. I still find it a bit too single-case oriented. Maybe having an equivalent __sorting_algo__ property with a value of the current sorting algorithm would be more general? There could be a SortingAlgo base class, which could be extended into classes like: - SortingAlgoNone(SortingAlgo) or SortingAlgoUnsorted(SortingAlgo) which would be the default for non-sorted lists (or just the value None) - SortingAlgoAscending(SortingAlgo) - SortingAlgoAscendingNumbers(SortingAlgoAscending) - MyCustomSortingAlgo(SortingAlgo) - ... It would allow to mark a list as sorted with any algorithm, and of course, any code that would use these lists would be able to read/write this __sorting_algo__. And complementary idea, we could have an extra arg to sort() (and other functions like this one) like `trust_declared_algo=True`, that if set would only sort the list if its list.__sorting_algo__ is compatible (a subclass of the sorting algo it uses, or the class itself). The rest of the behaviours (when the __sorting_algo__ would be set or reset) would be as described by Steven in the original proposal. -Brice Le 8/4/19 à 4:32, Steven D'Aprano a écrit :

I are the appeal, but this really seems too special purpose for yet another dunder. A SortedList might be a good addition to the statistics package, however. I note that with regard to sorting, I suggested a __sort_key dunder so that the sorting routines could know how to efficiently sort arbitrary objects. It was pretty soundly rejected as adding too much overhead for a special case. This is different, but still a special case. It’s easy to think that everyone uses the language like you do, but needling pre-sorted lists is really a pretty special case. And if performance really matters, you should probably be using numpy anyway. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Sun, Apr 7, 2019 at 7:37 PM Steven D'Aprano <steve@pearwood.info> wrote:
But this flag doesn't affect those modules, right? 'bisect' already requires the user to ensure that the list is sorted appropriately, and this bit:
...seems to mean that the 'statistics' module can't use this flag either. It doesn't seem very likely to me that the savings from this flag could outweigh the extra overhead it introduces, just because list operations are *so* common in Python. If you want to push this forward, the thing I'd most like to see is some kind of measurements to demonstrate that average programs will benefit. -n -- Nathaniel J. Smith -- https://vorpus.org

On Sun, Apr 07, 2019 at 08:26:24PM -0700, Nathaniel Smith wrote:
Naturally the bisect and statistics modules (or any other that requires sorting) won't change to inspect this flag by magic, the code will require modification. Possibly the maintainer of bisect may decide that its not worth the change. But for the statistics module, I would certainly change the implementation of median() to look something vaguely like this: # was data = sorted(data) # may be expensive if data is large # may become if not (isinstance(data, list) and data.__issorted__): data = sorted(data) statistics is soon to grow a quantiles() function, but the thing with quantiles is often you want to get a bunch of them: # This only makes sense if data is a sequence (list) # not an iterator. quartiles = quantiles(data, n=4) quintiles = quantiles(data, n=5) deciles = quantiles(data, n=10) percentiles = quantiles(data, n=100) That's four calls to sorted(). The caller could drop that down to one: data.sort() quartiles = ... etc Now before anyone else mentions it, we could give the function a "dont_sort" argument, or "already_sorted" if you prefer, but I dislike that kind of constant-bool parameter and would prefer to avoid it.
"Consenting adults" applies. If you want to pass an unsorted list to the functions, but pretend that its sorted, then on your own head be it. There's no real difference between these two hypothetical scenarios: data = [1, 4, 2, 0, 5, 3] garbage = median(data, already_sorted=True) versus: data = [1, 4, 2, 0, 5, 3] data.__issorted__ = True garbage = median(data) I'm perfectly comfortable with allowing the caller to lie if they want. Its their own foot they're shooting. (I wouldn't be so blasé about this if it were a function written in C that could segfault if the list wasn't sorted.)
I'm not sure that the average program uses sort *at all*, so a better set of questions are: - how much will this hurt the average program? (my gut feeling is "very little", but we'd need benchmarks to know) - are there any other use-cases for sorted data that could benefit? - how much will they benefit? Let's say, for the sake of the argument that this proposal makes the average program 0.01% slower, but helps sorting-heavy programs be 2% faster when dealing with large lists, then I think that might be a win. (Remember, this is about large lists. For small enough lists, the cost of sorting them in minor.) I can't benchmark the cost of setting the flag accurately (it would want to be done in C, doing it in pure Python using a subclass is very expensive). But for what its worth, on my old and slow computer, it takes 13.5 seconds to sort in place this list of 50 million integers: L = list(range(10000000))*5 and 15 seconds to make a sorted copy. Once sorted, subsequent sorts are faster, but still time consuming, about 2.5 seconds for an in-place sort. So on my machine, this will save about 2-3 seconds per sort that I can avoid doing. More if the list is copied. (I know it is totally unfair to benchmark the benefit without also benchmarking the cost. Deal with it :-) -- Steven

On Mon, 8 Apr 2019 at 10:10, Steven D'Aprano <steve@pearwood.info> wrote:
So just to be clear, this would be a change in behaviour - for a list that is currently sorted on a key that is *not* "numerically ascending", the code will no longer re-sort, and so will give the wrong answer? (Maybe the code is able to deal with either ascending or descending orders, I don't know about that, but even so, that just makes the failure rarer, not non-existent). I'm not saying that this is forbidden, just want to be clear if that's what you mean (because my difficulty with the proposed attribute is that it seems unreliable enough that I can't imagine a case where I'd feel comfortable using it myself). Paul PS I thought timsort was highly efficient given already sorted data? Whether it's OK to rely on that I don't know, but did you take that into account?

On Mon, Apr 08, 2019 at 10:34:19AM +0100, Paul Moore wrote:
Correct. A thought comes to mind... Currently the two variant forms of median (median_low and _high) can work with non-numeric data: # don't take this example too seriously py> statistics.median_low(['low', 'high', 'high', 'low', 'very low']) 'low' so perhaps there is a use-case for non-numeric sorts.
Fair enough, but does it help you feel a bit better about the feature if we called it a "sort hint", and emphasized that it should only be used in the same sort of APIs where you might allow the caller to pass already_sorted=True to skip a redundant sort step?
Yes, timsort is very efficient with already sorted data, and I have :-) On my computer, to sort 50 million integers in place takes about 13.5 seconds; to sort it the next time takes 2.7 seconds. (That's effectively just galloping along the list, looking for out-of-place items and not finding any.) To anyone tempted to say "Get a better computer", if I had a better computer I'd be using a list of 50 billion integers and it would still take 2.7 seconds :-) -- Steven

On Mon, 8 Apr 2019 at 11:27, Steven D'Aprano <steve@pearwood.info> wrote:
Not really, because already_sorted=True is an explicit claim by the user, whereas the (internally maintained) dunder attribute is the interpreter maintaining a (not guaranteed reliable but pessimistic to be as safe as possible) view on whether the data is sorted. But I see elsewhere in the thread that you're now more inclined towards an explicit already_sorted flag, so I won't labour the point. Paul

On Mon, Apr 8, 2019, 02:09 Steven D'Aprano <steve@pearwood.info> wrote:
Right, by "doesn't affect" I meant "cannot get any benefit, even if their code is modified". Possibly the maintainer of bisect may decide that its not worth the
If only we had some kind of API that could compute multiple quantiles at the same time...
An already_sorted=True argument would be an explicit opt in, and consenting adults would apply. But your message was very explicit that __issorted__ can be set implicitly, though. For example, this would give garbage results: # implicitly sets the sorted flag data.sort() # preserves the flag, because hey it's sorted by *some* key data.reverse() statistics.median(data) You can't use this in statistics.median because it would break compatibility. Also, isn't the whole point of 'statistics' to be the simple, reliable module for folks who aren't that worried about speed? This seems like a massive footgun.
(I wouldn't be so blasé about this if it were a function written in C that could segfault if the list wasn't sorted.)
Silently giving the wrong answer is way worse than a segfault.
Obviously these are made up numbers, but if they were real then for it to be a net win you would still need at least 1 in 200 programs to be "sorting heavy" in a way that could benefit from this flag, and I don't believe that's true. -n

On Mon, Apr 08, 2019 at 02:55:54AM -0700, Nathaniel Smith wrote:
Right, by "doesn't affect" I meant "cannot get any benefit, even if their code is modified".
Ah, sorry I misunderstood you.
You mean something like this? quartiles, quintiles, deciles, percentiles = quantiles(data, n=(4, 5, 10, 100)) Yuck :-) I'd rather allow an already_sorted=True parameter :-)
It would certainly be putting more responsibility on the caller to ensure the sorted flag was correct.
You can't use this in statistics.median because it would break compatibility.
I would argue differently.
Perhaps. Perhaps not as massive as it seems. The expected use-case would be something like this: data = get_data() a = median(data) data.sort() b = median(data) # Like a, but faster. To be a problem, the caller would need to do something like: data = get_data() a = median(data) data.sort(reversed=True) # or some weird key function b = median(data) # Garbage result. I can't see people shooting themselves in the foot in this way by accident very often. But fair enough, it is a risk.
It depends on what you're worried about, and who gets the blame for the wrong answer. As I understand it, it is the position of the core-devs that *any* seg fault in the interpreter or stdlib is a serious bug that must be fixed (possibly excepting ctypes); but if Python code returns garbage when you give it garbage input, that may or may not be considered a bug. In this case, passing a list with the flag set when it is not actually sorted correctly would be a "Garbage In, Garbage Out" error, just as if they had explicitly passed a already_sorted=True argument. But I take your earlier point that the argument version is explicit and opt-in, while the flag is implicit and may not be opt-in. Given all the points already raised, I think that an explicit SortedList might be more appropriate. Thanks everyone for all the feedback. -- Steven

On 4/8/19 6:25 AM, Steven D'Aprano wrote:
On Mon, Apr 08, 2019 at 02:55:54AM -0700, Nathaniel Smith wrote:
I'd rather have a separate function. I can't think of a good name for it right now, but it'd be just like quantiles, except that you'd pass it a sorted list. Conceptually, it'd fit in like this: def quantiles(data, n): if isinstance(data, list): # or some other appropriate test sorted_data = sorted(data) else: sorted_data = data quantiles_of_sorted_data(sorted_data, n) def quantiles_of_sorted_data(sorted_data, n): ...actual quantiles functionality here... Properly constructed programs that would have called quantiles repeatedly can call quantiles_of_sorted_data repeatedly instead. The ability to shoot a foot remains unchanged.

On Mon, Apr 8, 2019, 6:26 AM Steven D'Aprano <steve@pearwood.info> wrote:
Given all the points already raised, I think that an explicit SortedList might be more appropriate.
This one looks cool. I've read about it, but haven't used it: http://www.grantjenks.com/docs/sortedcontainers/ I think a "sort hint" should be a read-only attribute of some collections, and statistics should look for the presence of that. You might need to get third parties to follow that convention, but you're no worse than now if they don't. Burdening list with a vague suggestion about an invariant it doesn't maintain is a bad idea.

On 08Apr2019 12:32, Steven D'Aprano <steve@pearwood.info> wrote:
__setitem__ concerns me, along with other modification methods: what about subclasses(*)? Every existing subclass which overrides __setitem__ now needs to grow code to maintain __issorted__ if they do not themselves call list.__setitem__. Also, should this imply an issorted() builtin to consult an instance's __issorted__ dunder flag? Should such a builtin return False for instances without an __issorted__ flag? I'm thinking yes since the flag is intended to mean known-to-be-sorted. Cheers, Cameron Simpson <cs@cskk.id.au> * I _know_ subclassing builtins is discouraged, but it is supported and can be done if one is conservative.

On Mon, Apr 08, 2019 at 01:31:14PM +1000, Cameron Simpson wrote:
Well, yeah. That's what happens if you subclass: you are responsible for maintaining the invariant if you don't call the superclasses. Maybe there's a cunning way to avoid this, but that will make the implementation more complex and probably tips this proposal from "maybe worth thinking about" to "not a chance". But I'd be inclined to just pass the buck to the subclass: if you want to maintain the invariant, then you have to maintain it, or call the appropriate super methods. Now that you've reminded me of subclasses, I would make one other change to the specs: all lists, including empty ones, are created with the flag set to False. That way subclasses which *don't* maintain the invariant will always be flagged as unsorted (unless the caller explicitly sets the flag themselves).
I don't think this proposal is worth adding a builtin function. Not unless somebody thinks of some more very compelling use-cases. Perhaps an inspect.sort_hint() function. -- Steven

On 4/7/2019 10:32 PM, Steven D'Aprano wrote:
Does the CPython C-coded list structure have a unused bit that could be used for this? (I realized that other implementations might have a different answer.) Dunder names are not intended for directly use in code. If __issorted__ is a property, it could instead by .is_sorted or a new .is_sorted method, where is_sorted(bool) sets the property. -- Terry Jan Reedy

On 08Apr2019 00:17, Terry Reedy <tjreedy@udel.edu> wrote:
Dunders are ok to use in implementation code (class internal). I agree it's not nice to access from outside the class. I was imagining a builtin that calls somethings .__issorted__ method. But Steven's suggesting a flat dunder attribute rather than a callable method. If this is a publicly queriable value, is there any need to have a dunder name at all? Why not just give lists a public is_sorted attribute? I'm also not convinced the cost to every insert/append is worth the (subjectively to me) highly infrequent use. I imagine one could increase the utility of the flag by implementing insert/append with a bit of logic like: if self.__issorted__: check-previous/next elements to see if sortedness is preserved so that a list constructed in sorted order may keep the flag. However, it seems to me that such a list would accrue an O(n) cost with all of those O(1) checks over the whole construction, and so not be cheaper than just checking for sortedness aonce at the end before use. Conjecture: anything that requires a sorted list but _accepts_ an unsorted list (eg outer code which uses bisect or median) needs to check for sortedness and sort if not already sorted. So it feels to me like we've got an minimum O(n) cost regardless in most circumstances, including constructing a list in sorted order from scratch. And therefore the additional complexity added to insert/append/__setitem__ probably outweigh the value of valuing an O(1) check at the end; especially since while True is supposed to imply sortedness, False doesn't imply out-of-order, just that we need to check when sortedness is required. Cheers, Cameron Simpson <cs@cskk.id.au>

On Mon, 8 Apr 2019 at 06:19, Cameron Simpson <cs@cskk.id.au> wrote:
Morning all, I think a better abstraction for a sorted list is a new class, which implements the Sequence protocol (and hence can be used in a lot of existing list contexts), but only exposed mutation methods that can guarantee that sorted order can be maintained (and hence is _not_ a MutableSequence). AFAICT that means adding an `insert` method on top of the standard read-only methods of a list and can be implemented easily using the `bisect` module. I think this is a better option, as it allows you to rely on that sorted order, rather than it being a convention. Thanks, Alex

On Mon, Apr 08, 2019 at 07:44:41AM +0100, Alex Chamberlain wrote:
Perhaps that's a better idea.
(and hence is _not_ a MutableSequence).
Right, but it can still be mutable, so long as the mutation methods can maintain the invariant. That means: - the SortedList needs to know the sort direction; - and the key used for sorting; - no slice or item assignment; - insertions are okay, since the SortedList can put them in the correct place; - but not append; - deletions are okay, since they won't change the sort invariant (at least not for items with a total order). -- Steven

How do you handle a list of objects that after insertion and being sorted change their sort key and thus make the list unsorted. From the point of view of the list nothing changed, but its not sorted now. Think if a list of file stat info objects that are sorted by size. As the files are written to the size changes. I can loop over the objects and tell them to update the stats. Now the __is_sorted__ property is wrong. Barry

On Mon, 8 Apr 2019 at 22:07, Barry Scott <barry@barrys-emacs.org> wrote:
Evening Barry, I think you're right to recognise that as a risk, but I consider it similar to the risk of someone defining `__hash__` on a mutable object, putting it in a `set` then mutating it. We'd have to make the contract of `SortedList` clear, but I think it would be very difficult to stop the determined user from breaking the contract if they so wished. To be clear, that contract would be that although storing mutable objects is supported, once in the SortedList, a mutable object must not be changed in such a way as to change its key. Alex

On Mon, Apr 08, 2019 at 03:18:53PM +1000, Cameron Simpson wrote:
I don't mind that.
That's not practical unless the list remembers what sort of sort order it is supposed to have: - sorted smallest to biggest; - or biggest to smallest; - using what key. That might be appropriate for a dedicated SortedList class, but not for generic lists. But a SortedList class probably shouldn't support operations which break the sorted invariant.
Well, yes, somebody has to sort the list at least once, otherwise it won't be sorted :-) Currently bisect simply trusts that the list is sorted and makes no effort to even check. The API basically says: Pass a sorted list, or you'll get garbage. With this check in place, it is *possible* to change the API to: Pass a sorted list, or I'll sort it for you; if you lie, you'll get garbage. (Whether the maintainer of bisect thinks this is a good API change, I don't know.) The median (and soon, quantiles) API says: I'm going to sort the list for you, whether you need it or not, just in case you do. It could become: I'm going to sort the list for you, if necessary. If you lie about it already being sorted, you'll get garbage. -- Steven

On Mon, 8 Apr 2019 at 10:34, Steven D'Aprano <steve@pearwood.info> wrote:
Hmm, I didn't see that mentioned in the docs. It makes a difference to my comment about behaviour change, I think.
But regardless, this would be the same potential behaviour change I mentioned. (In the actual docs, it should probably be explicit about the sort order, as someone who passes a list sorted in the wrong way might not *think* they were lying, just that they hadn't sorted the list... Still not convinced this is safe enough to be worth it ;-) Paul

On Mon, Apr 8, 2019, 5:46 AM Paul Moore <p.f.moore@gmail.com> wrote:
Still not convinced this is safe enough to be worth it ;-)
I'm convinced it's NOT safe enough to be worth it. On the other hand, a sortedlist subclass that maintained its invariant (probably remembering a key) sounds cool. I think there are one or more good ones on PyPI. Statistics should simply learn to recognize external always-sorted data structures.

Basically, I like this idea and of course theoretically it has use cases. But in the proposed form, it feels strongly curtailed. In practice, what does it mean to be sorted? [('a', 1),('b', 2),('bb', 4),('aaa', 3)] # Is this list sorted? [('a', 1),('aaa', 3),('b', 2),('bb', 4)] # or this? [('a', 1),('b', 2),('aaa', 3),('bb', 4)] # or that? The right answer is that they are all sorted by some means. So, if you offer this __is_sorted__ attribute only for a very special case 1d list of numbers - this makes no sense. (Just re-read the recent thread, why .join is the string method and not the *list *method). On the other hand If you offer *some sort of a general protocol* storing a sort key or some other useful information, then this is awesome! But is this achievable in practice? with kind regards, -gdg пн, 8 апр. 2019 г. в 05:38, Steven D'Aprano <steve@pearwood.info>:

looks for me that means that all subclass of list has to maintain the __isorted__ invariant, looks like not a backward compatible modification and quite problematic in case of the invariant broken. So if I'm correct the design is brittle. Moreover, that solve only the strict subset of sort case where there is no key function, in ascendant order, .... If one want to make a subclass of list having this properties, that don't look hard to do and far more safer. (of course the problem is that there is two similar list classes and the best one is not builtin) Le 08/04/2019 à 04:32, Steven D'Aprano a écrit :

It surely would be helpful. I still find it a bit too single-case oriented. Maybe having an equivalent __sorting_algo__ property with a value of the current sorting algorithm would be more general? There could be a SortingAlgo base class, which could be extended into classes like: - SortingAlgoNone(SortingAlgo) or SortingAlgoUnsorted(SortingAlgo) which would be the default for non-sorted lists (or just the value None) - SortingAlgoAscending(SortingAlgo) - SortingAlgoAscendingNumbers(SortingAlgoAscending) - MyCustomSortingAlgo(SortingAlgo) - ... It would allow to mark a list as sorted with any algorithm, and of course, any code that would use these lists would be able to read/write this __sorting_algo__. And complementary idea, we could have an extra arg to sort() (and other functions like this one) like `trust_declared_algo=True`, that if set would only sort the list if its list.__sorting_algo__ is compatible (a subclass of the sorting algo it uses, or the class itself). The rest of the behaviours (when the __sorting_algo__ would be set or reset) would be as described by Steven in the original proposal. -Brice Le 8/4/19 à 4:32, Steven D'Aprano a écrit :

I are the appeal, but this really seems too special purpose for yet another dunder. A SortedList might be a good addition to the statistics package, however. I note that with regard to sorting, I suggested a __sort_key dunder so that the sorting routines could know how to efficiently sort arbitrary objects. It was pretty soundly rejected as adding too much overhead for a special case. This is different, but still a special case. It’s easy to think that everyone uses the language like you do, but needling pre-sorted lists is really a pretty special case. And if performance really matters, you should probably be using numpy anyway. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
participants (13)
-
Alex Chamberlain
-
Barry Scott
-
Brice Parent
-
Cameron Simpson
-
Christopher Barker
-
Dan Sommers
-
David Mertz
-
Kirill Balunov
-
Nathaniel Smith
-
Paul Moore
-
Steven D'Aprano
-
Terry Reedy
-
Xavier Combelle