[Python-ideas] Sorted lists

Steven D'Aprano steve at pearwood.info
Mon Apr 8 05:09:20 EDT 2019

On Sun, Apr 07, 2019 at 08:26:24PM -0700, Nathaniel Smith wrote:
> On Sun, Apr 7, 2019 at 7:37 PM Steven D'Aprano <steve at pearwood.info> wrote:
> > There are quite a few important algorithms which require lists to be
> > sorted. For example, the bisect module, and for statistics median and
> > other quantiles.
> But this flag doesn't affect those modules, right? 'bisect' already
> requires the user to ensure that the list is sorted appropriately

Naturally the bisect and statistics modules (or any other that requires 
sorting) won't change to inspect this flag by magic, the code will 
require modification.

Possibly the maintainer of bisect may decide that its not worth the 
change. But for the statistics module, I would certainly change the 
implementation of median() to look something vaguely like this:

    # was
    data = sorted(data)  # may be expensive if data is large

    # may become
    if not (isinstance(data, list) and data.__issorted__):
        data = sorted(data)

statistics is soon to grow a quantiles() function, but the thing with 
quantiles is often you want to get a bunch of them:

    # This only makes sense if data is a sequence (list)
    # not an iterator.
    quartiles = quantiles(data, n=4)
    quintiles = quantiles(data, n=5)
    deciles = quantiles(data, n=10)
    percentiles = quantiles(data, n=100)

That's four calls to sorted(). The caller could drop that down to one:

    quartiles = ... etc

Now before anyone else mentions it, we could give the function a 
"dont_sort" argument, or "already_sorted" if you prefer, but I dislike 
that kind of constant-bool parameter and would prefer to avoid it.

> and this bit:
> > The flag doesn't guarantee that the list is sorted the way you want
> > (e.g. biggest to smallest, by some key, etc) only that it has been
> > sorted. Its up to the user to ensure they sort it the right way:
> ...seems to mean that the 'statistics' module can't use this flag either.

"Consenting adults" applies. If you want to pass an unsorted list to the 
functions, but pretend that its sorted, then on your own head be it. 
There's no real difference between these two hypothetical scenarios:

    data = [1, 4, 2, 0, 5, 3]
    garbage = median(data, already_sorted=True)


    data = [1, 4, 2, 0, 5, 3]
    data.__issorted__ = True
    garbage = median(data)

I'm perfectly comfortable with allowing the caller to lie if they want. 
Its their own foot they're shooting.

(I wouldn't be so blasé about this if it were a function written in C 
that could segfault if the list wasn't sorted.)

> It doesn't seem very likely to me that the savings from this flag
> could outweigh the extra overhead it introduces, just because list
> operations are *so* common in Python. If you want to push this
> forward, the thing I'd most like to see is some kind of measurements
> to demonstrate that average programs will benefit.

I'm not sure that the average program uses sort *at all*, so a better 
set of questions are:

- how much will this hurt the average program?
  (my gut feeling is "very little", but we'd need benchmarks to know)

- are there any other use-cases for sorted data that could benefit?

- how much will they benefit?

Let's say, for the sake of the argument that this proposal makes the 
average program 0.01% slower, but helps sorting-heavy programs be 2% 
faster when dealing with large lists, then I think that might be a win.

(Remember, this is about large lists. For small enough lists, the cost 
of sorting them in minor.)

I can't benchmark the cost of setting the flag accurately (it would want 
to be done in C, doing it in pure Python using a subclass is very 
expensive). But for what its worth, on my old and slow computer, it 
takes 13.5 seconds to sort in place this list of 50 million integers:

    L = list(range(10000000))*5

and 15 seconds to make a sorted copy. Once sorted, subsequent sorts are 
faster, but still time consuming, about 2.5 seconds for an in-place 

So on my machine, this will save about 2-3 seconds per sort that I can 
avoid doing. More if the list is copied.

(I know it is totally unfair to benchmark the benefit without also 
benchmarking the cost. Deal with it :-)


More information about the Python-ideas mailing list