[Python-ideas] Sorted lists
steve at pearwood.info
Mon Apr 8 05:09:20 EDT 2019
On Sun, Apr 07, 2019 at 08:26:24PM -0700, Nathaniel Smith wrote:
> On Sun, Apr 7, 2019 at 7:37 PM Steven D'Aprano <steve at pearwood.info> wrote:
> > There are quite a few important algorithms which require lists to be
> > sorted. For example, the bisect module, and for statistics median and
> > other quantiles.
> But this flag doesn't affect those modules, right? 'bisect' already
> requires the user to ensure that the list is sorted appropriately
Naturally the bisect and statistics modules (or any other that requires
sorting) won't change to inspect this flag by magic, the code will
Possibly the maintainer of bisect may decide that its not worth the
change. But for the statistics module, I would certainly change the
implementation of median() to look something vaguely like this:
data = sorted(data) # may be expensive if data is large
# may become
if not (isinstance(data, list) and data.__issorted__):
data = sorted(data)
statistics is soon to grow a quantiles() function, but the thing with
quantiles is often you want to get a bunch of them:
# This only makes sense if data is a sequence (list)
# not an iterator.
quartiles = quantiles(data, n=4)
quintiles = quantiles(data, n=5)
deciles = quantiles(data, n=10)
percentiles = quantiles(data, n=100)
That's four calls to sorted(). The caller could drop that down to one:
quartiles = ... etc
Now before anyone else mentions it, we could give the function a
"dont_sort" argument, or "already_sorted" if you prefer, but I dislike
that kind of constant-bool parameter and would prefer to avoid it.
> and this bit:
> > The flag doesn't guarantee that the list is sorted the way you want
> > (e.g. biggest to smallest, by some key, etc) only that it has been
> > sorted. Its up to the user to ensure they sort it the right way:
> ...seems to mean that the 'statistics' module can't use this flag either.
"Consenting adults" applies. If you want to pass an unsorted list to the
functions, but pretend that its sorted, then on your own head be it.
There's no real difference between these two hypothetical scenarios:
data = [1, 4, 2, 0, 5, 3]
garbage = median(data, already_sorted=True)
data = [1, 4, 2, 0, 5, 3]
data.__issorted__ = True
garbage = median(data)
I'm perfectly comfortable with allowing the caller to lie if they want.
Its their own foot they're shooting.
(I wouldn't be so blasé about this if it were a function written in C
that could segfault if the list wasn't sorted.)
> It doesn't seem very likely to me that the savings from this flag
> could outweigh the extra overhead it introduces, just because list
> operations are *so* common in Python. If you want to push this
> forward, the thing I'd most like to see is some kind of measurements
> to demonstrate that average programs will benefit.
I'm not sure that the average program uses sort *at all*, so a better
set of questions are:
- how much will this hurt the average program?
(my gut feeling is "very little", but we'd need benchmarks to know)
- are there any other use-cases for sorted data that could benefit?
- how much will they benefit?
Let's say, for the sake of the argument that this proposal makes the
average program 0.01% slower, but helps sorting-heavy programs be 2%
faster when dealing with large lists, then I think that might be a win.
(Remember, this is about large lists. For small enough lists, the cost
of sorting them in minor.)
I can't benchmark the cost of setting the flag accurately (it would want
to be done in C, doing it in pure Python using a subclass is very
expensive). But for what its worth, on my old and slow computer, it
takes 13.5 seconds to sort in place this list of 50 million integers:
L = list(range(10000000))*5
and 15 seconds to make a sorted copy. Once sorted, subsequent sorts are
faster, but still time consuming, about 2.5 seconds for an in-place
So on my machine, this will save about 2-3 seconds per sort that I can
avoid doing. More if the list is copied.
(I know it is totally unfair to benchmark the benefit without also
benchmarking the cost. Deal with it :-)
More information about the Python-ideas