[issue21592] Make statistics.median run in linear time
Steven D'Aprano
report at bugs.python.org
Sun Jun 1 06:13:55 CEST 2014
Steven D'Aprano added the comment:
I've run some performance tests on six variations of the O(N) select algorithm, based on Tim Peters' and Thomas Ahle's code, comparing them to the naive O(N log N) "sort first" algorithm, and sorting is consistently faster up to the limit I tested.
About the tests I ran:
- I tested four versions of Tim's median-of-median-of-k
algorithm, for k = 7, 23, 47 and 97.
- Thomas' "select" function, which is a median-of-median-of-3.
- Thomas' "select2" function, which uses two pivots.
- Data was randomly shuffled.
- Functions were permitted to modify the data in place, and
were not required to make a copy of the data first. E.g.
I used alist.sort() rather than sorted(alist).
- I ran two separate sets of tests. The first tested individual
calls to the various selection functions, on random data. Each
function got its own copy of the shuffled data.
- The second set of tests called the selection function three
times in a row, using different ranks, and used the average
of the three times.
My test suite is attached if anyone wants to critique it or run it themselves.
Results:
== Single call mode ==
N sort select7 select23 select47 select97 select select2
-------- -------- -------- -------- -------- -------- -------- --------
5000 0.001 0.027 0.004 0.003 0.003 0.005 0.002
10000 0.002 0.008 0.006 0.005 0.005 0.007 0.006
50000 0.014 0.041 0.029 0.027 0.028 0.039 0.035
100000 0.035 0.088 0.069 0.065 0.067 0.132 0.067
500000 0.248 0.492 0.352 0.349 0.345 0.378 0.433
1000000 0.551 1.008 0.768 0.669 0.723 1.007 0.627
2000000 1.173 2.004 1.791 1.335 1.376 3.049 1.108
3000000 1.992 3.282 2.291 2.256 2.299 2.451 1.756
4000000 2.576 4.135 3.130 2.960 2.937 5.022 3.318
5000000 3.568 5.233 3.914 3.504 3.629 4.912 4.458
6000000 4.237 6.233 4.710 4.323 4.514 5.066 3.876
7000000 4.962 7.403 5.447 5.037 5.129 7.053 7.774
8000000 5.854 8.696 6.151 5.963 5.908 8.704 5.836
9000000 6.749 9.540 7.078 6.869 6.985 6.354 3.834
10000000 7.667 10.944 7.621 7.322 7.439 10.092 7.112
11000000 8.400 11.966 8.566 8.284 8.112 10.511 8.184
Total elapsed time: 23.84 minutes
My conclusions from single calls:
Thomas' select() and Tim's select7() as pure Python functions are too slow for serious contention. [Aside: I wonder how PyPy would go with them?]
There's not much difference in performance between the various median-of-median-of-k functions for larger k, but it seems to me that overall k=47 is marginally faster than either k=23 or k=97.
Overall, sorting is as good or better (and usually *much better*) than any of the pure-Python functions for the values of N tested, at least on my computer. C versions may be worth testing, but I'm afraid that is beyond me. Thomas' select2 using dual pivots seems like the most promising.
There are a couple of anomalous results where select2 unexpectedly (to me!) does much, much better than sorting, e.g. for N=9 million. Pure chance perhaps?
The overall trend seems to me to suggest that a pure-Python version of select2 may become reliably faster than sorting from N=10 million or so, at least with random data on my computer. YMMV, and I would expect that will non-random partially sorted data, the results may be considerably different.
== Average of three calls mode ==
N sort select7 select23 select47 select97 select select2
-------- -------- -------- -------- -------- -------- -------- --------
5000 0.001 0.012 0.007 0.008 0.007 0.022 0.007
10000 0.002 0.022 0.015 0.015 0.015 0.041 0.016
50000 0.016 0.125 0.086 0.080 0.085 0.259 0.073
100000 0.037 0.258 0.181 0.155 0.156 0.650 0.137
500000 0.242 1.374 0.950 0.963 1.075 4.828 1.135
1000000 0.564 2.892 1.998 1.952 2.100 5.055 1.721
2000000 1.227 5.822 4.084 3.876 4.070 18.535 3.379
3000000 2.034 8.825 6.264 6.256 5.798 29.206 4.851
4000000 2.761 12.275 8.209 7.767 9.111 38.186 8.899
5000000 3.587 14.829 10.289 10.385 10.685 53.101 8.149
6000000 4.320 17.926 12.925 12.455 12.639 73.876 10.336
7000000 5.237 21.504 15.221 14.740 16.167 87.315 12.254
8000000 6.145 24.503 16.918 15.761 18.430 103.394 16.923
9000000 6.947 26.801 19.993 18.755 20.676 106.303 16.444
10000000 8.113 30.933 21.352 20.341 20.417 102.421 16.987
11000000 9.031 33.912 24.676 23.624 22.448 114.279 18.698
Total elapsed time: 81.39 minutes
In this set of tests, each function is called three times on the same set of data. As expected, once the list is sorted on the first call, sorting it again on the second call is very fast, and so the "sort" column is quite similar to the previous set of tests.
What I didn't expect is just how badly the various other selection functions cope with being called three times on the same list with different ranks. The extreme case is Thomas' select() function. Total time to call it three times on a list of 11 million items is 342 seconds (3*114), compared to 10 seconds to call it once. I expected that having partially ordered the data on the first call, the second and third calls would take less time rather than more. Was I ever wrong. Unless my analysis is wrong, something bad is happening here, and I don't know what it is.
[Aside: this suggests that, unlike sort() which can take advantage of partially ordered data to be more efficient, the other selection functions are hurt by partially ordered data. Is this analogous to simple versions of Quicksort which degrade to O(N**2) if the data is already sorted?]
What is abundantly clear is that if you want to make more than one selection from a list, you ought to sort it first.
Given these results, I do not believe that a pure-python implementation of any of these selection algorithms can be justified on performance grounds for CPython.
Thanks to Tim Peters and Thomas Ahle for their valuable assistance in writing the selection functions in the first place.
----------
Added file: http://bugs.python.org/file35423/select.py
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21592>
_______________________________________
More information about the Python-bugs-list
mailing list