[issue21592] Make statistics.median run in linear time

Sun Jun 1 06:13:55 CEST 2014

Steven D'Aprano added the comment:

I've run some performance tests on six variations of the O(N) select algorithm, based on Tim Peters' and Thomas Ahle's code, comparing them to the naive O(N log N) "sort first" algorithm, and sorting is consistently faster up to the limit I tested.

About the tests I ran:

- I tested four versions of Tim's median-of-median-of-k 
  algorithm, for k = 7, 23, 47 and 97.

- Thomas' "select" function, which is a median-of-median-of-3.

- Thomas' "select2" function, which uses two pivots.

- Data was randomly shuffled.

- Functions were permitted to modify the data in place, and 
  were not required to make a copy of the data first. E.g. 
  I used alist.sort() rather than sorted(alist).

- I ran two separate sets of tests. The first tested individual
  calls to the various selection functions, on random data. Each
  function got its own copy of the shuffled data. 

- The second set of tests called the selection function three
  times in a row, using different ranks, and used the average
  of the three times.

My test suite is attached if anyone wants to critique it or run it themselves.

Results:

== Single call mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.001    0.027    0.004    0.003    0.003    0.005    0.002
   10000    0.002    0.008    0.006    0.005    0.005    0.007    0.006
   50000    0.014    0.041    0.029    0.027    0.028    0.039    0.035
  100000    0.035    0.088    0.069    0.065    0.067    0.132    0.067
  500000    0.248    0.492    0.352    0.349    0.345    0.378    0.433
 1000000    0.551    1.008    0.768    0.669    0.723    1.007    0.627
 2000000    1.173    2.004    1.791    1.335    1.376    3.049    1.108
 3000000    1.992    3.282    2.291    2.256    2.299    2.451    1.756
 4000000    2.576    4.135    3.130    2.960    2.937    5.022    3.318
 5000000    3.568    5.233    3.914    3.504    3.629    4.912    4.458
 6000000    4.237    6.233    4.710    4.323    4.514    5.066    3.876
 7000000    4.962    7.403    5.447    5.037    5.129    7.053    7.774
 8000000    5.854    8.696    6.151    5.963    5.908    8.704    5.836
 9000000    6.749    9.540    7.078    6.869    6.985    6.354    3.834
10000000    7.667   10.944    7.621    7.322    7.439   10.092    7.112
11000000    8.400   11.966    8.566    8.284    8.112   10.511    8.184
Total elapsed time: 23.84 minutes

My conclusions from single calls:

Thomas' select() and Tim's select7() as pure Python functions are too slow for serious contention. [Aside: I wonder how PyPy would go with them?] 

There's not much difference in performance between the various median-of-median-of-k functions for larger k, but it seems to me that overall k=47 is marginally faster than either k=23 or k=97.

Overall, sorting is as good or better (and usually *much better*) than any of the pure-Python functions for the values of N tested, at least on my computer. C versions may be worth testing, but I'm afraid that is beyond me. Thomas' select2 using dual pivots seems like the most promising.

There are a couple of anomalous results where select2 unexpectedly (to me!) does much, much better than sorting, e.g. for N=9 million. Pure chance perhaps?

The overall trend seems to me to suggest that a pure-Python version of select2 may become reliably faster than sorting from N=10 million or so, at least with random data on my computer. YMMV, and I would expect that will non-random partially sorted data, the results may be considerably different.

== Average of three calls mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.001    0.012    0.007    0.008    0.007    0.022    0.007
   10000    0.002    0.022    0.015    0.015    0.015    0.041    0.016
   50000    0.016    0.125    0.086    0.080    0.085    0.259    0.073
  100000    0.037    0.258    0.181    0.155    0.156    0.650    0.137
  500000    0.242    1.374    0.950    0.963    1.075    4.828    1.135
 1000000    0.564    2.892    1.998    1.952    2.100    5.055    1.721
 2000000    1.227    5.822    4.084    3.876    4.070   18.535    3.379
 3000000    2.034    8.825    6.264    6.256    5.798   29.206    4.851
 4000000    2.761   12.275    8.209    7.767    9.111   38.186    8.899
 5000000    3.587   14.829   10.289   10.385   10.685   53.101    8.149
 6000000    4.320   17.926   12.925   12.455   12.639   73.876   10.336
 7000000    5.237   21.504   15.221   14.740   16.167   87.315   12.254
 8000000    6.145   24.503   16.918   15.761   18.430  103.394   16.923
 9000000    6.947   26.801   19.993   18.755   20.676  106.303   16.444
10000000    8.113   30.933   21.352   20.341   20.417  102.421   16.987
11000000    9.031   33.912   24.676   23.624   22.448  114.279   18.698
Total elapsed time: 81.39 minutes

In this set of tests, each function is called three times on the same set of data. As expected, once the list is sorted on the first call, sorting it again on the second call is very fast, and so the "sort" column is quite similar to the previous set of tests.

What I didn't expect is just how badly the various other selection functions cope with being called three times on the same list with different ranks. The extreme case is Thomas' select() function. Total time to call it three times on a list of 11 million items is 342 seconds (3*114), compared to 10 seconds to call it once. I expected that having partially ordered the data on the first call, the second and third calls would take less time rather than more. Was I ever wrong. Unless my analysis is wrong, something bad is happening here, and I don't know what it is.

[Aside: this suggests that, unlike sort() which can take advantage of partially ordered data to be more efficient, the other selection functions are hurt by partially ordered data. Is this analogous to simple versions of Quicksort which degrade to O(N**2) if the data is already sorted?]

What is abundantly clear is that if you want to make more than one selection from a list, you ought to sort it first.

Given these results, I do not believe that a pure-python implementation of any of these selection algorithms can be justified on performance grounds for CPython.

Thanks to Tim Peters and Thomas Ahle for their valuable assistance in writing the selection functions in the first place.

----------
Added file: http://bugs.python.org/file35423/select.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21592>
_______________________________________