[Numpy-discussion] faster (selection based) median, 2013 edition
jtaylor.debian at googlemail.com
Sat May 18 02:12:45 EDT 2013
once again I want to bring up the median algorithm which is implemented
in terms of sorting in numpy.
median (and percentile and a couple more functions) can be more
efficiently implemented in terms of a selection algorithm. The
complexity can them be linear instead of linearithmic.
I found numerous discussions of this in the list archives [1, 2, 3] but
I did not find why those attempts failed, the threads all just seemed to
Did the previous attempts fail due to lack of time or was there a
fundamental reason blocking this change?
In the hope of the former, I went ahead and implemented a prototype of a
partition function (similar to  but only one argument) and
implemented median in terms of it.
partition not like C++ partition, its equivalent to nth_element in C++,
maybe its better to name it nth_element?
The code is available here:
the partition interface is:
kth is an integer
The array is transformed so the k-th element of the array is in its
final sorted order, all below are smaller all above are greater, but the
ordering is undefined
In : d = np.arange(10); np.random.shuffle(d)
In : d
Out: array([1, 7, 0, 2, 5, 6, 8, 9, 3, 4])
In : np.partition(d, 3)
Out: array([0, 1, 2, 3, 4, 6, 8, 9, 7, 5])
In : _ == 3
the performance of median improves as expected:
old vs new, 5000, uniform shuffled, out of place:
100us vs 40us
old vs new, 50000, uniform shuffled, out of place:
1.12ms vs 0.265ms
old vs new, 500000, uniform shuffled, out of place:
14ms vs 2.81ms
The implementation is very much still a prototype, apartition is not
exposed (and only implemented as a quicksort) and there is only one
algorithm (quickselect). One could still add median of medians for
better worst case performance.
If no blockers appear I want to fix this up and file a pull request to
have this in numpy 1.8.
Guidance on details of implementation in numpys C api is highly
appreciated, its the first time I'm dealing with it.
More information about the NumPy-Discussion