[Numpy-discussion] faster (selection based) median, 2013 edition

Julian Taylor jtaylor.debian at googlemail.com
Sat May 18 02:12:45 EDT 2013


hi,

once again I want to bring up the median algorithm which is implemented
in terms of sorting in numpy.
median (and percentile and a couple more functions) can be more
efficiently implemented in terms of a selection algorithm. The
complexity can them be linear instead of linearithmic.

I found numerous discussions of this in the list archives [1, 2, 3] but
I did not find why those attempts failed, the threads all just seemed to
stop.
Did the previous attempts fail due to lack of time or was there a
fundamental reason blocking this change?

In the hope of the former, I went ahead and implemented a prototype of a
partition function (similar to [3] but only one argument) and
implemented median in terms of it.
partition not like C++ partition, its equivalent to nth_element in C++,
maybe its better to name it nth_element?

The code is available here:
https://github.com/juliantaylor/numpy/tree/select-median

the partition interface is:
ndarray.partition(kth, axis=-1)
kth is an integer
The array is transformed so the k-th element of the array is in its
final sorted order, all below are smaller all above are greater, but the
ordering is undefined

Example:
In [1]: d = np.arange(10); np.random.shuffle(d)
In [2]: d
Out[2]: array([1, 7, 0, 2, 5, 6, 8, 9, 3, 4])
In [3]: np.partition(d, 3)
Out[3]: array([0, 1, 2, 3, 4, 6, 8, 9, 7, 5])
In [4]:  _[3] == 3
Out[5]: True

the performance of median improves as expected:
old vs new, 5000, uniform shuffled, out of place:
100us vs 40us
old vs new, 50000, uniform shuffled, out of place:
1.12ms vs 0.265ms
old vs new, 500000, uniform shuffled, out of place:
14ms vs 2.81ms

The implementation is very much still a prototype, apartition is not
exposed (and only implemented as a quicksort) and there is only one
algorithm (quickselect). One could still add median of medians for
better worst case performance.

If no blockers appear I want to fix this up and file a pull request to
have this in numpy 1.8.
Guidance on details of implementation in numpys C api is highly
appreciated, its the first time I'm dealing with it.

Cheers,
Julian Taylor

[1]
http://thread.gmane.org/gmane.comp.python.numeric.general/50931/focus=50941
[2]
http://thread.gmane.org/gmane.comp.python.numeric.general/32507/focus=41716
[3]
http://thread.gmane.org/gmane.comp.python.numeric.general/32341/focus=32348



More information about the NumPy-Discussion mailing list