hi, once again I want to bring up the median algorithm which is implemented in terms of sorting in numpy. median (and percentile and a couple more functions) can be more efficiently implemented in terms of a selection algorithm. The complexity can them be linear instead of linearithmic. I found numerous discussions of this in the list archives [1, 2, 3] but I did not find why those attempts failed, the threads all just seemed to stop. Did the previous attempts fail due to lack of time or was there a fundamental reason blocking this change? In the hope of the former, I went ahead and implemented a prototype of a partition function (similar to [3] but only one argument) and implemented median in terms of it. partition not like C++ partition, its equivalent to nth_element in C++, maybe its better to name it nth_element? The code is available here: https://github.com/juliantaylor/numpy/tree/select-median the partition interface is: ndarray.partition(kth, axis=-1) kth is an integer The array is transformed so the k-th element of the array is in its final sorted order, all below are smaller all above are greater, but the ordering is undefined Example: In [1]: d = np.arange(10); np.random.shuffle(d) In [2]: d Out[2]: array([1, 7, 0, 2, 5, 6, 8, 9, 3, 4]) In [3]: np.partition(d, 3) Out[3]: array([0, 1, 2, 3, 4, 6, 8, 9, 7, 5]) In [4]: _[3] == 3 Out[5]: True the performance of median improves as expected: old vs new, 5000, uniform shuffled, out of place: 100us vs 40us old vs new, 50000, uniform shuffled, out of place: 1.12ms vs 0.265ms old vs new, 500000, uniform shuffled, out of place: 14ms vs 2.81ms The implementation is very much still a prototype, apartition is not exposed (and only implemented as a quicksort) and there is only one algorithm (quickselect). One could still add median of medians for better worst case performance. If no blockers appear I want to fix this up and file a pull request to have this in numpy 1.8. Guidance on details of implementation in numpys C api is highly appreciated, its the first time I'm dealing with it. Cheers, Julian Taylor [1] http://thread.gmane.org/gmane.comp.python.numeric.general/50931/focus=50941 [2] http://thread.gmane.org/gmane.comp.python.numeric.general/32507/focus=41716 [3] http://thread.gmane.org/gmane.comp.python.numeric.general/32341/focus=32348
On 18.05.2013 08:12, Julian Taylor wrote:
hi,
once again I want to bring up the median algorithm which is implemented in terms of sorting in numpy. median (and percentile and a couple more functions) can be more efficiently implemented in terms of a selection algorithm. The complexity can them be linear instead of linearithmic.
I filed a work in progress pull request with the code, it now includes argpartition and crude docstrings: https://github.com/numpy/numpy/pull/3360 please give it a try.
participants (1)
-
Julian Taylor