![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
Hi All, There is a PR <https://github.com/numpy/numpy/pull/3360> adding quickselect to numpy as a function `partition`. Comments on name and exposure in the numpy API are welcome. Chuck
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, May 28, 2013 at 6:31 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
I think the name is fine. It's possible to get used to it. One possible use I can think of is if I want to sort afterwards only one of the tails (like largest 15%). Limitations: medium: like percentiles it would be nice to partition into more than two sets, if that can be done more efficiently than calling the function twice. major: it won't work to calculate the median or other quantiles since those interpolate np.median: "When N is even, it is the average of the two middle values of V_sorted" AFAICS based on the docstring (I closed my eyes when the c code scrolled by) Josef
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
On 05/29/2013 06:12 AM, josef.pktd@gmail.com wrote:
I'm not aware of an algorithm that is more efficient to generate multiple partitions opposed to partitioning twice. Additional partitions can work on subranges so so smaller the remainign range the faster it gets. I did a coarse benchmark with percentile partitioning multiple times instead of sorting once. partition beats sorting even when you want to select 500 percentiles, so we probably can change it completely. If people frequently select even more we could add a heuristic cut off to sort instead.
this works fine, quickselect/partition moves the kth element into its final sorted order, so to get the median of even elements you can just partition twice, to get the two middle values. Then take the mean as usual. This is already implemented in the pull request. It includes the additional optimization that the second middle value is just a minimum of the second partition.
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Wed, May 29, 2013 at 12:25 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
I only know a few places in statistics where a medium large number of percentiles or partitions are used, most of the time it's either just a few or a full sort. This would work well then. Actually, I just thought about another application: If I want to do a coarse quantization of an array by percentiles, then I could partition it and assign categories, instead of doing a full sort. The number of bins could be large in this case.
Thanks for the explanation. I was only looking for docstrings and didn't see the implementation of median. Using min on the second partitions looks like a nice trick, (one to remember). As a future user, I think this sounds good, faster median and percentiles and a nice new partition function. Josef
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Wed, May 29, 2013 at 3:19 PM, <josef.pktd@gmail.com> wrote:
Advertising for an other possible application for ``partition`` I'm just looking at trimmed means, where we want to remove the 10% or 20% largest and smallest values in an array, and base the mean on the center 80% or 60% observations. median is a special case with almost 50% trimming. used for outlier detection and robust statistic (t-test with skewed and heavy-tailed distributions) This requires currently a full sort. Josef
Josef
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
On 09.06.2013 12:10, josef.pktd@gmail.com wrote:
here a a quick status report on the PR https://github.com/numpy/numpy/pull/3360 I now implemented partitioning via the introselect algorithm which is a quickselect median of 3 pivot with a cutoff on recursion depth to a median of median of 5 pivot for O(N) worst case complexity. Additionally it can stores its pivots for reuse in other partitions on the same data to reduce the space required to be partitioned next time, this is useful e.g. for multiple percentiles. It is functionally ready, but there are still some API/ABI issues to be solved. Mainly deciding if we put the selection algorithms in the ABI for 1.8 or not. Currently the ABI is unchanged so user types cannot make use of the algorithms (they will fall back to quicksort). The python api is now: np.partition(array, kth) where kth is an integer or array of integers it will move each index in kth into its final sorted position, so np.partition(a, range(a.size)) results in a (inefficient) sort. e.g.: d = np.array([66, 81, 21, 75, 46, -6, 66, 86, 242, 47, 88, 79]) np.partition(d, (2, -2)) # (2, 8) array([ -6, 21, 46, 47, 75, 66, 66, 79, 81, 86, 88, 242]) Multidimensional arrays will use the same array of kth, you cannot partition each axis by different values, you would have to explicitly loop to do that. Median is implemented in terms of partitioning already, but percentile is not. I would suggest someone else than me gives a try at implementing percentile in terms of partition to see if the documentation and api make sense to others.
![](https://secure.gravatar.com/avatar/cd71bb17a5ef04a06383cdcde1f370ad.jpg?s=120&d=mm&r=g)
On Jun 9, 2013, at 11:27 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
Julian, Since I am the author of the current percentile PR (https://github.com/numpy/numpy/pull/2970), I'm willing to try reimplementing percentile with the new partition functionality. I don't expect to have time to do this until the Scipy conference in two week, so if anyone else wants to give the implementation a try please feel free. Julian will you be at Scipy this year if I have any questions? Cheers, - Jonathan Helmus
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
On 11.06.2013 14:37, Jonathan J. Helmus wrote:
I wont be at the Scipy in June, but I can be reached via email or IRC (jtaylor on freenode and oftc) in the evening (UTC). btw. you don't need the minimum "trick" currently used in the branch for even element medians any more, so don't copy it for percentile. partition is specialized for small kth elements to be almost as fast as minimum, sometimes even faster if iterative partitioning is used (e.g. np.partition(d, (mid, mid + 1)))
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
Hi, a selection algorithm [0] has now landed in the numpy development branch [1]. The function exposing it is: numpy.partition(data, kth=int/array, axis=-1, kind="introselect", order=None) Please see the docstrings on what it actually does (and report if they are confusing). Thanks to the numpy developers for the review and thanks to all who provided comments. If you have a program which might benefit from using selection instead of sorting please try it out and report if you are happy with its result and the api. As first function in numpy median has been converted to use partition so it now scales linear with the datasize instead of linearithmic. You can probably expect a five times speedup for array sizes around 1e6. But this also involves a slight change in behavior for the case where you use overwrite_input=True In the past this sorted the full array, now it will only be partially sorted. That the array is sorted was always documented as an implementation detail so hopefully that won't break any code. The next function to be adapted will most likely be numpy.percentile. [0] http://en.wikipedia.org/wiki/Selection_algorithm [1] https://github.com/numpy/numpy/pull/3360
![](https://secure.gravatar.com/avatar/af6c39d6943bd4b0e1fde23161e7bb8c.jpg?s=120&d=mm&r=g)
Hi Julian On Mon, Aug 12, 2013 at 4:23 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
This looks great, thanks very much! A minor bug was introduced into the Bento build: https://github.com/numpy/numpy/pull/3606 https://github.com/numpy/numpy/pull/3607 Stéfan
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, May 28, 2013 at 6:31 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
I think the name is fine. It's possible to get used to it. One possible use I can think of is if I want to sort afterwards only one of the tails (like largest 15%). Limitations: medium: like percentiles it would be nice to partition into more than two sets, if that can be done more efficiently than calling the function twice. major: it won't work to calculate the median or other quantiles since those interpolate np.median: "When N is even, it is the average of the two middle values of V_sorted" AFAICS based on the docstring (I closed my eyes when the c code scrolled by) Josef
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
On 05/29/2013 06:12 AM, josef.pktd@gmail.com wrote:
I'm not aware of an algorithm that is more efficient to generate multiple partitions opposed to partitioning twice. Additional partitions can work on subranges so so smaller the remainign range the faster it gets. I did a coarse benchmark with percentile partitioning multiple times instead of sorting once. partition beats sorting even when you want to select 500 percentiles, so we probably can change it completely. If people frequently select even more we could add a heuristic cut off to sort instead.
this works fine, quickselect/partition moves the kth element into its final sorted order, so to get the median of even elements you can just partition twice, to get the two middle values. Then take the mean as usual. This is already implemented in the pull request. It includes the additional optimization that the second middle value is just a minimum of the second partition.
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Wed, May 29, 2013 at 12:25 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
I only know a few places in statistics where a medium large number of percentiles or partitions are used, most of the time it's either just a few or a full sort. This would work well then. Actually, I just thought about another application: If I want to do a coarse quantization of an array by percentiles, then I could partition it and assign categories, instead of doing a full sort. The number of bins could be large in this case.
Thanks for the explanation. I was only looking for docstrings and didn't see the implementation of median. Using min on the second partitions looks like a nice trick, (one to remember). As a future user, I think this sounds good, faster median and percentiles and a nice new partition function. Josef
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Wed, May 29, 2013 at 3:19 PM, <josef.pktd@gmail.com> wrote:
Advertising for an other possible application for ``partition`` I'm just looking at trimmed means, where we want to remove the 10% or 20% largest and smallest values in an array, and base the mean on the center 80% or 60% observations. median is a special case with almost 50% trimming. used for outlier detection and robust statistic (t-test with skewed and heavy-tailed distributions) This requires currently a full sort. Josef
Josef
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
On 09.06.2013 12:10, josef.pktd@gmail.com wrote:
here a a quick status report on the PR https://github.com/numpy/numpy/pull/3360 I now implemented partitioning via the introselect algorithm which is a quickselect median of 3 pivot with a cutoff on recursion depth to a median of median of 5 pivot for O(N) worst case complexity. Additionally it can stores its pivots for reuse in other partitions on the same data to reduce the space required to be partitioned next time, this is useful e.g. for multiple percentiles. It is functionally ready, but there are still some API/ABI issues to be solved. Mainly deciding if we put the selection algorithms in the ABI for 1.8 or not. Currently the ABI is unchanged so user types cannot make use of the algorithms (they will fall back to quicksort). The python api is now: np.partition(array, kth) where kth is an integer or array of integers it will move each index in kth into its final sorted position, so np.partition(a, range(a.size)) results in a (inefficient) sort. e.g.: d = np.array([66, 81, 21, 75, 46, -6, 66, 86, 242, 47, 88, 79]) np.partition(d, (2, -2)) # (2, 8) array([ -6, 21, 46, 47, 75, 66, 66, 79, 81, 86, 88, 242]) Multidimensional arrays will use the same array of kth, you cannot partition each axis by different values, you would have to explicitly loop to do that. Median is implemented in terms of partitioning already, but percentile is not. I would suggest someone else than me gives a try at implementing percentile in terms of partition to see if the documentation and api make sense to others.
![](https://secure.gravatar.com/avatar/cd71bb17a5ef04a06383cdcde1f370ad.jpg?s=120&d=mm&r=g)
On Jun 9, 2013, at 11:27 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
Julian, Since I am the author of the current percentile PR (https://github.com/numpy/numpy/pull/2970), I'm willing to try reimplementing percentile with the new partition functionality. I don't expect to have time to do this until the Scipy conference in two week, so if anyone else wants to give the implementation a try please feel free. Julian will you be at Scipy this year if I have any questions? Cheers, - Jonathan Helmus
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
On 11.06.2013 14:37, Jonathan J. Helmus wrote:
I wont be at the Scipy in June, but I can be reached via email or IRC (jtaylor on freenode and oftc) in the evening (UTC). btw. you don't need the minimum "trick" currently used in the branch for even element medians any more, so don't copy it for percentile. partition is specialized for small kth elements to be almost as fast as minimum, sometimes even faster if iterative partitioning is used (e.g. np.partition(d, (mid, mid + 1)))
![](https://secure.gravatar.com/avatar/c0da24f75f763b6bac90b519064f30b3.jpg?s=120&d=mm&r=g)
Hi, a selection algorithm [0] has now landed in the numpy development branch [1]. The function exposing it is: numpy.partition(data, kth=int/array, axis=-1, kind="introselect", order=None) Please see the docstrings on what it actually does (and report if they are confusing). Thanks to the numpy developers for the review and thanks to all who provided comments. If you have a program which might benefit from using selection instead of sorting please try it out and report if you are happy with its result and the api. As first function in numpy median has been converted to use partition so it now scales linear with the datasize instead of linearithmic. You can probably expect a five times speedup for array sizes around 1e6. But this also involves a slight change in behavior for the case where you use overwrite_input=True In the past this sorted the full array, now it will only be partially sorted. That the array is sorted was always documented as an implementation detail so hopefully that won't break any code. The next function to be adapted will most likely be numpy.percentile. [0] http://en.wikipedia.org/wiki/Selection_algorithm [1] https://github.com/numpy/numpy/pull/3360
![](https://secure.gravatar.com/avatar/af6c39d6943bd4b0e1fde23161e7bb8c.jpg?s=120&d=mm&r=g)
Hi Julian On Mon, Aug 12, 2013 at 4:23 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
This looks great, thanks very much! A minor bug was introduced into the Bento build: https://github.com/numpy/numpy/pull/3606 https://github.com/numpy/numpy/pull/3607 Stéfan
participants (5)
-
Charles R Harris
-
Jonathan J. Helmus
-
josef.pktd@gmail.com
-
Julian Taylor
-
Stéfan van der Walt