[Numpy-discussion] quickselect

josef.pktd at gmail.com josef.pktd at gmail.com
Wed May 29 15:19:57 EDT 2013


On Wed, May 29, 2013 at 12:25 PM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On 05/29/2013 06:12 AM, josef.pktd at gmail.com wrote:
>>
>> On Tue, May 28, 2013 at 6:31 PM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> There is a PR adding quickselect to numpy as a function `partition`.
>>> Comments on name and exposure in the numpy API are welcome.
>>
>>
>> I think the name is fine. It's possible to get used to it.
>>
>> One possible use I can think of is if I want to sort afterwards only
>> one of the tails (like largest 15%).
>>
>> Limitations:
>>
>> medium: like percentiles it would be nice to partition into more than
>> two sets, if that can be done more efficiently than calling the
>> function twice.
>
>
> I'm not aware of an algorithm that is more efficient to generate multiple
> partitions opposed to partitioning twice.
> Additional partitions can work on subranges so so smaller the remainign
> range the faster it gets.
> I did a coarse benchmark with percentile partitioning multiple times instead
> of sorting once.
> partition beats sorting even when you want to select 500 percentiles, so we
> probably can change it completely.
> If people frequently select even more we could add a heuristic cut off to
> sort instead.

I only know a few places in statistics where a medium large number of
percentiles or partitions are used, most of the time it's either just
a few or a full sort.
This would work well then.

Actually, I just thought about another application:
If I want to do a coarse quantization of an array by percentiles, then
I could partition it and assign categories, instead of doing a full
sort. The number of bins could be large in this case.

>
>
>>
>> major:
>> it won't work to calculate the median or other quantiles since those
>> interpolate
>> np.median: "When N is even, it is the average of the two middle values
>> of V_sorted"
>>
>
> this works fine, quickselect/partition moves the kth element into its final
> sorted order, so to get the median of even elements you can just partition
> twice, to get the two middle values. Then take the mean as usual.
>
> This is already implemented in the pull request.
> It includes the additional optimization that the second middle value is just
> a minimum of the second partition.
>

Thanks for the explanation. I was only looking for docstrings and
didn't see the implementation of median.
Using min on the second partitions looks like a nice trick, (one to remember).

As a future user, I think this sounds good, faster median and
percentiles and a nice new partition function.

Josef



More information about the NumPy-Discussion mailing list