[Numpy-discussion] quickselect

Sun Jun 9 06:10:54 EDT 2013

On Wed, May 29, 2013 at 3:19 PM,  <josef.pktd at gmail.com> wrote:
> On Wed, May 29, 2013 at 12:25 PM, Julian Taylor
> <jtaylor.debian at googlemail.com> wrote:
>> On 05/29/2013 06:12 AM, josef.pktd at gmail.com wrote:
>>>
>>> On Tue, May 28, 2013 at 6:31 PM, Charles R Harris
>>> <charlesr.harris at gmail.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> There is a PR adding quickselect to numpy as a function `partition`.
>>>> Comments on name and exposure in the numpy API are welcome.
>>>
>>>
>>> I think the name is fine. It's possible to get used to it.
>>>
>>> One possible use I can think of is if I want to sort afterwards only
>>> one of the tails (like largest 15%).
>>>
>>> Limitations:
>>>
>>> medium: like percentiles it would be nice to partition into more than
>>> two sets, if that can be done more efficiently than calling the
>>> function twice.
>>
>>
>> I'm not aware of an algorithm that is more efficient to generate multiple
>> partitions opposed to partitioning twice.
>> Additional partitions can work on subranges so so smaller the remainign
>> range the faster it gets.
>> I did a coarse benchmark with percentile partitioning multiple times instead
>> of sorting once.
>> partition beats sorting even when you want to select 500 percentiles, so we
>> probably can change it completely.
>> If people frequently select even more we could add a heuristic cut off to
>> sort instead.
>
> I only know a few places in statistics where a medium large number of
> percentiles or partitions are used, most of the time it's either just
> a few or a full sort.
> This would work well then.
>
> Actually, I just thought about another application:
> If I want to do a coarse quantization of an array by percentiles, then
> I could partition it and assign categories, instead of doing a full
> sort. The number of bins could be large in this case.
>
>>
>>
>>>
>>> major:
>>> it won't work to calculate the median or other quantiles since those
>>> interpolate
>>> np.median: "When N is even, it is the average of the two middle values
>>> of V_sorted"
>>>
>>
>> this works fine, quickselect/partition moves the kth element into its final
>> sorted order, so to get the median of even elements you can just partition
>> twice, to get the two middle values. Then take the mean as usual.
>>
>> This is already implemented in the pull request.
>> It includes the additional optimization that the second middle value is just
>> a minimum of the second partition.
>>
>
> Thanks for the explanation. I was only looking for docstrings and
> didn't see the implementation of median.
> Using min on the second partitions looks like a nice trick, (one to remember).
>
> As a future user, I think this sounds good, faster median and
> percentiles and a nice new partition function.

Advertising for an other possible application for ``partition``

I'm just looking at trimmed means, where we want to remove the 10% or
20% largest and smallest values in an array, and base the mean on the
center 80% or 60% observations.

median is a special case with almost 50% trimming.

used for outlier detection and robust statistic (t-test with skewed
and heavy-tailed distributions)

This requires currently a full sort.

Josef

>
> Josef