Re: [Numpy-discussion] quickselect

9 Jun 2013

      On Wed, May 29, 2013 at 3:19 PM,  <josef.pktd@gmail.com> wrote:
...
On Wed, May 29, 2013 at 12:25 PM, Julian Taylor
<jtaylor.debian@googlemail.com> wrote:
...
On 05/29/2013 06:12 AM, josef.pktd@gmail.com wrote:
...
On Tue, May 28, 2013 at 6:31 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
...
Hi All,
There is a PR adding quickselect to numpy as a function `partition`.
Comments on name and exposure in the numpy API are welcome.
I think the name is fine. It's possible to get used to it.
One possible use I can think of is if I want to sort afterwards only
one of the tails (like largest 15%).
Limitations:
medium: like percentiles it would be nice to partition into more than
two sets, if that can be done more efficiently than calling the
function twice.
I'm not aware of an algorithm that is more efficient to generate multiple
partitions opposed to partitioning twice.
Additional partitions can work on subranges so so smaller the remainign
range the faster it gets.
I did a coarse benchmark with percentile partitioning multiple times instead
of sorting once.
partition beats sorting even when you want to select 500 percentiles, so we
probably can change it completely.
If people frequently select even more we could add a heuristic cut off to
sort instead.
I only know a few places in statistics where a medium large number of
percentiles or partitions are used, most of the time it's either just
a few or a full sort.
This would work well then.
Actually, I just thought about another application:
If I want to do a coarse quantization of an array by percentiles, then
I could partition it and assign categories, instead of doing a full
sort. The number of bins could be large in this case.
...
...
major:
it won't work to calculate the median or other quantiles since those
interpolate
np.median: "When N is even, it is the average of the two middle values
of V_sorted"
this works fine, quickselect/partition moves the kth element into its final
sorted order, so to get the median of even elements you can just partition
twice, to get the two middle values. Then take the mean as usual.
This is already implemented in the pull request.
It includes the additional optimization that the second middle value is just
a minimum of the second partition.
Thanks for the explanation. I was only looking for docstrings and
didn't see the implementation of median.
Using min on the second partitions looks like a nice trick, (one to remember).
As a future user, I think this sounds good, faster median and
percentiles and a nice new partition function.
Advertising for an other possible application for ``partition``

I'm just looking at trimmed means, where we want to remove the 10% or
20% largest and smallest values in an array, and base the mean on the
center 80% or 60% observations.

median is a special case with almost 50% trimming.

used for outlier detection and robust statistic (t-test with skewed
and heavy-tailed distributions)

This requires currently a full sort.

Josef
...
Josef

Re: [Numpy-discussion] quickselect

josef.pktd＠gmail.com