Mailman 3 Proposal to add `weights` to `np.percentile` and `np.median` - NumPy-Discussion

Proposal to add `weights` to `np.percentile` and `np.median`

Joseph Fox-Rabinovitz

Feb. 16, 2016

5:49 a.m.

I would like to add a `weights` keyword to `np.partition`, `np.percentile` and `np.median`. My reason for doing so is to allow `np.histogram` to process automatic bin selection with weights. Currently, weights are not supported for the automatic bin selection and would be difficult to support in `auto` mode without having `np.percentile` support a `weights` keyword. I suspect that there are many other uses for such a feature.

I have taken a preliminary look at the C implementation of the partition functions that are the basis for `partition`, `median` and `percentile`. I think that it would be possible to add versions (or just extend the functionality of existing ones) that check the ratio of the weights below the partition point to the total sum of the weights instead of just counting elements.

One of the main advantages of such an implementation is that it would allow any real weights to be handled correctly, not just integers. Complex weights would not be supported.

The purpose of this email is to see if anybody objects, has ideas or cares at all about this proposal before I spend a significant amount of time working on it. For example, did I miss any functions in my list?

Regards,

-Joe
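To make the partition idea above concrete, here is a minimal pure-NumPy sketch of a selection loop that compares accumulated weight against a fraction of the total weight instead of counting elements. It is not the C implementation discussed in the proposal. The function name `weighted_select` and the quantile definition (the smallest value whose cumulative weight reaches `q` percent of the total, with no interpolation) are assumptions made for illustration only.

```python
import numpy as np

def weighted_select(values, weights, q, rng=None):
    """Hypothetical helper: return the smallest x in `values` such that the
    weights of all elements <= x sum to at least q percent of the total.
    NaNs are not handled in this sketch."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if values.ndim != 1 or values.shape != weights.shape or values.size == 0:
        raise ValueError("values and weights must be non-empty 1-D arrays of equal length")
    if np.any(weights < 0) or not weights.sum() > 0:
        raise ValueError("weights must be non-negative reals with a positive sum")
    rng = np.random.default_rng(0) if rng is None else rng

    target = q / 100.0 * weights.sum()
    below = 0.0  # weight already known to lie below every remaining candidate
    while True:
        pivot = values[rng.integers(values.size)]
        lo = values < pivot
        eq = values == pivot
        w_lo = weights[lo].sum()
        w_eq = weights[eq].sum()
        if lo.any() and below + w_lo >= target:
            # Enough weight sits strictly below the pivot: search that side.
            values, weights = values[lo], weights[lo]
        elif below + w_lo + w_eq >= target:
            # The pivot (and its ties) is where the target weight is reached.
            return float(pivot)
        else:
            hi = values > pivot
            if not hi.any():
                # Guard against round-off when q is 100.
                return float(pivot)
            below += w_lo + w_eq
            values, weights = values[hi], weights[hi]

# Real-valued weights work the same way as integer counts.
print(weighted_select([10, 20, 30], [0.25, 0.5, 0.25], 50))
```

Each pass discards either the pivot and everything above it or the pivot and everything below it, so the loop terminates, and like quickselect it avoids fully sorting the data.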


Antony Lee

February 2016

6:32 p.m.


See earlier discussion here: https://github.com/numpy/numpy/issues/6326

Basically, naïvely sorting may be faster than a not-so-optimized version of quickselect.

Antony
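The sort-versus-selection trade-off mentioned above can be measured directly. The following is a rough timing sketch for the unweighted case only, comparing a full `np.sort` with `np.partition`. The array size and repeat count are arbitrary choices, and the outcome depends on the machine and the NumPy build, so no expected timings are given.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_001)  # odd length, so the median is a single element

def median_by_sort(a):
    # Full sort, then pick the middle element.
    return np.sort(a)[a.size // 2]

def median_by_partition(a):
    # Selection only: the element at index k ends up in its sorted position.
    k = a.size // 2
    return np.partition(a, k)[k]

# Both approaches must agree on the value before timing them.
assert median_by_sort(data) == median_by_partition(data)

t_sort = timeit.timeit(lambda: median_by_sort(data), number=20)
t_part = timeit.timeit(lambda: median_by_partition(data), number=20)
print(f"sort: {t_sort:.4f} s, partition: {t_part:.4f} s (20 runs each)")
```

A weighted variant would need the same kind of measurement, since an unoptimized selection loop may lose to a well-tuned sort.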

Joseph Fox-Rabinovitz

6:41 p.m.


Thanks for pointing me to that. I had something a bit different in mind but that definitely looks like a good start.

josef.pktd＠gmail.com

7:39 p.m.


statsmodels just got weighted quantiles: https://github.com/statsmodels/statsmodels/pull/2707

I didn't try to figure out its computational efficiency, and we would gladly delegate to whatever fast algorithm would be in numpy.

Josef

Joseph Fox-Rabinovitz

7:48 p.m.


Please correct me if I misunderstood, but the code in that commit is doing a full sort, somewhat similar to what `scipy.stats.scoreatpercentile` does. If that is correct, I will run some benchmarks first, but I think there is value in going forward with a numpy version that extends the current partitioning scheme.

- Joe
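For comparison with a partition-based scheme, a full-sort weighted percentile can be sketched in a few lines. This is not the statsmodels or SciPy code. The name `weighted_percentile_sorted` is hypothetical, and the quantile definition (smallest value whose cumulative weight reaches `q` percent of the total, without interpolation) is an assumption chosen to keep the example short.

```python
import numpy as np

def weighted_percentile_sorted(values, weights, q):
    """Hypothetical sort-based baseline: O(n log n) because of the argsort."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")
    sorted_values = values[order]
    # Running total of weight in ascending order of value.
    cum_weights = np.cumsum(weights[order])
    target = np.asarray(q, dtype=float) / 100.0 * cum_weights[-1]
    # First position whose cumulative weight reaches the target.
    idx = np.searchsorted(cum_weights, target, side="left")
    idx = np.clip(idx, 0, sorted_values.size - 1)
    return sorted_values[idx]

# q may be a scalar or a sequence of percentiles.
print(weighted_percentile_sorted([3, 1, 2, 4], [5, 1, 1, 1], [25, 50, 100]))
```

A baseline like this is useful as a correctness reference when benchmarking any faster selection-based version.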

josef.pktd＠gmail.com

8:22 p.m.


I think so, but it's hiding inside pandas groupby, which also uses a hash, IIUC.

AFAICS, the main reason it's implemented this way is to get correct tie handling. There could be large performance differences depending on whether there are many ties (discretized data) or only unique floats. (just guessing)

Josef
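One way to see what tie handling involves is to pool the weights of equal values before accumulating them. The sketch below uses `np.unique` and `np.bincount` as a stand-in for a hash-based grouping. It does not reproduce the pandas or statsmodels code, and the name `weighted_percentile_grouped` and the no-interpolation quantile definition are assumptions. For integer weights the result can be checked against explicitly repeating each value.

```python
import math
import numpy as np

def weighted_percentile_grouped(values, weights, q):
    """Hypothetical helper: pool the weights of tied values, then find the
    first distinct value whose cumulative weight reaches q percent."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # uniq is sorted; inverse maps each element to its distinct value.
    uniq, inverse = np.unique(values, return_inverse=True)
    group_weights = np.bincount(inverse.ravel(), weights=weights)
    cum_weights = np.cumsum(group_weights)
    idx = np.searchsorted(cum_weights, q / 100.0 * cum_weights[-1], side="left")
    return float(uniq[min(idx, uniq.size - 1)])

def repeated_percentile(values, counts, q):
    """Reference for integer weights: repeat each value `count` times."""
    expanded = np.sort(np.repeat(values, counts))
    k = max(math.ceil(q / 100.0 * expanded.size) - 1, 0)
    return float(expanded[k])

values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0])  # discretized data with ties
counts = np.array([1, 1, 2, 1, 1, 2])
for q in (25, 50, 75):
    assert weighted_percentile_grouped(values, counts, q) == repeated_percentile(values, counts, q)
```

With heavily discretized data the cumulative sum runs over only the distinct values, which is one plausible source of the performance differences guessed at above.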
