Mailman 3 Pull Request Review: R-like sample function - NumPy-Discussion

newer
Hiring a junior developer on the...

Pull Request Review: R-like sample function

older
PyBUF_SIMPLE/PyBUF_FORMAT: casts...

Christopher Jordan-Squire

1 Sep 2011 1 Sep '11

6:02 p.m.

Hi--I've just submitted a numpy 2.0 pull request for a function sample in np.random. It's essentially an implementation of R's sample function. It allows possibly non-uniform, possibly without-replacement sampling from a given 1-D array-like. This is very useful for quickly and cleanly creating samples from, for example, a list of strings or a list of non-contiguous, non-evenly spaced integers. Both occur in data analysis with categorical data. It is, essentially, a convenience function that wraps a number of existing ways to take a random sample. I think it belongs in numpy.random rather than scipy.stats because it's just a random sampler, rather than a probability distribution. It isn't possible to define a scipy.stats discrete random variable on strings--it would have to instead be done on the indices of the list containing the possible samples. And (as far as I can tell) the scipy.stats distributions can't be used for sampling without replacement. https://github.com/numpy/numpy/pull/151 -Chris JS

Show replies by date

josef.pktd＠gmail.com

1 Sep 1 Sep

10:01 p.m.

On Thu, Sep 1, 2011 at 6:02 PM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

Hi--I've just submitted a numpy 2.0 pull request for a function sample in np.random. It's essentially an implementation of R's sample function. It allows possibly non-uniform, possibly without-replacement sampling from a given 1-D array-like. This is very useful for quickly and cleanly creating samples from, for example, a list of strings or a list of non-contiguous, non-evenly spaced integers. Both occur in data analysis with categorical data.

It is, essentially, a convenience function that wraps a number of existing ways to take a random sample. I think it belongs in numpy.random rather than scipy.stats because it's just a random sampler, rather than a probability distribution. It isn't possible to define a scipy.stats discrete random variable on strings--it would have to instead be done on the indices of the list containing the possible samples. And (as far as I can tell) the scipy.stats distributions can't be used for sampling without replacement.

https://github.com/numpy/numpy/pull/151

I don't think you can kill numpy.random.random and similar mixed in with an adding a new function commit. First these functions would need to be deprecated. "it does not break the API as the previous function was not in the docs" This is a doc bug, I assume. I don't think it means users/developers don't rely on it. searching for np.random.random shows 120 threads in my gmail reader, python uses random.random() dir(np.random) shows it I copied it from mailing list examples. It's used quite a bit in scipy, as I saw because of your work. I also find the historical multiplicity of aliases confusing, but which names should be deprecated would at least require a discussion and a separate commit. Josef

...

-Chris JS _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Christopher Jordan-Squire

10:39 p.m.

On Thu, Sep 1, 2011 at 10:01 PM, <josef.pktd@gmail.com> wrote:

...

On Thu, Sep 1, 2011 at 6:02 PM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
Hi--I've just submitted a numpy 2.0 pull request for a function sample in np.random. It's essentially an implementation of R's sample function. It allows possibly non-uniform, possibly without-replacement sampling from a given 1-D array-like. This is very useful for quickly and cleanly creating samples from, for example, a list of strings or a list of non-contiguous, non-evenly spaced integers. Both occur in data analysis with categorical data.

It is, essentially, a convenience function that wraps a number of existing ways to take a random sample. I think it belongs in numpy.random rather than scipy.stats because it's just a random sampler, rather than a probability distribution. It isn't possible to define a scipy.stats discrete random variable on strings--it would have to instead be done on the indices of the list containing the possible samples. And (as far as I can tell) the scipy.stats distributions can't be used for sampling without replacement.

https://github.com/numpy/numpy/pull/151

I don't think you can kill numpy.random.random and similar mixed in with an adding a new function commit.

Killjoy.

...

First these functions would need to be deprecated.

I discussed this with a few other people, and they suggested that it could be alright since it's for numpy 2.0 rather than numpy 1.x. For the 2.0 version it would be perfectly reasonable to have a break with the API. (Though, as I said, it's not a break with the API.)

...

"it does not break the API as the previous function was not in the docs"

This is a doc bug, I assume. I don't think it means users/developers don't rely on it.

You apparently don't subscribe to the view that the API is infallible revelation revealed. (That's a joke, if it's not obvious.)

...

searching for np.random.random shows 120 threads in my gmail reader, python uses random.random() dir(np.random) shows it I copied it from mailing list examples. It's used quite a bit in scipy, as I saw because of your work.

I also find the historical multiplicity of aliases confusing, but which names should be deprecated would at least require a discussion and a separate commit.

Josef

I hadn't thought about the random.random connection. I'm fine with leaving random.random as an alias for random.random_sample. I just wanted to claim random.sample for my own function. I can't think of many other instances of aliased functions like that in numpy, though--but perhaps I'm not thinking hard enough. It certainly seemed strange to have 4 names for the same function. Now that you mention the standard library connection, my use of random.sample seems more in line with the standard library random package than the alias to random.random_sample. Though that would suggest using the default replace=True, which I'd prefer not to do. -Chris JS

...

...
-Chris JS _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Robert Kern

10:48 p.m.

On Thu, Sep 1, 2011 at 21:39, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

On Thu, Sep 1, 2011 at 10:01 PM, <josef.pktd@gmail.com> wrote:

...

...
First these functions would need to be deprecated.

I discussed this with a few other people, and they suggested that it could be alright since it's for numpy 2.0 rather than numpy 1.x. For the 2.0 version it would be perfectly reasonable to have a break with the API. (Though, as I said, it's not a break with the API.)

Yes it is. A very long-standing API. The fact that you had to go remove a number of actual uses of the aliases should have told you this. The documentation is not the API. You cannot remove these aliases without a deprecation period lasting one full minor release. 2.0 is not license to make backwards-incompatible changes solely for aesthetic reasons. There is no reason not to follow the standard deprecation schedule here.

...

I can't think of many other instances of aliased functions like that in numpy, though--but perhaps I'm not thinking hard enough. It certainly seemed strange to have 4 names for the same function.

numpy.random was actually replacing multiple libraries at once. The aliases kind of accreted. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Christopher Jordan-Squire

11:07 p.m.

On Thu, Sep 1, 2011 at 10:48 PM, Robert Kern <robert.kern@gmail.com> wrote:

...

On Thu, Sep 1, 2011 at 21:39, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
On Thu, Sep 1, 2011 at 10:01 PM, <josef.pktd@gmail.com> wrote:

...
...
First these functions would need to be deprecated.

I discussed this with a few other people, and they suggested that it could be alright since it's for numpy 2.0 rather than numpy 1.x. For the 2.0 version it would be perfectly reasonable to have a break with the API. (Though, as I said, it's not a break with the API.)

Yes it is. A very long-standing API. The fact that you had to go remove a number of actual uses of the aliases should have told you this. The documentation is not the API. You cannot remove these aliases without a deprecation period lasting one full minor release. 2.0 is not license to make backwards-incompatible changes solely for aesthetic reasons. There is no reason not to follow the standard deprecation schedule here.

Then I was misinformed, and I hadn't realized it was a long-standing use. I suspected it might be problematic for np.random.random. But np.random.sample was only used once, and np.random.ranf not at all. Those two didn't appear in the statsmodels code base either, I don't think. (And random.random only appeared once in a docstring.) So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken? -Chris

...

...
I can't think of many other instances of aliased functions like that in numpy, though--but perhaps I'm not thinking hard enough. It certainly seemed strange to have 4 names for the same function.

numpy.random was actually replacing multiple libraries at once. The aliases kind of accreted.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Robert Kern

11:14 p.m.

On Thu, Sep 1, 2011 at 22:07, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken?

If you default to size=1 (which you probably should anyways), then np.random.choice() makes sense, in analogy to random.choice() from the standard library. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Christopher Jordan-Squire

11:31 p.m.

On Thu, Sep 1, 2011 at 11:14 PM, Robert Kern <robert.kern@gmail.com> wrote:

...

On Thu, Sep 1, 2011 at 22:07, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken?

If you default to size=1 (which you probably should anyways), then np.random.choice() makes sense, in analogy to random.choice() from the standard library.

Alright. I can make that change tomorrow. I'd prefer np.sample in the long-run, for compatibility with R. (False friends are loathsome things.) How does one petition to get function names deprecated? -Chris JS

...

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Robert Kern

11:38 p.m.

On Thu, Sep 1, 2011 at 22:31, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

On Thu, Sep 1, 2011 at 11:14 PM, Robert Kern <robert.kern@gmail.com> wrote:

...
On Thu, Sep 1, 2011 at 22:07, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken?

If you default to size=1 (which you probably should anyways), then np.random.choice() makes sense, in analogy to random.choice() from the standard library.

Alright. I can make that change tomorrow. I'd prefer np.sample in the long-run, for compatibility with R. (False friends are loathsome things.)

We have a number of software packages we might want to be compatible with and a limited English vocabulary. There is always going to be some overlap with someone that is misleading in some contexts. It's a concern, but not an overriding one.

...

How does one petition to get function names deprecated?

Ask here. Obtain general agreement. Write a patch. Not necessarily in that order. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Nathaniel Smith

11:55 p.m.

On Thu, Sep 1, 2011 at 8:31 PM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

On Thu, Sep 1, 2011 at 11:14 PM, Robert Kern <robert.kern@gmail.com> wrote:

...
On Thu, Sep 1, 2011 at 22:07, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken?

If you default to size=1 (which you probably should anyways), then np.random.choice() makes sense, in analogy to random.choice() from the standard library.

Alright. I can make that change tomorrow. I'd prefer np.sample in the long-run, for compatibility with R. (False friends are loathsome things.) How does one petition to get function names deprecated?

I was about to argue that "random.choice" was a better name anyway, but then I remembered that the standard library "random.sample" exists and does something similar. So instead I'd like to argue that making this compatible with Python is more important than making it compatible with R :-). Specifically, 'np.random.sample(array_like, k)', with no further arguments, should perform random sampling *without* replacement. Ideally it should also efficiently handle the case where array_like is an xrange object. Docs are here: http://docs.python.org/library/random.html -- Nathaniel

Christopher Jordan-Squire

2 Sep 2 Sep

1:14 p.m.

I made the changes discussed here and pushed them to pull request. https://github.com/numpy/numpy/pull/143#issuecomment-1980897 I changed the new function's name from sample to choice and added the size=1 default as Robert suggested. I also reverted all the changes for sample, random, and ranf and added them to the reference docs. -Chris JS On Thu, Sep 1, 2011 at 10:55 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Thu, Sep 1, 2011 at 8:31 PM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
On Thu, Sep 1, 2011 at 11:14 PM, Robert Kern <robert.kern@gmail.com> wrote:

...
On Thu, Sep 1, 2011 at 22:07, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken?

If you default to size=1 (which you probably should anyways), then np.random.choice() makes sense, in analogy to random.choice() from the standard library.

Alright. I can make that change tomorrow. I'd prefer np.sample in the long-run, for compatibility with R. (False friends are loathsome things.) How does one petition to get function names deprecated?

I was about to argue that "random.choice" was a better name anyway, but then I remembered that the standard library "random.sample" exists and does something similar. So instead I'd like to argue that making this compatible with Python is more important than making it compatible with R :-).

Specifically, 'np.random.sample(array_like, k)', with no further arguments, should perform random sampling *without* replacement. Ideally it should also efficiently handle the case where array_like is an xrange object. Docs are here: http://docs.python.org/library/random.html

-- Nathaniel _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Nathaniel Smith

1:58 p.m.

On Fri, Sep 2, 2011 at 10:14 AM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

I made the changes discussed here and pushed them to pull request.

https://github.com/numpy/numpy/pull/143#issuecomment-1980897

I think you mean https://github.com/numpy/numpy/pull/151

...

I changed the new function's name from sample to choice and added the size=1 default as Robert suggested. I also reverted all the changes for sample, random, and ranf and added them to the reference docs.

I still think the default should be sampling without replacement, so as to reduce confusion later when we make this available as np.random.sample. You should also deprecate the current 'sample' alias (I believe this is basically a matter of documenting that it is deprecated, and making sure that it fires a deprecation warning when used). But perhaps you want to do that in a separate pull request, I dunno. -- Nathaniel

Christopher Jordan-Squire

2:45 p.m.

On Fri, Sep 2, 2011 at 12:58 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Fri, Sep 2, 2011 at 10:14 AM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
I made the changes discussed here and pushed them to pull request.

https://github.com/numpy/numpy/pull/143#issuecomment-1980897

I think you mean https://github.com/numpy/numpy/pull/151

Yes. Sorry for the mix-up.

...

...
I changed the new function's name from sample to choice and added the size=1 default as Robert suggested. I also reverted all the changes for sample, random, and ranf and added them to the reference docs.

I still think the default should be sampling without replacement, so as to reduce confusion later when we make this available as np.random.sample.

np.random.sample could be just a call to np.random.choice(blah, blah, replace=False, blah). I'm not opposed to python compatibility, I just would like a function with the defaults I want.

...

You should also deprecate the current 'sample' alias (I believe this is basically a matter of documenting that it is deprecated, and making sure that it fires a deprecation warning when used). But perhaps you want to do that in a separate pull request, I dunno.

Yes, I think I need a separate pull request after getting some agreement that it should be changed. -Chris

...

-- Nathaniel _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Christopher Jordan-Squire

1:17 p.m.

On Thu, Sep 1, 2011 at 10:55 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Thu, Sep 1, 2011 at 8:31 PM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
On Thu, Sep 1, 2011 at 11:14 PM, Robert Kern <robert.kern@gmail.com> wrote:

...
On Thu, Sep 1, 2011 at 22:07, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
So in the mean time, are there any suggestions for what this R sample function should be called, since random.sample is apparently taken?

If you default to size=1 (which you probably should anyways), then np.random.choice() makes sense, in analogy to random.choice() from the standard library.

Alright. I can make that change tomorrow. I'd prefer np.sample in the long-run, for compatibility with R. (False friends are loathsome things.) How does one petition to get function names deprecated?

I was about to argue that "random.choice" was a better name anyway, but then I remembered that the standard library "random.sample" exists and does something similar. So instead I'd like to argue that making this compatible with Python is more important than making it compatible with R :-).

Specifically, 'np.random.sample(array_like, k)', with no further arguments, should perform random sampling *without* replacement. Ideally it should also efficiently handle the case where array_like is an xrange object. Docs are here: http://docs.python.org/library/random.html

I mentioned this at the end of my initial response to Josef. I agree that my use of sample is more consistent with the python random library's usage, except that I don't want to have the same name but a different default. I think that also counts as a 'false friend'. So I think np.random.choice is a better name for right now. Simpler to have the same function with the same defaults but a different name than the same function with the same name and different defaults. -Chris

...

-- Nathaniel _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

4633

Age (days ago)

4634

Last active (days ago)

List overview

Download

12 comments

4 participants

participants (4)

Christopher Jordan-Squire
josef.pktd＠gmail.com
Nathaniel Smith
Robert Kern

Pull Request Review: R-like sample function

Christopher Jordan-Squire

josef.pktd＠gmail.com

Christopher Jordan-Squire

Robert Kern

Christopher Jordan-Squire

Robert Kern

Christopher Jordan-Squire

Robert Kern

Nathaniel Smith

Christopher Jordan-Squire

Nathaniel Smith

Christopher Jordan-Squire

Christopher Jordan-Squire

tags

participants (4)