[Numpy-discussion] Pull Request Review: R-like sample function

Christopher Jordan-Squire cjordan1 at uw.edu
Thu Sep 1 22:39:48 EDT 2011


On Thu, Sep 1, 2011 at 10:01 PM,  <josef.pktd at gmail.com> wrote:
> On Thu, Sep 1, 2011 at 6:02 PM, Christopher Jordan-Squire
> <cjordan1 at uw.edu> wrote:
>> Hi--I've just submitted a numpy 2.0 pull request for a function sample
>> in np.random. It's essentially an implementation of R's sample
>> function. It allows possibly non-uniform, possibly without-replacement
>> sampling from a given 1-D array-like. This is very useful for quickly
>> and cleanly creating samples from, for example, a list of strings or a
>> list of non-contiguous, non-evenly spaced integers. Both occur in data
>> analysis with categorical data.
>>
>> It is, essentially, a convenience function that wraps a number of
>> existing ways to take a random sample. I think it belongs in
>> numpy.random rather than scipy.stats because it's just a random
>> sampler, rather than a probability distribution. It isn't possible to
>> define a scipy.stats discrete random variable on strings--it would
>> have to instead be done on the indices of the list containing the
>> possible samples. And (as far as I can tell) the scipy.stats
>> distributions can't be used for sampling without replacement.
>>
>> https://github.com/numpy/numpy/pull/151
>
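To make the behaviour described above concrete, here is a minimal sketch of weighted sampling with and without replacement from a 1-D sequence. It is not the code from the pull request. The name sample_sketch and its parameters (population, size, replace, p, rng) are assumptions chosen for illustration, and it uses only the standard library rather than numpy.

```python
import random


def sample_sketch(population, size, replace=True, p=None, rng=None):
    """Hypothetical stand-in for the proposed sampler (not the PR's code)."""
    if rng is None:
        rng = random.Random()
    items = list(population)
    # Uniform weights unless the caller supplies one weight per item.
    weights = [1.0] * len(items) if p is None else list(p)
    if len(weights) != len(items):
        raise ValueError("p must have one weight per item")
    if replace:
        # Independent draws; the same item may appear more than once.
        return rng.choices(items, weights=weights, k=size)
    if size > len(items):
        raise ValueError("cannot draw more items than exist without replacement")
    drawn = []
    for _ in range(size):
        # Pick one index in proportion to the remaining weights, then remove it.
        i = rng.choices(range(len(items)), weights=weights, k=1)[0]
        drawn.append(items.pop(i))
        weights.pop(i)
    return drawn


colors = ["red", "green", "blue", "violet"]
rng = random.Random(0)
print(sample_sketch(colors, 6, replace=True, p=[0.5, 0.3, 0.1, 0.1], rng=rng))
print(sample_sketch(colors, 3, replace=False, rng=rng))
```

The same call works on a list of non-contiguous integers such as [3, 17, 40], since items are drawn by position rather than by value.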
> I don't think you can kill numpy.random.random and similar names in
> the same commit that adds a new function.
>

Killjoy.

> First these functions would need to be deprecated.
>

I discussed this with a few other people, and they suggested that it
could be alright since it's for numpy 2.0 rather than numpy 1.x. For
the 2.0 version it would be perfectly reasonable to have a break with
the API. (Though, as I said, it's not a break with the API.)

> "it does not break the API as the previous function was not in the docs"
>
> This is a doc bug, I assume. I don't think it means users/developers
> don't rely on it.
>

You apparently don't subscribe to the view that the API is infallible
revelation revealed.

(That's a joke, if it's not obvious.)

> - searching for np.random.random shows 120 threads in my Gmail reader
> - Python's standard library uses random.random()
> - dir(np.random) shows it
> - I copied it from mailing list examples, and it's used quite a bit in
>   scipy, as I saw because of your work
>
> I also find the historical multiplicity of aliases confusing, but
> which names should be deprecated would at least require a discussion
> and a separate commit.
>
> Josef
>

I hadn't thought about the random.random connection. I'm fine with
leaving random.random as an alias for random.random_sample. I just
wanted to claim random.sample for my own function.

I can't think of many other instances of aliased functions like that
in numpy, though--but perhaps I'm not thinking hard enough. It
certainly seemed strange to have 4 names for the same function.

Now that you mention the standard library connection, my use of
random.sample seems more in line with the standard library random
package than the alias to random.random_sample. Though that would
suggest using the default replace=False, since the standard library's
random.sample draws without replacement, which I'd prefer not to do.
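For comparison, here is a short sketch of how the standard library's random module behaves on this point. random.sample draws without replacement. random.choices draws with replacement, but it was only added in Python 3.6, well after this thread. The deck list and the seed are made up for illustration.

```python
import random

rng = random.Random(2011)
deck = ["ace", "king", "queen", "jack"]

# random.sample draws without replacement: no item is returned twice.
hand = rng.sample(deck, 3)

# Asking for more items than the population holds raises ValueError.
try:
    rng.sample(deck, 5)
except ValueError:
    oversized_ok = False
else:
    oversized_ok = True

# random.choices draws with replacement, so repeats are possible
# and k may exceed the population size.
draws = rng.choices(deck, k=10)
```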

-Chris JS

>
>>
>> -Chris JS
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>