[Numpy-discussion] align `choices` and `sample` with Python `random` module
Warren Weckesser
warren.weckesser at gmail.com
Tue Dec 11 13:50:26 EST 2018
On Tue, Dec 11, 2018 at 1:37 PM Warren Weckesser <warren.weckesser at gmail.com>
wrote:
>
>
> On Tue, Dec 11, 2018 at 10:32 AM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>
>>
>>
>> On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <
>> warren.weckesser at gmail.com> wrote:
>>
>>>
>>>
>>> On 12/10/18, Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>> > On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <alan.isaac at gmail.com>
>>> > wrote:
>>> >
>>> >> I believe this was proposed in the past to little enthusiasm,
>>> >> with the response, "you're using a library; learn its functions".
>>> >>
>>> >
>>> > Not only that, NumPy and the core libraries around it are the standard
>>> > for numerical/statistical computing. If core Python devs want to
>>> > replicate a small subset of that functionality in a new Python version
>>> > like 3.6, it would be sensible for them to choose compatible names. I
>>> > don't think there's any justification for us to bother our users based
>>> > on new things that get added to the stdlib.
>>> >
>>> >
>>> >> Nevertheless, given the addition of `choices` to the Python
>>> >> random module in 3.6, it would be nice to have the *same name*
>>> >> for parallel functionality in numpy.random.
>>> >>
>>> >> And given the redundancy of numpy.random.sample, it would be
>>> >> nice to deprecate it with the intent to reintroduce
>>> >> the name later, better aligned with Python's usage.
>>> >>
>>> >
>>> > No, there is nothing wrong with the current API, so I'm -10 on
>>> > deprecating it.
>>>
>>> Actually, the `numpy.random.choice` API has one major weakness. When
>>> `replace` is False and `size` is greater than 1, the function is actually
>>> drawing *one* sample from a multivariate distribution. For the other
>>> multivariate distributions (multinomial, multivariate_normal and
>>> dirichlet), `size` sets the number of samples to draw from the
>>> distribution. With `replace=False` in `choice`, size becomes a *parameter*
>>> of the distribution, and it is only possible to draw one (multivariate)
>>> sample.
>>>
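A minimal sketch of the contrast described above, assuming a seeded `numpy.random.RandomState` (the seed and the multinomial parameters are arbitrary illustration values, not taken from the thread). For `multinomial`, `size` counts samples; for `choice` with `replace=False`, `size` is the length of the single dependent draw.

```python
import numpy as np

rng = np.random.RandomState(12345)  # arbitrary seed, for repeatability only

# multinomial: `size` is the number of samples drawn. Each sample is a
# length-3 vector of counts that sums to n=10.
m = rng.multinomial(10, [0.2, 0.3, 0.5], size=4)
print(m.shape)  # (4, 3)

# choice without replacement: `size` is the number of items in the one
# (multivariate) sample, so the result is a single length-3 draw.
c = rng.choice(5, size=3, replace=False)
print(c.shape)  # (3,)
```

There is no argument to `choice` that plays the role `size` plays for `multinomial` once `replace=False` is set.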
>>
>> I'm not sure I follow. `choice` draws samples from a given 1-D array,
>> more than 1:
>>
>> In [12]: np.random.choice(np.arange(5), size=2, replace=True)
>> Out[12]: array([2, 2])
>>
>> In [13]: np.random.choice(np.arange(5), size=2, replace=False)
>> Out[13]: array([3, 0])
>>
>> The multivariate distribution you're talking about is for generating the
>> indices I assume. Does the current implementation actually give a result
>> for size>1 that has different statistical properties from calling the
>> function N times with size=1? If so, that's definitely worth a bug report
>> at least (I don't think there is one for this).
>>
>>
> There is no bug, just a limitation in the API.
>
> When I draw without replacement, say, three values from a collection of
> length five, the three values that I get are not independent. So really,
> this is *one* sample from a three-dimensional (discrete-valued)
> distribution. The problem with the current API is that I can't get
> multiple samples from this three-dimensional distribution in one call. If
> I need to repeat the process six times, I have to use a loop, e.g.:
>
> >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False,
> ...                             size=3) for _ in range(6)]
>
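The dependence within one draw can be checked empirically. The following is only an illustrative sketch (the seed and the 1000 repetitions are arbitrary assumptions): it compares one `replace=False` call of size 3 against three independent single-item calls.

```python
import numpy as np

rng = np.random.RandomState(2018)  # arbitrary seed
items = [10, 20, 30, 40, 50]

# One call with replace=False per row: values within a row are distinct.
joint = np.array([rng.choice(items, size=3, replace=False)
                  for _ in range(1000)])
dup_joint = sum(len(set(row)) < 3 for row in joint.tolist())

# Three independent single-item draws per row: repeats can occur.
indep = np.array([[rng.choice(items) for _ in range(3)]
                  for _ in range(1000)])
dup_indep = sum(len(set(row)) < 3 for row in indep.tolist())

print(dup_joint)      # 0, since a draw without replacement cannot repeat
print(dup_indep > 0)  # repeats are expected in roughly half of the rows
```

For three independent draws from five items, the probability of no repeat is 5*4*3/125 = 0.48, so the two procedures sample from different distributions.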
> With the `select` function I described in my previous email, which I'll
> call `random_select` here, the parameter that determines the number of
> items per sample, `nsample`, is separate from the parameter that determines
> the number of samples, `size`:
>
> >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
> >>> samples
> array([[30, 40, 50],
>        [40, 50, 30],
>        [10, 20, 40],
>        [20, 30, 50],
>        [40, 20, 50],
>        [20, 10, 30]])
>
> (`select` is a really bad name, since `numpy.select` already exists and is
> something completely different. I had the longer name `random.select` in
> mind when I started using it. "There are only two hard problems..." etc.)
>
>
As I reread this, I see another naming problem: "sample" is used to mean
different things. In my description above, I referred to one "sample" as
the length-3 sequence generated by one call to `numpy.random.choice([10,
20, 30, 40, 50], replace=False, size=3)`, but in `random_select`, `nsample`
refers to the length of each sequence generated. I use the name 'nsample'
to be consistent with `numpy.random.hypergeometric`. I hope the output of
the `random_select` call shown above makes clear the desired behavior.
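
For readers who want to try the behavior, here is a rough sketch of one way to get it for the equal-probability case. It is not the prototype from the gist: the argsort-of-random-keys technique, the `rng` parameter, and the restriction to integer `size` and uniform `p` are all assumptions made for this illustration.

```python
import numpy as np

def random_select(items, nsample, size, rng=None):
    """Hypothetical sketch: `size` independent draws of `nsample` items
    each, without replacement within a draw, equal probabilities only."""
    rng = np.random.RandomState() if rng is None else rng
    items = np.asarray(items)
    if not 0 <= nsample <= len(items):
        raise ValueError("nsample must be between 0 and len(items)")
    # One row of i.i.d. uniform keys per draw. Sorting a row of such keys
    # yields a uniformly random permutation of the item indices.
    keys = rng.random_sample((size, len(items)))
    idx = np.argsort(keys, axis=1)[:, :nsample]
    return items[idx]

samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6,
                        rng=np.random.RandomState(0))
print(samples.shape)  # (6, 3)
```

Each row is one length-3 selection with no repeated items, and the rows are independent of one another. Sorting all `len(items)` keys for every row is wasteful when `nsample` is small, so this is meant to show the semantics rather than an efficient implementation.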
Warren
>
>
>
>> Cheers,
>> Ralf
>>
>>
>>
>>> I thought about this some time ago, and came up with an API that
>>> eliminates the boolean flag, and separates the `size` argument from the
>>> number of items drawn in one sample, which I'll call `nsample`. To avoid
>>> creating a "false friend" with the standard library and with numpy's
>>> `choice`, I'll call this function `select`.
>>>
>>> Here's the proposed signature and docstring. (A prototype
>>> implementation is in a gist at
>>> https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)
>>> The key feature is the `nsample` argument, which sets how many items to
>>> select from the given collection. Also note that this function is *always*
>>> drawing *without replacement*. It covers the `replace=True` case because
>>> drawing one item without replacement is the same as drawing one item with
>>> replacement.
>>>
>>> Whether or not an API like the following is used, it would be nice if
>>> there was some way to get multiple samples in the `replace=False` case in
>>> one function call.
>>>
>>> def select(items, nsample=None, p=None, size=None):
>>> """
>>> Select random samples from `items`.
>>>
>>> The function randomly selects `nsample` items from `items` without
>>> replacement.
>>>
>>> Parameters
>>> ----------
>>> items : sequence
>>> The collection of items from which the selection is made.
>>> nsample : int, optional
>>> Number of items to select without replacement in each draw.
>>> It must be between 0 and len(items), inclusive.
>>> p : array-like of floats, same length as `items`, optional
>>> Probabilities of the items. If this argument is not given,
>>> the elements in `items` are assumed to have equal probability.
>>> size : int, optional
>>> Number of variates to draw.
>>>
>>> Notes
>>> -----
>>> `size=None` means "generate a single selection".
>>>
>>> If `size` is None, the result is equivalent to
>>> numpy.random.choice(items, size=nsample, replace=False)
>>>
>>> `nsample=None` means draw one (scalar) sample.
>>> If `nsample` is None, the function acts (almost) like nsample=1 (see
>>> below for more information), and the result is equivalent to
>>> numpy.random.choice(items, size=size)
>>> In effect, it does choice with replacement. The case `nsample=None`
>>> can be interpreted as each sample is a scalar, and `nsample=k`
>>> means each sample is a sequence with length k.
>>>
>>> If `nsample` is not None, it must be a nonnegative integer with
>>> 0 <= nsample <= len(items).
>>>
>>> If `size` is not None, it must be an integer or a tuple of integers.
>>> When `size` is an integer, it is treated as the tuple ``(size,)``.
>>>
>>> When both `nsample` and `size` are not None, the result
>>> has shape ``size + (nsample,)``.
>>>
>>> Examples
>>> --------
>>> Make 6 choices with replacement from [10, 20, 30, 40]. (This is
>>> equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
>>> do it six times.")
>>>
>>> >>> select([10, 20, 30, 40], size=6)
>>> array([20, 20, 40, 10, 40, 30])
>>>
>>> Choose two items from [10, 20, 30, 40] without replacement. Do it six
>>> times.
>>>
>>> >>> select([10, 20, 30, 40], nsample=2, size=6)
>>> array([[40, 10],
>>>        [20, 30],
>>>        [10, 40],
>>>        [30, 10],
>>>        [10, 30],
>>>        [10, 20]])
>>>
>>> When `nsample` is an integer, there is always an axis at the end of the
>>> result with length `nsample`, even when `nsample=1`. For example, the
>>> shape of the array returned in the following call is (2, 3, 1).
>>>
>>> >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
>>> array([[[10],
>>>         [30],
>>>         [20]],
>>>
>>>        [[10],
>>>         [40],
>>>         [20]]])
>>>
>>> When `nsample` is None, it acts like `nsample=1`, but the trivial
>>> dimension is not included. The shape of the array returned in the
>>> following call is (2, 3).
>>>
>>> >>> select([10, 20, 30, 40], size=(2, 3))
>>> array([[20, 40, 30],
>>>        [30, 20, 40]])
>>>
>>> """
>>>
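The shape rules in the Notes section can be condensed into a few lines. The helper below, `result_shape`, is a hypothetical name introduced only for this illustration. It is not part of the proposal or of NumPy, and it assumes `size` is None, a plain Python int, or a tuple of ints.

```python
def result_shape(nsample=None, size=None):
    """Hypothetical helper: output shape implied by the docstring's Notes."""
    if size is None:
        size = ()            # a single selection
    elif isinstance(size, int):
        size = (size,)       # an integer is treated as the tuple (size,)
    else:
        size = tuple(size)
    # nsample=None means scalar samples, so no trailing axis is added.
    trailing = () if nsample is None else (nsample,)
    return size + trailing

print(result_shape(size=6))                    # (6,)
print(result_shape(nsample=2, size=6))         # (6, 2)
print(result_shape(nsample=1, size=(2, 3)))    # (2, 3, 1)
print(result_shape(size=(2, 3)))               # (2, 3)
```

The four calls correspond to the four examples in the docstring.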
>>>
>>> Warren
>>>
>>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>