[Numpy-discussion] align `choices` and `sample` with Python `random` module

Tue Dec 11 13:50:26 EST 2018

On Tue, Dec 11, 2018 at 1:37 PM Warren Weckesser <warren.weckesser at gmail.com>
wrote:

>
>
> On Tue, Dec 11, 2018 at 10:32 AM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>
>>
>>
>> On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <
>> warren.weckesser at gmail.com> wrote:
>>
>>>
>>>
>>> On 12/10/18, Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>> > On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <alan.isaac at gmail.com> wrote:
>>> >
>>> >> I believe this was proposed in the past to little enthusiasm,
>>> >> with the response, "you're using a library; learn its functions".
>>> >>
>>> >
>>> > Not only that, NumPy and the core libraries around it are the standard
>>> > for numerical/statistical computing. If core Python devs want to
>>> > replicate a small subset of that functionality in a new Python version
>>> > like 3.6, it would be sensible for them to choose compatible names. I
>>> > don't think there's any justification for us to bother our users based
>>> > on new things that get added to the stdlib.
>>> >
>>> >
>>> >> Nevertheless, given the addition of `choices` to the Python
>>> >> random module in 3.6, it would be nice to have the *same name*
>>> >> for parallel functionality in numpy.random.
>>> >>
>>> >> And given the redundancy of numpy.random.sample, it would be
>>> >> nice to deprecate it with the intent to reintroduce
>>> >> the name later, better aligned with Python's usage.
>>> >>
>>> >
>>> > No, there is nothing wrong with the current API, so I'm -10 on
>>> > deprecating it.
>>>
>>> Actually, the `numpy.random.choice` API has one major weakness.  When
>>> `replace` is False and `size` is greater than 1, the function is actually
>>> drawing *one* sample from a multivariate distribution.  For the other
>>> multivariate distributions (multinomial, multivariate_normal and
>>> dirichlet), `size` sets the number of samples to draw from the
>>> distribution.  With `replace=False` in `choice`, size becomes a *parameter*
>>> of the distribution, and it is only possible to draw one (multivariate)
>>> sample.
>>>
>>
>> I'm not sure I follow. `choice` draws samples from a given 1-D array,
>> more than 1:
>>
>> In [12]: np.random.choice(np.arange(5), size=2, replace=True)
>> Out[12]: array([2, 2])
>>
>> In [13]: np.random.choice(np.arange(5), size=2, replace=False)
>> Out[13]: array([3, 0])
>>
>> The multivariate distribution you're talking about is for generating the
>> indices I assume. Does the current implementation actually give a result
>> for size>1 that has different statistical properties from calling the
>> function N times with size=1? If so, that's definitely worth a bug report
>> at least (I don't think there is one for this).
>>
>>
> There is no bug, just a limitation in the API.
>
> When I draw without replacement, say, three values from a collection of
> length five, the three values that I get are not independent.  So really,
> this is *one* sample from a three-dimensional (discrete-valued)
> distribution.  The problem with the current API is that I can't get
> multiple samples from this three-dimensional distribution in one call.  If
> I need to repeat the process six times, I have to use a loop, e.g.:
>
>     >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False,
>     ...                             size=3) for _ in range(6)]
>
> With the `select` function I described in my previous email, which I'll
> call `random_select` here, the parameter that determines the number of
> items per sample, `nsample`, is separate from the parameter that determines
> the number of samples, `size`:
>
>     >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
>     >>> samples
>     array([[30, 40, 50],
>            [40, 50, 30],
>            [10, 20, 40],
>            [20, 30, 50],
>            [40, 20, 50],
>            [20, 10, 30]])
>
> (`select` is a really bad name, since `numpy.select` already exists and is
> something completely different.  I had the longer name `random.select` in
> mind when I started using it. "There are only two hard problems..." etc.)
>
>

As I reread this, I see another naming problem: "sample" is used to mean
different things.  In my description above, I referred to one "sample" as
the length-3 sequence generated by one call to
`numpy.random.choice([10, 20, 30, 40, 50], replace=False, size=3)`, but in
`random_select`, `nsample` refers to the length of each sequence generated.
I use the name `nsample` to be consistent with
`numpy.random.hypergeometric`.  I hope the output of the `random_select`
call shown above makes clear the desired behavior.
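
To make the intended shapes concrete, here is a minimal sketch of how such
a function could behave.  It is not the prototype from the gist: it handles
only the uniform case (no `p`), it assumes a NumPy recent enough to provide
`numpy.random.default_rng`, and the `rng` argument is a hypothetical
addition for reproducibility.  Each row is produced by sorting independent
uniform keys, which gives an independent random permutation per row.

```python
import numpy as np


def random_select(items, nsample=None, size=None, rng=None):
    """Illustrative sketch only (uniform probabilities, no `p` argument).

    Draws `size` independent selections; within each selection, `nsample`
    items are chosen without replacement.
    """
    # `rng` is a hypothetical parameter, not part of the proposed signature.
    rng = np.random.default_rng() if rng is None else rng
    items = np.asarray(items)
    n = len(items)

    if nsample is None:
        # Scalar samples: one item per draw, no trailing axis.
        return items[rng.integers(0, n, size=size)]

    if not 0 <= nsample <= n:
        raise ValueError("nsample must satisfy 0 <= nsample <= len(items)")

    # Normalize `size` to a tuple: None -> (), int -> (size,).
    if size is None:
        shape = ()
    elif isinstance(size, (int, np.integer)):
        shape = (int(size),)
    else:
        shape = tuple(size)

    # Sorting i.i.d. uniform keys along the last axis yields an independent
    # random permutation of range(n) for each row; keep the first `nsample`.
    keys = rng.random(shape + (n,))
    idx = np.argsort(keys, axis=-1)[..., :nsample]
    return items[idx]


samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
print(samples.shape)  # shape is size + (nsample,)
```

Here `size` counts selections and `nsample` counts items per selection, so
no values repeat within a row, while different rows are independent of each
other.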

Warren

> Warren
>
>
>
>> Cheers,
>> Ralf
>>
>>
>>
>>> I thought about this some time ago, and came up with an API that
>>> eliminates the boolean flag, and separates the `size` argument from the
>>> number of items drawn in one sample, which I'll call `nsample`. To avoid
>>> creating a "false friend" with the standard library and with numpy's
>>> `choice`, I'll call this function `select`.
>>>
>>> Here's the proposed signature and docstring.  (A prototype
>>> implementation is in a gist at
>>> https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)
>>> The key feature is the `nsample` argument, which sets how many items to
>>> select from the given collection.  Also note that this function is *always*
>>> drawing *without replacement*.  It covers the `replace=True` case because
>>> drawing one item without replacement is the same as drawing one item with
>>> replacement.
>>>
>>> Whether or not an API like the following is used, it would be nice if
>>> there was some way to get multiple samples in the `replace=False` case in
>>> one function call.
>>>
>>> def select(items, nsample=None, p=None, size=None):
>>>     """
>>>     Select random samples from `items`.
>>>
>>>     The function randomly selects `nsample` items from `items` without
>>>     replacement.
>>>
>>>     Parameters
>>>     ----------
>>>     items : sequence
>>>         The collection of items from which the selection is made.
>>>     nsample : int, optional
>>>         Number of items to select without replacement in each draw.
>>>         It must be between 0 and len(items), inclusive.
>>>     p : array-like of floats, same length as `items`, optional
>>>         Probabilities of the items.  If this argument is not given,
>>>         the elements in `items` are assumed to have equal probability.
>>>     size : int, optional
>>>         Number of variates to draw.
>>>
>>>     Notes
>>>     -----
>>>     `size=None` means "generate a single selection".
>>>
>>>     If `size` is None, the result is equivalent to
>>>         numpy.random.choice(items, size=nsample, replace=False)
>>>
>>>     `nsample=None` means draw one (scalar) sample.
>>>     If `nsample` is None, the function acts (almost) like nsample=1 (see
>>>     below for more information), and the result is equivalent to
>>>         numpy.random.choice(items, size=size)
>>>     In effect, it does choice with replacement.  The case `nsample=None`
>>>     can be interpreted as each sample is a scalar, and `nsample=k`
>>>     means each sample is a sequence with length k.
>>>
>>>     If `nsample` is not None, it must be a nonnegative integer with
>>>     0 <= nsample <= len(items).
>>>
>>>     If `size` is not None, it must be an integer or a tuple of integers.
>>>     When `size` is an integer, it is treated as the tuple ``(size,)``.
>>>
>>>     When both `nsample` and `size` are not None, the result
>>>     has shape ``size + (nsample,)``.
>>>
>>>     Examples
>>>     --------
>>>     Make 6 choices with replacement from [10, 20, 30, 40].  (This is
>>>     equivalent to "Make 1 choice without replacement from
>>>     [10, 20, 30, 40]; do it six times.")
>>>
>>>     >>> select([10, 20, 30, 40], size=6)
>>>     array([20, 20, 40, 10, 40, 30])
>>>
>>>     Choose two items from [10, 20, 30, 40] without replacement.  Do it
>>>     six times.
>>>
>>>     >>> select([10, 20, 30, 40], nsample=2, size=6)
>>>     array([[40, 10],
>>>            [20, 30],
>>>            [10, 40],
>>>            [30, 10],
>>>            [10, 30],
>>>            [10, 20]])
>>>
>>>     When `nsample` is an integer, there is always an axis at the end of
>>>     the result with length `nsample`, even when `nsample=1`.  For
>>>     example, the shape of the array returned in the following call is
>>>     (2, 3, 1).
>>>
>>>     >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
>>>     array([[[10],
>>>             [30],
>>>             [20]],
>>>
>>>            [[10],
>>>             [40],
>>>             [20]]])
>>>
>>>     When `nsample` is None, it acts like `nsample=1`, but the trivial
>>>     dimension is not included.  The shape of the array returned in the
>>>     following call is (2, 3).
>>>
>>>     >>> select([10, 20, 30, 40], size=(2, 3))
>>>     array([[20, 40, 30],
>>>            [30, 20, 40]])
>>>
>>>     """
>>>
>>>
>>> Warren
>>>
>>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>