[Numpy-discussion] align `choices` and `sample` with Python `random` module

Tue Dec 11 10:30:27 EST 2018

On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <
warren.weckesser at gmail.com> wrote:

>
>
> On 12/10/18, Ralf Gommers <ralf.gommers at gmail.com> wrote:
> > On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <alan.isaac at gmail.com> wrote:
> >
> >> I believe this was proposed in the past to little enthusiasm,
> >> with the response, "you're using a library; learn its functions".
> >>
> >
> > Not only that, NumPy and the core libraries around it are the standard
> > for numerical/statistical computing. If core Python devs want to
> > replicate a small subset of that functionality in a new Python version
> > like 3.6, it would be sensible for them to choose compatible names. I
> > don't think there's any justification for us to bother our users based
> > on new things that get added to the stdlib.
> >
> >
> >> Nevertheless, given the addition of `choices` to the Python
> >> random module in 3.6, it would be nice to have the *same name*
> >> for parallel functionality in numpy.random.
> >>
> >> And given the redundancy of numpy.random.sample, it would be
> >> nice to deprecate it with the intent to reintroduce
> >> the name later, better aligned with Python's usage.
> >>
> >
> > No, there is nothing wrong with the current API, so I'm -10 on
> > deprecating it.
>
> Actually, the `numpy.random.choice` API has one major weakness.  When
> `replace` is False and `size` is greater than 1, the function is actually
> drawing *one* sample from a multivariate distribution.  For the other
> multivariate distributions (multinomial, multivariate_normal and
> dirichlet), `size` sets the number of samples to draw from the
> distribution.  With `replace=False` in `choice`, size becomes a *parameter*
> of the distribution, and it is only possible to draw one (multivariate)
> sample.
>

I'm not sure I follow. `choice` draws samples from a given 1-D array, and it
can draw more than one:

In [12]: np.random.choice(np.arange(5), size=2, replace=True)
Out[12]: array([2, 2])

In [13]: np.random.choice(np.arange(5), size=2, replace=False)
Out[13]: array([3, 0])

The multivariate distribution you're talking about is for generating the
indices, I assume. Does the current implementation actually give a result
for size>1 that has different statistical properties from calling the
function N times with size=1? If so, that's definitely worth a bug report
at least (I don't think there is one for this).
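
To make the comparison concrete, here is a minimal sketch of the two call
patterns in question. The seed, the variable names and the use of
`np.random.RandomState` are choices made for this illustration only. It shows
just the structural difference (distinct entries within one `replace=False`
call, possible repeats across separate calls); it does not settle the
statistical question.

```python
import numpy as np

# Illustrative setup: the seed and names are assumptions, not from the thread.
rng = np.random.RandomState(12345)
items = np.arange(5)

# One call with replace=False and size=3: the three entries are always
# distinct, so the call returns a single draw of a 3-element subset.
joint = rng.choice(items, size=3, replace=False)

# Three separate calls with size=1: each call is independent, so the same
# item may show up more than once across calls.
separate = np.array([rng.choice(items, size=1, replace=False)[0]
                     for _ in range(3)])

# Getting several without-replacement samples currently takes a Python loop.
many = np.array([rng.choice(items, size=3, replace=False) for _ in range(4)])

print(joint, separate, many.shape)
```

With `many` built this way, every row holds three distinct items, while
nothing stops two rows from sharing items.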

Cheers,
Ralf

> I thought about this some time ago, and came up with an API that
> eliminates the boolean flag, and separates the `size` argument from the
> number of items drawn in one sample, which I'll call `nsample`. To avoid
> creating a "false friend" with the standard library and with numpy's
> `choice`, I'll call this function `select`.
>
> Here's the proposed signature and docstring.  (A prototype implementation
> is in a gist at
> https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)
> The key feature is the `nsample` argument, which sets how many items to
> select from the given collection.  Also note that this function is *always*
> drawing *without replacement*.  It covers the `replace=True` case because
> drawing one item without replacement is the same as drawing one item with
> replacement.
>
> Whether or not an API like the following is used, it would be nice if
> there was some way to get multiple samples in the `replace=False` case in
> one function call.
>
> def select(items, nsample=None, p=None, size=None):
>     """
>     Select random samples from `items`.
>
>     The function randomly selects `nsample` items from `items` without
>     replacement.
>
>     Parameters
>     ----------
>     items : sequence
>         The collection of items from which the selection is made.
>     nsample : int, optional
>         Number of items to select without replacement in each draw.
>         It must be between 0 and len(items), inclusive.
>     p : array-like of floats, same length as `items`, optional
>         Probabilities of the items.  If this argument is not given,
>         the elements in `items` are assumed to have equal probability.
>     size : int, optional
>         Number of variates to draw.
>
>     Notes
>     -----
>     `size=None` means "generate a single selection".
>
>     If `size` is None, the result is equivalent to
>         numpy.random.choice(items, size=nsample, replace=False)
>
>     `nsample=None` means draw one (scalar) sample.
>     If `nsample` is None, the function acts (almost) like nsample=1 (see
>     below for more information), and the result is equivalent to
>         numpy.random.choice(items, size=size)
>     In effect, it does choice with replacement.  The case `nsample=None`
>     can be interpreted as each sample is a scalar, and `nsample=k`
>     means each sample is a sequence with length k.
>
>     If `nsample` is not None, it must be a nonnegative integer with
>     0 <= nsample <= len(items).
>
>     If `size` is not None, it must be an integer or a tuple of integers.
>     When `size` is an integer, it is treated as the tuple ``(size,)``.
>
>     When both `nsample` and `size` are not None, the result
>     has shape ``size + (nsample,)``.
>
>     Examples
>     --------
>     Make 6 choices with replacement from [10, 20, 30, 40].  (This is
>     equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
>     do it six times.")
>
>     >>> select([10, 20, 30, 40], size=6)
>     array([20, 20, 40, 10, 40, 30])
>
>     Choose two items from [10, 20, 30, 40] without replacement.  Do it six
>     times.
>
>     >>> select([10, 20, 30, 40], nsample=2, size=6)
>     array([[40, 10],
>            [20, 30],
>            [10, 40],
>            [30, 10],
>            [10, 30],
>            [10, 20]])
>
>     When `nsample` is an integer, there is always an axis at the end of the
>     result with length `nsample`, even when `nsample=1`.  For example, the
>     shape of the array returned in the following call is (2, 3, 1).
>
>     >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
>     array([[[10],
>             [30],
>             [20]],
>
>            [[10],
>             [40],
>             [20]]])
>
>     When `nsample` is None, it acts like `nsample=1`, but the trivial
>     dimension is not included.  The shape of the array returned in the
>     following call is (2, 3).
>
>     >>> select([10, 20, 30, 40], size=(2, 3))
>     array([[20, 40, 30],
>            [30, 20, 40]])
>
>     """
>
>
> Warren
>
>
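
For anyone who wants to experiment with the shape semantics described in the
docstring above, here is a naive sketch layered on top of the existing
`choice`. It is not the prototype from the gist. The name `select_sketch`
and the extra `rng` argument are hypothetical, and the Python loop makes it
slow; it is only meant to show what `nsample` and `size` would each control.

```python
import numpy as np


def select_sketch(items, nsample=None, p=None, size=None, rng=None):
    """Naive illustration of the proposed `select` semantics.

    `select_sketch` and `rng` are hypothetical names used for this sketch.
    """
    rng = np.random if rng is None else rng
    items = np.asarray(items)

    if nsample is None:
        # Scalar samples: same as `choice` with replacement.
        return rng.choice(items, size=size, p=p)

    if not 0 <= nsample <= len(items):
        raise ValueError("nsample must satisfy 0 <= nsample <= len(items)")

    if size is None:
        # A single selection of `nsample` distinct items.
        return rng.choice(items, size=nsample, replace=False, p=p)

    # An integer `size` is treated as the tuple (size,).
    shape = (size,) if isinstance(size, (int, np.integer)) else tuple(size)
    ndraws = int(np.prod(shape))

    # One independent without-replacement selection per requested variate.
    draws = [rng.choice(items, size=nsample, replace=False, p=p)
             for _ in range(ndraws)]
    return np.array(draws, dtype=items.dtype).reshape(shape + (nsample,))


rng = np.random.RandomState(0)
pairs = select_sketch([10, 20, 30, 40], nsample=2, size=6, rng=rng)
print(pairs.shape)  # (6, 2)
```

Each row of `pairs` is one selection of two distinct items, and the trailing
axis of length `nsample` is present even when `nsample=1`, as the docstring
describes.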