[Numpy-discussion] align `choices` and `sample` with Python `random` module

Mon Dec 10 13:27:18 EST 2018

On 12/10/18, Ralf Gommers <ralf.gommers at gmail.com> wrote:
> On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <alan.isaac at gmail.com> wrote:
>
>> I believe this was proposed in the past to little enthusiasm,
>> with the response, "you're using a library; learn its functions".
>>
>
> Not only that, NumPy and the core libraries around it are the standard for
> numerical/statistical computing. If core Python devs want to replicate a
> small subset of that functionality in a new Python version like 3.6, it
> would be sensible for them to choose compatible names. I don't think
> there's any justification for us to bother our users based on new things
> that get added to the stdlib.
>
>
>> Nevertheless, given the addition of `choices` to the Python
>> random module in 3.6, it would be nice to have the *same name*
>> for parallel functionality in numpy.random.
>>
>> And given the redundancy of numpy.random.sample, it would be
>> nice to deprecate it with the intent to reintroduce
>> the name later, better aligned with Python's usage.
>>
>
> No, there is nothing wrong with the current API, so I'm -10 on deprecating
> it.

Actually, the `numpy.random.choice` API has one major weakness.  When
`replace` is False and `size` is greater than 1, the function is actually
drawing *one* sample from a multivariate distribution.  For the other
multivariate distributions (multinomial, multivariate_normal and
dirichlet), `size` sets the number of samples to draw from the
distribution.  With `replace=False` in `choice`, `size` becomes a *parameter*
of the distribution, and it is only possible to draw one (multivariate)
sample per call.
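
To make the difference concrete, here is a small sketch comparing what
`size` means for `multinomial` and for `choice` with `replace=False`.  The
seed and the input values are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.RandomState(12345)  # arbitrary seed, for illustration

# multinomial: `size` is the number of independent samples.
# Each sample is a vector with one entry per category.
m = rng.multinomial(10, [0.2, 0.3, 0.5], size=4)
print(m.shape)   # (4, 3): 4 samples, each of length 3

# choice with replace=False: `size` is the length of the single sample.
c = rng.choice([10, 20, 30, 40], size=3, replace=False)
print(c.shape)   # (3,): one sample made of 3 distinct items

# Several such samples currently require a Python-level loop.
several = np.array([rng.choice([10, 20, 30, 40], size=3, replace=False)
                    for _ in range(4)])
print(several.shape)   # (4, 3)
```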

I thought about this some time ago, and came up with an API that eliminates
the boolean flag, and separates the `size` argument from the number of
items drawn in one sample, which I'll call `nsample`. To avoid creating a
"false friend" with the standard library and with numpy's `choice`, I'll
call this function `select`.

Here's the proposed signature and docstring.  (A prototype implementation
is in a gist at
https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)
The key feature is the `nsample` argument, which sets how many items to
select from the given collection.  Also note that this function is *always*
drawing *without replacement*.  It covers the `replace=True` case because
drawing one item without replacement is the same as drawing one item with
replacement.

Whether or not an API like the following is used, it would be nice if there
was some way to get multiple samples in the `replace=False` case in one
function call.
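
For the equal-probability case only, one possible workaround with the
existing API is to rank a matrix of uniform random keys row by row and keep
the first few positions of each row.  The helper name `draw_subsets` and its
arguments are hypothetical, and this sketch does not handle the `p`
argument.

```python
import numpy as np

def draw_subsets(items, nsample, size, rng):
    """Draw `size` subsets of `nsample` distinct items (uniform case only)."""
    items = np.asarray(items)
    # One row of random keys per subset; sorting the keys in a row yields
    # a uniformly random permutation of the item indices.
    keys = rng.random_sample((size, len(items)))
    idx = np.argsort(keys, axis=1)[:, :nsample]
    return items[idx]

rng = np.random.RandomState(0)  # arbitrary seed, for illustration
subsets = draw_subsets([10, 20, 30, 40], nsample=2, size=6, rng=rng)
print(subsets.shape)   # (6, 2)
```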

def select(items, nsample=None, p=None, size=None):
    """
    Select random samples from `items`.

    The function randomly selects `nsample` items from `items` without
    replacement.

    Parameters
    ----------
    items : sequence
        The collection of items from which the selection is made.
    nsample : int, optional
        Number of items to select without replacement in each draw.
        It must be between 0 and len(items), inclusive.
    p : array-like of floats, same length as `items`, optional
        Probabilities of the items.  If this argument is not given,
        the elements in `items` are assumed to have equal probability.
    size : int, optional
        Number of variates to draw.

    Notes
    -----
    `size=None` means "generate a single selection".

    If `size` is None, the result is equivalent to
        numpy.random.choice(items, size=nsample, replace=False)

    `nsample=None` means draw one (scalar) sample.
    If `nsample` is None, the function acts (almost) like nsample=1 (see
    below for more information), and the result is equivalent to
        numpy.random.choice(items, size=size)
    In effect, it does choice with replacement.  The case `nsample=None`
    can be interpreted as meaning that each sample is a scalar, and
    `nsample=k` as meaning that each sample is a sequence with length k.

    If `nsample` is not None, it must be a nonnegative integer with
    0 <= nsample <= len(items).

    If `size` is not None, it must be an integer or a tuple of integers.
    When `size` is an integer, it is treated as the tuple ``(size,)``.

    When both `nsample` and `size` are not None, the result
    has shape ``size + (nsample,)``.

    Examples
    --------
    Make 6 choices with replacement from [10, 20, 30, 40].  (This is
    equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
    do it six times.")

    >>> select([10, 20, 30, 40], size=6)
    array([20, 20, 40, 10, 40, 30])

    Choose two items from [10, 20, 30, 40] without replacement.  Do it six
    times.

    >>> select([10, 20, 30, 40], nsample=2, size=6)
    array([[40, 10],
           [20, 30],
           [10, 40],
           [30, 10],
           [10, 30],
           [10, 20]])

    When `nsample` is an integer, there is always an axis at the end of the
    result with length `nsample`, even when `nsample=1`.  For example, the
    shape of the array returned in the following call is (2, 3, 1).

    >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
    array([[[10],
            [30],
            [20]],

           [[10],
            [40],
            [20]]])

    When `nsample` is None, it acts like `nsample=1`, but the trivial
    dimension is not included.  The shape of the array returned in the
    following call is (2, 3).

    >>> select([10, 20, 30, 40], size=(2, 3))
    array([[20, 40, 30],
           [30, 20, 40]])

    """

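The shape rules in the docstring can be illustrated with a minimal sketch
built on the existing `choice`.  This is not the prototype from the gist: the
name `select_sketch` and the extra `rng` argument are hypothetical, and the
plain Python loop is used only to keep the behavior easy to follow.

```python
import numpy as np

def select_sketch(items, nsample=None, p=None, size=None, rng=None):
    """Illustrative stand-in for the proposed `select` (not the gist code)."""
    rng = np.random if rng is None else rng
    items = np.asarray(items)

    # Normalize `size` to a tuple; None means a single selection.
    if size is None:
        shape = ()
    elif isinstance(size, (int, np.integer)):
        shape = (int(size),)
    else:
        shape = tuple(size)

    # nsample=None behaves like nsample=1 without the trailing axis.
    k = 1 if nsample is None else nsample
    if not 0 <= k <= len(items):
        raise ValueError("nsample must satisfy 0 <= nsample <= len(items)")

    n = int(np.prod(shape, dtype=int))  # number of selections to draw
    out = np.empty((n, k), dtype=items.dtype)
    for i in range(n):
        # Each selection is one draw of k items without replacement.
        out[i] = rng.choice(items, size=k, replace=False, p=p)

    out = out.reshape(shape + (k,))
    if nsample is None:
        out = out[..., 0]  # drop the trivial last axis
    # Return a scalar rather than a 0-d array for a single scalar draw.
    return out[()] if out.ndim == 0 else out

rng = np.random.RandomState(0)  # arbitrary seed, for illustration
print(select_sketch([10, 20, 30, 40], nsample=2, size=6, rng=rng).shape)       # (6, 2)
print(select_sketch([10, 20, 30, 40], nsample=1, size=(2, 3), rng=rng).shape)  # (2, 3, 1)
print(select_sketch([10, 20, 30, 40], size=(2, 3), rng=rng).shape)             # (2, 3)
```
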
Warren

>
> Ralf
>
>
>> Obviously numpy.random.choice exists for both cases,
>> so this comment is not about functionality.
>> And I accept that some will think it is not about anything.
>> Perhaps it might be at least seen as being about this:
>> using the same function (`choice`) with a boolean argument
>> (`replace`) to switch between sampling strategies at least
>> appears to violate the proposal floated at times on this
>> list that called for two separate functions in apparently
>> similar cases.  (I am not at all trying to claim that the
>> argument against flag parameters is definitive; I'm just
>> mentioning that this viewpoint has already been
>> promulgated on this list.)
>>
>> Cheers, Alan Isaac
>>
>