ENH: Efficient vectorized sampling without replacement - NumPy-Discussion

Jan. 1, 2025

Hello,

NumPy provides efficient, vectorized methods for generating random samples
of an array with replacement. However, it lacks similar functionality for
sampling *without replacement* in a vectorized manner. To address this
limitation, I developed a function that performs this task and achieves
approximately a 30x speedup over a basic Python loop for small sample
sizes (and a 2x speedup using numba). Could this functionality, or
something similar, be integrated into NumPy?
See also this issue <https://github.com/numpy/numpy/issues/28084>.

Kind regards,
Mark

import numpy as np


def random_choice_without_replacement(array, sample_size, n_iterations):
    """
    Generates random samples from a given array without replacement.

    Parameters
    ----------
    array : array-like
        Array from which to draw the random samples.
    sample_size : int
        Number of random samples to draw without replacement per iteration.
    n_iterations : int
        Number of iterations to generate random samples.

    Returns
    -------
    random_samples : ndarray
        The generated random samples, with shape
        ``(n_iterations, sample_size) + array.shape[1:]``.

    Raises
    ------
    ValueError
        If sample_size is greater than the population size.

    Examples
    --------
    Generate 10 random samples of size 3 from np.arange(5) without
    replacement.

    >>> array = np.arange(5)
    >>> random_choice_without_replacement(array, 3, 10)
    array([[4, 0, 1],
           [1, 4, 0],
           [1, 3, 2],
           [0, 1, 3],
           [1, 0, 2],
           [3, 2, 4],
           [0, 3, 1],
           [1, 3, 4],
           [3, 1, 4],
           [0, 1, 3]]) # random

    Generate 4 random samples of size 3 from an n-dimensional array without
    replacement.

    >>> array = np.arange(10).reshape(5, 2)
    >>> random_choice_without_replacement(array, 3, 4)
    array([[[0, 1],
            [8, 9],
            [4, 5]],

           [[2, 3],
            [8, 9],
            [0, 1]],

           [[0, 1],
            [2, 3],
            [8, 9]],

           [[4, 5],
            [2, 3],
            [8, 9]]]) # random

    """

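    # Accept any array-like input so that fancy indexing works at the end.
    array = np.asarray(array)
    population_size = len(array)
    if sample_size > population_size:
        raise ValueError(
            f"sample_size ({sample_size}) is greater than the "
            f"population size ({population_size})."
        )

    # One row of candidate indices per iteration.
    indices = np.tile(np.arange(population_size), (n_iterations, 1))
    random_samples = np.empty((n_iterations, sample_size), dtype=int)
    rng = np.random.default_rng()

    # Partial Fisher-Yates shuffle applied to all rows at once: at step i,
    # draw a position among the first int_max + 1 entries of each row, record
    # the index stored there, then overwrite that position with the entry at
    # int_max so it cannot be drawn again.
    for i in range(sample_size):
        int_max = population_size - 1 - i
        random_indices = rng.integers(0, int_max + 1, size=(n_iterations, 1))
        random_samples[:, i] = np.take_along_axis(
            indices, random_indices, axis=-1
        )[:, 0]
        np.put_along_axis(
            indices, random_indices, indices[:, int_max:int_max + 1], axis=-1
        )

    return array[random_samples]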
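
For comparison, here is a minimal sketch of two alternatives that the
function above could be benchmarked against. Neither comes from the
original post: `sample_loop` and `sample_permuted` are hypothetical helper
names. The first is an assumed form of the "basic Python loop" baseline
(one `Generator.choice(..., replace=False)` call per iteration). The second
uses `Generator.permuted` (available since NumPy 1.20) to shuffle every row
of an index table independently and keeps the first `sample_size` columns.

```python
import numpy as np


def sample_loop(array, sample_size, n_iterations, rng=None):
    """Assumed baseline: one Generator.choice call per iteration."""
    rng = np.random.default_rng() if rng is None else rng
    array = np.asarray(array)
    # choice(..., replace=False) samples along axis 0 by default.
    return np.stack([
        rng.choice(array, size=sample_size, replace=False)
        for _ in range(n_iterations)
    ])


def sample_permuted(array, sample_size, n_iterations, rng=None):
    """Shuffle each row of an index table, keep the first sample_size columns."""
    rng = np.random.default_rng() if rng is None else rng
    array = np.asarray(array)
    indices = np.tile(np.arange(len(array)), (n_iterations, 1))
    # permuted(axis=1) shuffles every row independently of the others.
    shuffled = rng.permuted(indices, axis=1)
    return array[shuffled[:, :sample_size]]


rng = np.random.default_rng(0)
loop_samples = sample_loop(np.arange(5), 3, 10, rng)
perm_samples = sample_permuted(np.arange(10).reshape(5, 2), 3, 4, rng)
print(loop_samples.shape)  # (10, 3)
print(perm_samples.shape)  # (4, 3, 2)
```

Note that `sample_permuted` shuffles all `len(array)` entries of every row,
whereas the function above performs only `sample_size` swap steps, so the
latter should do less work when `sample_size` is much smaller than the
population. No timings are claimed for either sketch.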
