[Numpy-discussion] Proposal: numpy.random.random_seed

Tue May 17 00:32:35 EDT 2016

Looking at the dask helper function again reminds me of an important caveat
to this approach, which was pointed out to me by Clark Fitzgerald.

If you generate a moderately large number of random seeds in this fashion,
you are quite likely to have collisions due to the Birthday Paradox. For
example, with 32-bit seeds you have a roughly 50% chance of encountering at
least one collision if you generate only 77,000 seeds:
https://en.wikipedia.org/wiki/Birthday_attack
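
As a rough check of that figure, here is a minimal sketch using the standard
birthday approximation 1 - exp(-n(n-1) / (2N)). The function name
collision_probability is only illustrative (it is not part of numpy), and the
calculation assumes seeds are drawn uniformly from 2 ** 32 values:

```python
import math

def collision_probability(n_seeds, n_possible=2 ** 32):
    """Approximate P(at least one collision) among n_seeds uniform draws.

    Hypothetical helper for illustration; uses the usual birthday
    approximation rather than the exact product formula.
    """
    exponent = n_seeds * (n_seeds - 1) / (2.0 * n_possible)
    # -expm1(-x) equals 1 - exp(-x) and is more accurate for small x.
    return -math.expm1(-exponent)

p_77k = collision_probability(77000)   # close to one half
p_1k = collision_probability(1000)     # small, but not zero
print(p_77k, p_1k)
```

For a few hundred or a few thousand seeds the probability stays small, which
is why the approach is still reasonable at that scale.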

The docstring for this function should document this limitation of the
approach, which is still appropriate for a small number of seeds. Our
implementation can also encourage creating these seeds in a single
vectorized call to random_seed, which avoids collisions among the seeds
returned by that call (though not between seeds from separate calls), with
something like the following:

import numpy as np

def random_seed(size):
    # Draw one random base, then take consecutive offsets modulo 2 ** 32.
    base = np.random.randint(2 ** 32)
    offset = np.arange(size)
    return (base + offset) % (2 ** 32)

In principle, I believe this could generate the full 2 ** 32 unique seeds
without any collisions.
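
To make the uniqueness claim concrete, here is a small sketch that counts
distinct values from one vectorized call. The name sequential_seeds, the
explicit rng argument, and the fixed-width dtypes are my own assumptions for
illustration, not a proposed API; the explicit uint64 arithmetic is only there
to avoid platform-dependent integer widths:

```python
import numpy as np

def sequential_seeds(size, rng):
    """Return `size` consecutive uint32 seeds starting at a random base.

    Hypothetical variant of random_seed(size) from this thread.
    """
    base = int(rng.randint(0, 2 ** 32, dtype=np.uint32))
    offset = np.arange(size, dtype=np.uint64)
    # Wrap around at 2 ** 32 so every result is a valid uint32 seed.
    wrapped = (np.uint64(base) + offset) % np.uint64(2 ** 32)
    return wrapped.astype(np.uint32)

rng = np.random.RandomState(0)
n = 100000

seeds = sequential_seeds(n, rng)
n_unique_sequential = len(np.unique(seeds))

# For comparison: n independent draws, which may contain duplicates.
independent = rng.randint(0, 2 ** 32, size=n, dtype=np.uint32)
n_unique_independent = len(np.unique(independent))

print(n_unique_sequential, n_unique_independent)
```

Seeds within one call are distinct by construction as long as size is at most
2 ** 32. Two separate calls can still produce overlapping ranges, so this only
helps when all seeds are requested at once.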

Cryptography experts, please speak up if I'm mistaken here.

On Mon, May 16, 2016 at 8:54 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

> I have recently encountered several use cases for randomly generating
> random number seeds:
>
> 1. When writing a library of stochastic functions that take a seed as an
> input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
>
> 2. When a library needs to produce results that are reproducible after
> calling numpy.random.seed, but does not want to use the functions in
> numpy.random directly. This came up recently in a pandas pull request [2],
> because we want to allow using RandomState objects as an alternative to
> global state in numpy.random. A major advantage of this approach is that it
> provides an obvious alternative to reusing the private numpy.random._mtrand
> [3].
>
> The implementation of this function (and the corresponding method on
> RandomState) is almost trivial, and I've already written such a utility for
> my code:
>
> def random_seed():
>     # numpy.random uses uint32 seeds
>     return np.random.randint(2 ** 32)
>
> The advantage of adding a new method is that it avoids the need for
> explanation by making the intent of code using this pattern obvious. So I
> think it is a good candidate for inclusion in numpy.random.
>
> Any opinions?
>
> [1]
> https://github.com/dask/dask/blob/e0b246221957c4bd618e57246f3a7ccc8863c494/dask/utils.py#L336
> [2] https://github.com/pydata/pandas/pull/13161
> [3] On a side note, if there's no longer a good reason to keep this object
> private, perhaps we should expose it in our public API. It would certainly
> be useful -- scikit-learn is already using it (see links in the pandas PR
> above).
>