[Numpy-discussion] random number generator, entropy and pickling

Mon Apr 25 14:05:05 EDT 2011

On Mon, Apr 25, 2011 at 9:57 AM, Gael Varoquaux
<gael.varoquaux at normalesup.org> wrote:
>
> We thought that we could simply have a PRNG per object, as in:
>
>    def __init__(self, prng=None):
>        if prng is None:
>            prng = np.random.RandomState()
>        self.prng = prng
>
> I don't like this option, because it means that with a given piece of
> code, setting the seed of numpy's PRNG isn't enough to make it
> reproducible.
>
> I couldn't retrieve a handle on a picklable instance for the global PRNG.
>
> The only option I can see would be to use the global numpy PRNG to seed
> an instance specific RandomState, as in:
>
>    def __init__(self, prng=None):
>        if prng is None:
>            prng = np.random.RandomState(np.random.random())
>        self.prng = prng
>
> That way seeding the global PRNG really does control the full random
> number generation. I am wondering if it would have an adverse consequence
> on the entropy of the stream of random numbers. Does anybody have
> suggestions? Advice?
>
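
For reference, here is a minimal sketch of that second option. The
Estimator class and its draw method are made up for illustration. The
sketch draws an integer seed with np.random.randint rather than a float
from np.random.random(), since RandomState expects integer seeds. One
thing to keep in mind on the entropy question: a single 31-bit seed
allows at most 2**31 - 1 distinct starting states per object.

```python
import pickle

import numpy as np


class Estimator(object):
    """Hypothetical object that owns its PRNG (name is illustrative)."""

    def __init__(self, prng=None):
        if prng is None:
            # Derive an integer seed from the global PRNG, so that
            # np.random.seed() also controls this object's stream.
            seed = np.random.randint(2**31 - 1)
            prng = np.random.RandomState(seed)
        self.prng = prng

    def draw(self, n=3):
        return self.prng.random_sample(n)


# Seeding the global PRNG makes the per-object stream reproducible.
np.random.seed(0)
first = Estimator().draw()
np.random.seed(0)
second = Estimator().draw()

# The per-object RandomState pickles along with the object, and the
# unpickled copy continues from the same position in the stream.
est = Estimator()
est.draw()
clone = pickle.loads(pickle.dumps(est))
```

Here first and second hold the same numbers, and est and clone produce
identical draws from this point on.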

If code A relies on code B (e.g., some numpy function) that draws from
the global PRNG, and code B changes (say, in how many numbers it
draws), then the stream of random numbers will no longer be the same.
The point here is that the user wrote code A but depended on code B,
and even though code A was unchanged, their random numbers were not
the same.
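
A small sketch of that failure mode, with made-up function names
(code_a, code_b_v1, code_b_v2) standing in for user code and two
versions of a library function that share the global PRNG:

```python
import numpy as np


def code_b_v1():
    # Hypothetical library function: consumes one number from the
    # global PRNG.
    return np.random.random_sample()


def code_b_v2():
    # Hypothetical later version: same job, but it now consumes two
    # numbers from the global PRNG.
    np.random.random_sample()
    return np.random.random_sample()


def code_a(code_b):
    # User code: unchanged between the two runs below.
    np.random.seed(42)
    code_b()
    return np.random.random_sample(3)


stream_v1 = code_a(code_b_v1)
stream_v2 = code_a(code_b_v2)
```

Even though code_a and its seed are identical in both runs, stream_v1
and stream_v2 differ, because code_b_v2 advanced the shared stream by
one extra draw.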

The situation would be improved if scikits.learn used its own global
RandomState instance.  Then code A will at least give the same stream
of random numbers for a fixed version of scikits.learn.  It should be
made very clear though that the data stream cannot be expected to be
the same across versions.
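
Roughly what I have in mind, as a sketch only. The names _library_prng,
seed_library and library_function are invented here and are not part of
scikits.learn:

```python
import numpy as np

# Hypothetical module-level generator, private to the library and
# separate from numpy's global PRNG.
_library_prng = np.random.RandomState()


def seed_library(seed):
    """Seed the library's own generator (illustrative helper)."""
    _library_prng.seed(seed)


def library_function(n=3):
    """Stand-in for any library routine that needs random numbers."""
    return _library_prng.random_sample(n)


seed_library(0)
before = library_function()

seed_library(0)
np.random.random_sample(10)  # unrelated code using the global PRNG
after = library_function()
```

The draws from the global PRNG in between do not disturb the library's
stream, so before and after are equal. A change inside the library
itself could still alter the stream, hence the caveat about versions.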

As to each object having its own RandomState instance, I definitely
see that it makes restoring the overall state of a piece of code
harder, but perhaps utility functions could make this easier.  I can
imagine that users might want to arbitrarily set the seed for a
particular object in the midst of a larger piece of code.  Perhaps
the user is testing known failure cases of an algorithm A interacting
with algorithm B.  If the user wants to loop over known seeds which
cause algorithm A to fail but needs algorithm B to keep its seed
fixed, then it seems like having a global seed makes this more
difficult.  In that sense, it might be desirable to have "independent"
PRNGs.  On the other hand, maybe that is an uncommon use case that
could be handled through manually setting the seed.
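
To make the use case concrete, here is a sketch with per-object PRNGs
and a pair of utility functions for saving and restoring state. The
Algorithm class, the snapshot and restore helpers, and the seed values
are all invented for illustration; get_state, set_state and seed are
existing RandomState methods.

```python
import numpy as np


class Algorithm(object):
    """Hypothetical algorithm that owns its PRNG."""

    def __init__(self, prng=None):
        if prng is None:
            prng = np.random.RandomState()
        self.prng = prng

    def run(self, n=3):
        return self.prng.random_sample(n)


def snapshot(*objects):
    """Illustrative helper: record the PRNG state of each object."""
    return [obj.prng.get_state() for obj in objects]


def restore(states, *objects):
    """Illustrative helper: put each object's PRNG back to a state."""
    for obj, state in zip(objects, states):
        obj.prng.set_state(state)


algo_a = Algorithm()
algo_b = Algorithm(np.random.RandomState(1234))
b_start = snapshot(algo_b)

a_outputs = []
b_outputs = []
for seed in [3, 17, 101]:  # stand-ins for seeds known to break A
    algo_a.prng.seed(seed)    # vary only algorithm A
    restore(b_start, algo_b)  # keep algorithm B fixed
    a_outputs.append(algo_a.run())
    b_outputs.append(algo_b.run())
```

Algorithm B produces the same numbers on every pass through the loop
while algorithm A's stream changes with each seed, which is awkward to
arrange when both share one global PRNG.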

Post back on what you guys decide.