[Numpy-discussion] Proposal: numpy.random.random_seed

Robert Kern robert.kern at gmail.com
Tue May 17 04:49:45 EDT 2016


On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
> On Tue, May 17, 2016 at 12:18 AM, Robert Kern <robert.kern at gmail.com>
> wrote:
>>
>> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
>> > 1. When writing a library of stochastic functions that take a seed as
>> > an input argument, and some of these functions call multiple other
>> > such stochastic functions. Dask is one such example [1].
>>
>> Can you clarify the use case here? I don't really know what you are
>> doing here, but I'm pretty sure this is not the right approach.
>
> Here's a contrived example. Suppose I've written a simulator for cars
> that consists of a number of loosely connected components (e.g., an
> engine, brakes, etc.). The behavior of each component of our simulator
> is stochastic, but we want everything to be fully reproducible, so we
> need to use seeds or RandomState objects.
>
> We might write our simulate_car function like the following:
>
> def simulate_car(engine_config, brakes_config, seed=None):
>     rs = np.random.RandomState(seed)
>     engine = simulate_engine(engine_config, seed=rs.random_seed())
>     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
>     ...
>
> The problem with passing the same RandomState object (either explicitly
> or dropping the seed argument entirely and using the global state) to
> both simulate_engine and simulate_brakes is that it breaks
> encapsulation -- if I change what I do inside simulate_engine, it also
> affects the brakes.
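
A minimal sketch of the coupling being described (assuming both components
draw directly from the one shared stream):

import numpy as np

rs = np.random.RandomState(0)
engine_noise = rs.randn(3)  # engine consumes draws 1-3
brakes_noise = rs.randn(3)  # brakes consumes draws 4-6

rs = np.random.RandomState(0)
engine_noise = rs.randn(4)  # engine internals change: one extra draw...
brakes_noise = rs.randn(3)  # ...and the brakes' inputs silently shift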

That's a little too contrived, IMO. In most such simulations, the different
components interact with each other in the normal course of the simulation;
that's why they are joined together in the same simulation instead of being
two separate runs. Unless the components are being run across a process or
thread boundary (a la dask below), where true nondeterminism comes into
play, I don't think you want these semi-independent streams. This seems to
be the advice du jour from the agent-based modeling community.

> The dask use case is actually pretty different -- the intent is to
> create many random numbers in parallel using multiple threads or
> processes (possibly in a distributed fashion). I know that skipping
> ahead is the standard way to get independent number streams for parallel
> sampling, but that isn't exposed in numpy.random, and setting distinct
> seeds seems like a reasonable alternative for scientific computing use
> cases.

Forget about integer seeds. Those are for human convenience. If you're not
jotting them down in your lab notebook in pen, you don't want an integer
seed.

What you want is a function that returns many RandomState objects that are
hopefully spread around the MT19937 state space enough that they are
essentially independent (in the absence of true jumpahead). A better
implementation of such a function would look something like this:

import numpy as np

def spread_out_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for _ in range(n):
        # Seed array about the size of the MT19937 state; pass
        # dtype=np.uint32 under numpy 1.11+.
        seed_array = root_prng.randint(1 << 32, size=624)
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs
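
Usage is then a matter of handing each task its own RandomState, e.g.:

engine_prng, brakes_prng = spread_out_prngs(2)
engine_noise = engine_prng.standard_normal(1000)
brakes_noise = brakes_prng.standard_normal(1000)

Adding or removing draws in one component no longer perturbs the other's
stream.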

Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:

  https://github.com/bashtage/ng-numpy-randomstate
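
With the pcg64 generator there, something along these lines gives
guaranteed-independent streams (a sketch; check that package's README for
the exact API, which may differ):

import randomstate.prng.pcg64 as pcg64

# Same seed, distinct stream selectors -> independent sequences.
stream0 = pcg64.RandomState(12345, 0)
stream1 = pcg64.RandomState(12345, 1)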

--
Robert Kern