I have recently encountered several use cases for randomly generating random number seeds:

1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].

2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].

The implementation of this function (and the corresponding method on RandomState) is almost trivial, and I've already written such a utility for my code:

    def random_seed():
        # numpy.random uses uint32 seeds
        return np.random.randint(2 ** 32)

The advantage of adding a new method is that it avoids the need for explanation by making the intent of code using this pattern obvious. So I think it is a good candidate for inclusion in numpy.random.

Any opinions?

[1] https://github.com/dask/dask/blob/e0b246221957c4bd618e57246f3a7ccc8863c494/d...
[2] https://github.com/pydata/pandas/pull/13161
[3] On a side note, if there's no longer a good reason to keep this object private, perhaps we should expose it in our public API. It would certainly be useful -- scikit-learn is already using it (see links in the pandas PR above).
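(To make use case 2 concrete, here is a minimal sketch; library_function is a hypothetical library entry point, not an existing API:)

    import numpy as np

    def random_seed():
        # numpy.random uses uint32 seeds
        return np.random.randint(2 ** 32)

    def library_function(data, random_state=None):
        # Reproducible after np.random.seed(...), but all actual sampling
        # goes through a local RandomState rather than the numpy.random
        # module-level functions.
        if random_state is None:
            random_state = random_seed()
        rs = np.random.RandomState(random_state)
        return data + rs.standard_normal(len(data))

    np.random.seed(1234)
    print(library_function(np.zeros(3)))  # same output on every run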
Looking at the dask helper function again reminds me of an important caveat to this approach, which was pointed out to me by Clark Fitzgerald. If you generate a moderately large number of random seeds in this fashion, you are quite likely to have collisions due to the birthday paradox. For example, you have a 50% chance of encountering at least one collision if you generate only about 77,000 seeds: https://en.wikipedia.org/wiki/Birthday_attack

The docstring for this function should document this limitation of the approach, which is still appropriate for a small number of seeds. Our implementation can also encourage creating these seeds in a single vectorized call to random_seed, which can significantly reduce the likelihood of collisions between seeds generated in that call, with something like the following:

    def random_seed(size):
        base = np.random.randint(2 ** 32)
        offset = np.arange(size)
        return (base + offset) % (2 ** 32)

In principle, I believe this could generate the full 2 ** 32 unique seeds without any collisions. Cryptography experts, please speak up if I'm mistaken here.
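(A quick numerical check of that 77,000-seed figure using the standard birthday approximation; collision_probability is just an illustrative helper name:)

    import math

    def collision_probability(n_seeds, n_values=2 ** 32):
        # Birthday approximation: P ~= 1 - exp(-n**2 / (2 * N))
        return -math.expm1(-n_seeds ** 2 / (2.0 * n_values))

    print(collision_probability(77000))  # ~0.50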
On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer <shoyer@gmail.com> wrote:

> 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].

Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.

> A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].

It's only pseudo-private. This is an authorized use of it. However, for this case, I usually just pass around the numpy.random module itself and let duck-typing take care of the rest.

> [3] On a side note, if there's no longer a good reason to keep this object private, perhaps we should expose it in our public API. It would certainly be useful -- scikit-learn is already using it (see links in the pandas PR above).

Adding a public get_global_random_state() function might be in order. Originally, I wanted there to be *some* barrier to entry, but just grabbing it to use as a default RandomState object is definitely an intended use of it. It's not going to disappear.

-- Robert Kern
On Tue, May 17, 2016 at 12:18 AM, Robert Kern <robert.kern@gmail.com> wrote:

> Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.

Here's a contrived example. Suppose I've written a simulator for cars that consists of a number of loosely connected components (e.g., an engine, brakes, etc.). The behavior of each component of our simulator is stochastic, but we want everything to be fully reproducible, so we need to use seeds or RandomState objects.

We might write our simulate_car function like the following:

    def simulate_car(engine_config, brakes_config, seed=None):
        rs = np.random.RandomState(seed)
        engine = simulate_engine(engine_config, seed=rs.random_seed())
        brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
        ...

The problem with passing the same RandomState object (either explicitly, or dropping the seed argument entirely and using the global state) to both simulate_engine and simulate_brakes is that it breaks encapsulation -- if I change what I do inside simulate_engine, it also affects the brakes.

The dask use case is actually pretty different -- the intent is to create many random numbers in parallel using multiple threads or processes (possibly in a distributed fashion). I know that skipping ahead is the standard way to get independent number streams for parallel sampling, but that isn't exposed in numpy.random, and setting distinct seeds seems like a reasonable alternative for scientific computing use cases.

> It's only pseudo-private. This is an authorized use of it. However, for this case, I usually just pass around the numpy.random module itself and let duck-typing take care of the rest.

I like the duck-typing approach. That's very elegant. If this is an authorized use of the global RandomState object, let's document it! Otherwise cautious library maintainers like myself will discourage using it :).

> Adding a public get_global_random_state() function might be in order.

Yes, possibly.
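(For what it's worth, a runnable variant of that sketch using only existing RandomState methods -- randint stands in for the proposed random_seed method, and the simulate_* bodies are placeholders:)

    import numpy as np

    def simulate_engine(config, seed=None):
        rs = np.random.RandomState(seed)
        return rs.standard_normal()  # placeholder for a real engine model

    def simulate_brakes(config, seed=None):
        rs = np.random.RandomState(seed)
        return rs.standard_normal()  # placeholder for a real brake model

    def simulate_car(engine_config, brakes_config, seed=None):
        rs = np.random.RandomState(seed)
        # Sprout one seed per component so each is encapsulated:
        # changing the engine model cannot perturb the brake draws.
        engine = simulate_engine(engine_config, seed=rs.randint(2 ** 32))
        brakes = simulate_brakes(brakes_config, seed=rs.randint(2 ** 32))
        return engine, brakes

    print(simulate_car({}, {}, seed=1234))  # fully reproducible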
On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer <shoyer@gmail.com> wrote:

> The problem with passing the same RandomState object (either explicitly, or dropping the seed argument entirely and using the global state) to both simulate_engine and simulate_brakes is that it breaks encapsulation -- if I change what I do inside simulate_engine, it also affects the brakes.

That's a little too contrived, IMO. In most such simulations, the different components interact with each other in the normal course of the simulation; that's why they are both joined together in the same simulation instead of being two separate runs. Unless the components are being run across a process or thread boundary (a la dask below) where true nondeterminism comes into play, I don't think you want these semi-independent streams. This seems to be the advice du jour from the agent-based modeling community.

> The dask use case is actually pretty different -- the intent is to create many random numbers in parallel using multiple threads or processes (possibly in a distributed fashion). I know that skipping ahead is the standard way to get independent number streams for parallel sampling, but that isn't exposed in numpy.random, and setting distinct seeds seems like a reasonable alternative for scientific computing use cases.

Forget about integer seeds. Those are for human convenience. If you're not jotting them down in your lab notebook in pen, you don't want an integer seed.

What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:

    def spread_out_prngs(n, root_prng=None):
        if root_prng is None:
            root_prng = np.random
        elif not isinstance(root_prng, np.random.RandomState):
            root_prng = np.random.RandomState(root_prng)
        sprouted_prngs = []
        for i in range(n):
            seed_array = root_prng.randint(1 << 32, size=624)  # dtype=np.uint32 under 1.11
            sprouted_prngs.append(np.random.RandomState(seed_array))
        return sprouted_prngs

Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:

https://github.com/bashtage/ng-numpy-randomstate

-- Robert Kern
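(A usage sketch of that helper, assuming the definition above:)

    # Ten hopefully-independent generators sprouted from one root seed:
    prngs = spread_out_prngs(10, root_prng=12345)
    samples = [prng.standard_normal(4) for prng in prngs]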
On Tue, May 17, 2016 at 4:49 AM, Robert Kern <robert.kern@gmail.com> wrote:

> That's a little too contrived, IMO. In most such simulations, the different components interact with each other in the normal course of the simulation; that's why they are both joined together in the same simulation instead of being two separate runs. Unless the components are being run across a process or thread boundary (a la dask below) where true nondeterminism comes into play, I don't think you want these semi-independent streams. This seems to be the advice du jour from the agent-based modeling community.

I had a similar use case where I had to switch to using several RandomStates: In a Monte Carlo experiment with increasing sample size, I want two random variables, x, y, to have the same draws in the common initial observations. If I draw x and y sequentially and then increase the number of observations for the simulation, it completely changes the draws for the second variable if they use a common RandomState. With separate random states, increasing from 1000 to 1200 observations leaves the first 1000 draws unchanged. (This reduces the Monte Carlo noise, for example when calculating the power of a hypothesis test as a function of the sample size.)

Josef
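(A minimal sketch of that pattern; the names and seeds are illustrative. With one RandomState per variable, growing the sample leaves the earlier draws unchanged:)

    import numpy as np

    def draw_xy(nobs, seed_x=1, seed_y=2):
        # One stream per variable: extending nobs extends each stream
        # without reshuffling the other variable's draws.
        rs_x = np.random.RandomState(seed_x)
        rs_y = np.random.RandomState(seed_y)
        return rs_x.standard_normal(nobs), rs_y.standard_normal(nobs)

    x1000, y1000 = draw_xy(1000)
    x1200, y1200 = draw_xy(1200)
    assert np.allclose(x1000, x1200[:1000])
    assert np.allclose(y1000, y1200[:1000])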
On May 17, 2016 1:50 AM, "Robert Kern" <robert.kern@gmail.com> wrote:

> What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:
>
> def spread_out_prngs(n, root_prng=None): ...

Maybe a nice way to encapsulate this in the RandomState interface would be a method RandomState.random_state() that generates and returns a new child RandomState.

> Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams: ...

But unfortunately I'm not sure how to make my interface suggestion above work on top of one of these RNGs, because for RandomState.random_state you really want a tree of independent RNGs and the fancy new PRNGs only provide a single flat namespace :-/. And even more annoyingly, the tree API is actually a nicer API, because with a flat namespace you have to know up front about all possible RNGs your code will use, which is an unfortunate global coupling that makes it difficult to compose programs out of independent pieces, while the RandomState.random_state approach composes beautifully.

Maybe there's some clever way to allocate a 64-bit namespace to make it look tree-like? I'm not sure 64 bits is really enough...

-n
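(A minimal sketch of the suggested method as a subclass; the class name SproutingRandomState and the 624-word seed-array trick are carried over from the helper above, and this is not a proposed numpy API:)

    import numpy as np

    class SproutingRandomState(np.random.RandomState):
        def random_state(self):
            # Sprout a child generator by drawing a full-size seed array,
            # in the spirit of spread_out_prngs above.
            # (dtype argument requires numpy >= 1.11)
            seed_array = self.randint(1 << 32, size=624, dtype=np.uint32)
            return SproutingRandomState(seed_array)

    root = SproutingRandomState(1234)
    child = root.random_state()
    grandchild = child.random_state()  # the API composes as a tree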
On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith <njs@pobox.com> wrote:

> Maybe a nice way to encapsulate this in the RandomState interface would be a method RandomState.random_state() that generates and returns a new child RandomState.

I disagree. This is a workaround in the absence of proper jumpahead or guaranteed-independent streams. I would not encourage it.

> But unfortunately I'm not sure how to make my interface suggestion above work on top of one of these RNGs, because for RandomState.random_state you really want a tree of independent RNGs and the fancy new PRNGs only provide a single flat namespace :-/. [...] Maybe there's some clever way to allocate a 64-bit namespace to make it look tree-like? I'm not sure 64 bits is really enough...

MT19937 doesn't have a "tree" any more than the others. It's the same flat state space. You are just getting the illusion of a tree by hoping that you never collide. You ought to think about precisely the same global coupling issues with MT19937 as you do with guaranteed-independent streams. Hope-and-prayer isn't really a substitute for properly engineering your problem. It's just a moral hazard to promote this method to the main API.

-- Robert Kern
On Tue, May 17, 2016 at 10:41 AM, Robert Kern <robert.kern@gmail.com> wrote:

> MT19937 doesn't have a "tree" any more than the others. It's the same flat state space. You are just getting the illusion of a tree by hoping that you never collide. You ought to think about precisely the same global coupling issues with MT19937 as you do with guaranteed-independent streams. Hope-and-prayer isn't really a substitute for properly engineering your problem. It's just a moral hazard to promote this method to the main API.

Nonsense.

If your definition of "hope and prayer" includes assuming that we won't encounter a random collision in a 2**19937 state space, then literally all engineering is hope-and-prayer. A collision could happen, but if it does it's overwhelmingly more likely to happen because of a flaw in the mathematical analysis, or a bug in the implementation, or because random quantum fluctuations caused you and your program to suddenly be transported to a parallel world where 1 + 1 = 1, than that you just got unlucky with your random state. And all of these hazards apply equally to both MT19937 and more modern PRNGs.

...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API, so whether or not it turns out to be possible I think we should at least be allowed to have a discussion about whether there's some way to give it to them. It's not even 100% out of the question that we conclude that existing PRNGs are buggy because they don't take this use case into account -- it would be far from the first time that numpy found itself going beyond the limits of older numerical tools that weren't designed to build the kind of large composable systems that numpy gets used for.

MT19937's state space is large enough that you could explicitly encode a "tree seed" into it, even if you don't trust the laws of probability -- e.g., you start with a RandomState with id [], then its children have id [0], [1], [2], ..., and their children have ids [0, 0], [0, 1], ..., [1, 0], ..., and you write these into the state (probably after sending them through some bit-mixing permutation), to guarantee non-collision.

Or if you do trust the laws of probability, then the randomly-sample-a-PRNG approach is not 100% out of the question even for more modern PRNGs. For example, let's take a PRNG with a 64-bit stream id and a 64-bit state space per stream. Suppose that we know that in our application each PRNG will be used to draw 2**48 64-bit samples (= 2 pebibytes of random data), and that we will use 2**32 PRNGs (= total of 8 yobibytes of random data). If we randomly initialize each PRNG with a 128-bit (stream id, state) pair, then the chance that two of our streams will overlap is about the chance of getting a collision in an 80-bit state space (128 - 48) on 2**32 draws, which can be approximated [1] as

    1 - exp(-(2**32)**2 / (2 * 2**80)) = 1 - exp(-2**64 / 2**81)
                                       = 1 - exp(-1 / 2**17)
                                       ~= 7.6 chances in a million

To put this number in some kind of context, Sandia says that on 128-pebibyte clusters (which I assume is the kind of system you use for a simulation that needs multiple yobibytes of random data!) they see a double-bit (!) DRAM fault about every 4 minutes [2]. Also, by the time we've drawn 2**48 samples from a single 64-bit stream, we've already violated independence, because in 2**48 draws you should see a repeated value with probability ~1.0 (by the same birthday approximation as above), but we see them with probability 0.

Alternatively, for a PRNG with a 256-bit state space, or a 128-bit stream id + 128-bit state space, one can draw 2**64 64-bit samples without violating independence, *and* do this for, say, 2**70 randomly initialized streams, and the probability of collision would still be on the order of 10**-16.
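(A quick numerical check of those two estimates, using the same birthday approximation [1]; collision_probability is an illustrative helper name:)

    import math

    def collision_probability(n_streams, effective_bits):
        # Birthday approximation: P ~= 1 - exp(-n**2 / (2 * 2**bits))
        return -math.expm1(-float(n_streams) ** 2 / (2.0 * 2.0 ** effective_bits))

    print(collision_probability(2 ** 32, 80))   # ~7.6e-06, i.e. 7.6 in a million
    print(collision_probability(2 ** 70, 192))  # ~1.1e-16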
Unless I totally messed up these calculations, I'd conclude that almost everyone would be totally fine with a .random_state() method combined with any reasonable generator, and for some reasonable generators (those with more than, like, ~192 bits of state?) it's just unconditionally safe for all physically realistic problem sizes.

-n

[1] https://en.wikipedia.org/wiki/Birthday_problem#Approximations
[2] http://www.fiala.me/pubs/papers/sc12-redmpi.pdf

-- Nathaniel J. Smith -- https://vorpus.org
On Wed, May 18, 2016 at 1:14 AM, Nathaniel Smith <njs@pobox.com> wrote:

> If your definition of "hope and prayer" includes assuming that we won't encounter a random collision in a 2**19937 state space, then literally all engineering is hope-and-prayer. [...] And all of these hazards apply equally to both MT19937 and more modern PRNGs.

Granted.

> ...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API,

I remain unconvinced on this mark. Grumpily.

> so whether or not it turns out to be possible I think we should at least be allowed to have a discussion about whether there's some way to give it to them.

I'm not shutting down discussion of the option. I *implemented* the option. I think that discussing whether it should be part of the main API is premature. There probably ought to be a paper or three out there supporting its safety and utility first. Let the utility function version flourish first.

> It's not even 100% out of the question that we conclude that existing PRNGs are buggy because they don't take this use case into account [...]
>
> MT19937's state space is large enough that you could explicitly encode a "tree seed" into it, even if you don't trust the laws of probability -- e.g., you start with a RandomState with id [], then its children have id [0], [1], [2], ..., and their children have ids [0, 0], [0, 1], ..., [1, 0], ..., and you write these into the state (probably after sending them through some bit-mixing permutation), to guarantee non-collision.

Sure. Not entirely sure if that can be done without preallocating the branching factor or depth, but I'm sure there's some fancy combinatoric representation of an unbounded tree that could be exploited here. It seems likely to me that such a thing could be done with the stream IDs of PRNGs that support that.

> Or if you do trust the laws of probability, then the randomly-sample-a-PRNG approach is not 100% out of the question even for more modern PRNGs.

Tread carefully here. PRNGs are not RNGs. Using the root PRNG to generate N new seeds for the same PRNG does not necessarily generate N good, uncorrelated streams with high probability. Overlap is not the only factor in play. Long-term correlations happen even in the case where the N streams do not overlap, and the structure of the root PRNG could well generate N such correlated seeds.

http://www.adv-radio-sci.net/12/75/2014/

See section 4.4. It does look like one can get good MT19937 streams with sequential integer seeds, when expanded by the standard sub-PRNG that we use for that purpose (yay!). However, if you use a different (but otherwise good-quality and general-purpose) sub-PRNG to expand the key, you get increasing numbers of failures as you increase the number of parallel streams. It is not explored in the paper exactly why this is the case, but I suspect that the similarity of the second sub-PRNG algorithm to that of the MT itself is part of the problem.

It is *likely* that the scheme I implemented will be okay (our array-seed expansion algorithm is similar to the integer-seed expansion and *probably* insulates the sprouted MTs from their MT generating source), but that remains to be demonstrated.

-- Robert Kern
> ...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API,

Honestly, I am lost in the math -- but like any good engineer, I want to accomplish something anyway :-) I trust you guys to get this right -- or at least document what's "wrong" with it.

But, if I'm reading the use case that started all this correctly, it closely matches my use case. That is, I have a complex model with multiple independent "random" processes. And we want to be able to reproduce simulations EXACTLY -- our users get confused when the results are "different", even if in a statistically insignificant way.

At the moment we are using one RNG, with one seed, for everything. So we get reproducible results, but if one thing is changed, then the entire simulation is different -- which is OK, but it would be nicer to have each process using its own RNG stream with its own seed. However, it matters not one whit if those seeds are independent -- the processes are different; you'd never notice if they were using the same PRN stream, because they are used differently. So a "fairly low probability of a clash" would be totally fine. Granted, in a Monte Carlo simulation, it could be disastrous... :-)

I guess the point is -- do something reasonable, and document its limitations, and we're all fine :-)

And thanks for giving your attention to this.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
On Wed, May 18, 2016 at 4:50 PM, Chris Barker <chris.barker@noaa.gov> wrote:

> However, it matters not one whit if those seeds are independent -- the processes are different; you'd never notice if they were using the same PRN stream, because they are used differently. So a "fairly low probability of a clash" would be totally fine.

Well, the main question is: do you need to be able to spawn independent streams at arbitrary points to an arbitrary depth without coordination between processes? The necessity for multiple independent streams per se is not contentious.

-- Robert Kern
On Wed, May 18, 2016 at 12:01 PM, Robert Kern <robert.kern@gmail.com> wrote:

> Well, the main question is: do you need to be able to spawn independent streams at arbitrary points to an arbitrary depth without coordination between processes? The necessity for multiple independent streams per se is not contentious.

I'm similar to Chris, and didn't try to figure out the details of what you are talking about.

However, if there are functions getting into numpy that help in using a best practice, even if it's not bulletproof, then it's still better than homemade approaches. If it gets in soon, then we can use it in a few years (given dependency lag). At that point there should be more distributed, nested simulation-based algorithms where we don't know in advance how far we have to go to get reliable numbers or convergence. (But I don't see anything like that right now.)

Josef
On Wed, May 18, 2016 at 6:20 PM, <josef.pktd@gmail.com> wrote:

> However, if there are functions getting into numpy that help in using a best practice, even if it's not bulletproof, then it's still better than homemade approaches. If it gets in soon, then we can use it in a few years (given dependency lag). At that point there should be more distributed, nested simulation-based algorithms where we don't know in advance how far we have to go to get reliable numbers or convergence.
>
> (But I don't see anything like that right now.)

Current best practice is to use PRNGs with settable streams (or fixed jumpahead for those PRNGs cursed to not have settable streams but blessed to have super-long periods). The way to get those into numpy is to help Kevin Sheppard finish:

https://github.com/bashtage/ng-numpy-randomstate

He's done nearly all of the hard work already.

-- Robert Kern
On Wed, May 18, 2016 at 5:07 AM, Robert Kern <robert.kern@gmail.com> wrote:

> > ...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API,
>
> I remain unconvinced on this mark. Grumpily.

Sorry for getting grumpy :-). The engineering reasons seem pretty obvious to me though? If you have any use case for independent streams at all, and you're writing code that's intended to live inside a library's abstraction barrier, then you need some way to choose your streams to avoid colliding with arbitrary other code that the end-user might assemble alongside yours as part of their final program. So AFAICT you have two options: either you need a "tree-style" API for allocating these streams, or else you need to add some explicit API to your library that lets the end-user control in detail which streams you use. Both are possible, but the latter is obviously undesirable if you can avoid it, since it breaks the abstraction barrier, making your library more complicated to use and harder to evolve.

> I'm not shutting down discussion of the option. I *implemented* the option. I think that discussing whether it should be part of the main API is premature. There probably ought to be a paper or three out there supporting its safety and utility first. Let the utility function version flourish first.

OK -- I guess this particularly makes sense given how extra-tightly-constrained we currently are in fixing mistakes in np.random. But I feel like in the end the right place for this really is inside the RandomState interface, because the person implementing RandomState is the one best placed to understand (a) the gnarly technical details here, and (b) how those change depending on the particular PRNG in use. I don't want to end up with a bunch of subtly-buggy utility functions in non-specialist libraries like dask -- so we should be trying to help downstream users figure out how to actually get this into np.random?

> Sure. Not entirely sure if that can be done without preallocating the branching factor or depth, but I'm sure there's some fancy combinatoric representation of an unbounded tree that could be exploited here. It seems likely to me that such a thing could be done with the stream IDs of PRNGs that support that.

I'm pretty sure you do have to preallocate both the branching factor and depth, since getting the collision-free guarantee requires that across the universe of possible tree addresses, each state id gets used at most once -- and there are finitely many state ids. But as a practical matter, saying "you can only sprout up to 2**32 new states out of any given state, and you can only nest them 600 deep, and exceeding these bounds is an error" would still be "composable enough" for ~all practical purposes.

> Tread carefully here. PRNGs are not RNGs. Using the root PRNG to generate N new seeds for the same PRNG does not necessarily generate N good, uncorrelated streams with high probability. Overlap is not the only factor in play. Long-term correlations happen even in the case where the N streams do not overlap, and the structure of the root PRNG could well generate N such correlated seeds.
>
> [...]
>
> It is *likely* that the scheme I implemented will be okay (our array-seed expansion algorithm is similar to the integer-seed expansion and *probably* insulates the sprouted MTs from their MT generating source), but that remains to be demonstrated.

Right... I think the way to think about this is to split it into two pieces.

First, there's the issue that correlated streams exist at all. That they do for MT is annoying and closely related to why we prefer other PRNGs these days. When doing the analysis of the collision probability of randomly sprouted generators, this effectively reduces the size of the space and increases the risk of collision. For MT this by itself isn't a big deal because the space is so ludicrously large, but for the more modern PRNGs with smaller state spaces you definitely want these correlated streams not to exist at all. Fortunately I believe this is a design criterion that modern PRNGs take into account, but yeah, one needs to check this. (This kind of issue is also why I have a fondness for PRNGs designed on the "get a bigger hammer" principle, like AES-CTR :-).)

Second, there's the issue that that paper talks about, where *if* correlated streams exist, then the structure of your PRNG might be such that when sampling a new seed from an old PRNG, you're unreasonably likely to hit one of them. (The most obvious example of this would be if you use an archaic PRNG design that works by outputting its internal state and then mixing it -- if you use the output of this PRNG to seed a new PRNG then the two will be identical modulo some lag!) If your sprouting procedure is subject to this kind of phenomenon, then it totally invalidates the use of simple probability theory to analyze the chance of collisions. But I ignored this in my analysis because it's definitely solvable :-). All we need is some deterministic function that we can apply to the output of our original PRNG that will give us back something "effectively random" to use as our sprouted seed, totally destroying all correlations. And this problem of destroying correlations is well-studied in the crypto world (basically destroying these correlations *is* the entire problem of cryptographic primitive design), so even if we have doubts about the specific initialization functions usually used with MT, we have available to us lots of primitives that we're *very* confident will do a *very* thorough job. And sprouting isn't a performance-sensitive operation, so it doesn't matter if it takes a little longer using some crypto hash instead of the classic MT seeding algorithm. And then the simple birthday calculations are appropriate again.

-n

-- Nathaniel J. Smith -- https://vorpus.org
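(A hedged sketch of that hash-based sprouting; sprout_seed is a hypothetical helper, and SHA-256 stands in for any strong mixing function:)

    import hashlib
    import numpy as np

    def sprout_seed(parent, child_index):
        # Hash entropy drawn from the parent together with the child's
        # index to get an "effectively random" 624-word seed array,
        # destroying any structural correlation with the parent stream.
        raw = parent.bytes(32) + child_index.to_bytes(8, 'big')
        material = b''
        counter = 0
        while len(material) < 624 * 4:
            material += hashlib.sha256(raw + counter.to_bytes(4, 'big')).digest()
            counter += 1
        return np.frombuffer(material[:624 * 4], dtype=np.uint32)

    root = np.random.RandomState(1234)
    children = [np.random.RandomState(sprout_seed(root, i)) for i in range(4)]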
On Wed, May 18, 2016 at 7:56 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, May 18, 2016 at 5:07 AM, Robert Kern <robert.kern@gmail.com>
wrote:
On Wed, May 18, 2016 at 1:14 AM, Nathaniel Smith <njs@pobox.com> wrote:
...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API,
I remain unconvinced on this mark. Grumpily.
Sorry for getting grumpy :-).
And my apologies for some unwarranted hyperbole. I think we're both converging on a reasonable approach, though.
The engineering reasons seem pretty obvious to me though?
Well, I mean, engineers want lots of things. I suspect that most engineers *really* just want to call `numpy.random.seed(8675309)` at the start and never explicitly pass around separate streams. There's an upside to that in terms of code simplicity. There are also significant limitations and constraints. Ultimately, the upside against the alternative of passing around RandomState objects is usually overweighed by the limitations, so best practice is to pass around RandomState objects. I acknowledge that there exists an upside to the splitting API, but I don't think it's a groundbreaking improvement over the alternative current best practice. It's also unclear to me how often situations that really demonstrate the upside come into play; in my experience a lot of these situations are already structured such that preallocating N streams is the natural thing to do. The limitations and constraints are currently underexplored, IMO; and in this conservative field, pessimism is warranted.
If you have any use case for independent streams at all, and you're writing code that's intended to live inside a library's abstraction barrier, then you need some way to choose your streams to avoid colliding with arbitrary other code that the end-user might assemble alongside yours as part of their final program. So AFAICT you have two options: either you need a "tree-style" API for allocating these streams, or else you need to add some explicit API to your library that lets the end-user control in detail which streams you use. Both are possible, but the latter is obviously undesireable if you can avoid it, since it breaks the abstraction barrier, making your library more complicated to use and harder to evolve.
so whether or not it turns out to be possible I think we should at least be allowed to have a discussion about whether there's some way to give it to them.
I'm not shutting down discussion of the option. I *implemented* the
ACK option.
I think that discussing whether it should be part of the main API is premature. There probably ought to be a paper or three out there supporting its safety and utility first. Let the utility function version flourish first.
OK -- I guess this particularly makes sense given how extra-tightly-constrained we currently are in fixing mistakes in np.random. But I feel like in the end the right place for this really is inside the RandomState interface, because the person implementing RandomState is the one best placed to understand (a) the gnarly technical details here, and (b) how those change depending on the particular PRNG in use. I don't want to end up with a bunch of subtly-buggy utility functions in non-specialist libraries like dask -- so we should be trying to help downstream users figure out how to actually get this into np.random?
I think this is an open research area. An enterprising grad student could milk this for a couple of papers analyzing how to do this safely for a variety of PRNGs. I don't think we can hash this out in an email thread or PR. So yeah, eventually there might be an API on RandomState for this, but it's way too premature to do so right now, IMO. Maybe start with a specialized subclass of RandomState that adds this experimental API. In ng-numpy-randomstate. ;-) But if someone has spare time to work on numpy.random, for God's sake, use it to review @gfyoung's PRs instead.
It's not even 100% out of the question that we conclude that existing PRNGs are buggy because they don't take this use case into account -- it would be far from the first time that numpy found itself going beyond the limits of older numerical tools that weren't designed to build the kind of large composable systems that numpy gets used for.
MT19937's state space is large enough that you could explicitly encode a "tree seed" into it, even if you don't trust the laws of probability -- e.g., you start with a RandomState with id [], then its children have id [0], [1], [2], ..., and their children have ids [0, 0], [0, 1], ..., [1, 0], ..., and you write these into the state (probably after sending them through some bit-mixing permutation), to guarantee non-collision.
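For concreteness, here is a minimal sketch of that tree-id scheme, under two assumptions: it uses RandomState's array-seed support rather than writing the 624-word MT19937 state directly (so, unlike the state-writing version, it leans on the seeding routine's bit-mixing and does not strictly guarantee non-collision), and `tree_rng` is a hypothetical helper, not a numpy API:

import numpy as np

def tree_rng(path):
    # `path` is a tree address: () for the root, (0,) for its first
    # child, (0, 1) for that child's second child, and so on.
    # The length prefix keeps a path from aliasing its own prefixes.
    words = np.array([len(path)] + list(path), dtype=np.uint32)
    return np.random.RandomState(words)

root = tree_rng(())
child = tree_rng((0,))
grandchild = tree_rng((0, 1))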
Sure. I'm not entirely sure that can be done without preallocating the branching factor or depth, but I'm sure there's some fancy combinatorial representation of an unbounded tree that could be exploited here. It seems likely to me that such a thing could be done with the stream IDs of PRNGs that support them.
I'm pretty sure you do have to preallocate both the branching factor and depth, since getting the collision-free guarantee requires that across the universe of possible tree addresses, each state id gets used at most once -- and there are finitely many state ids. But as a practical matter, saying "you can only sprout up to 2**32 new states out of any given state, and you can only nest them 600 deep, and exceeding these bounds is an error" would still be "composable enough" for ~all practical purposes.
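To make the arithmetic concrete, here is a sketch of a collision-free address encoding under those preallocated bounds (`path_id` is a hypothetical helper, and the bounds are the toy figures from above):

def path_id(path, branching=2**32, depth=600):
    # Treat the address as a fixed-width number in base branching+1,
    # reserving digit 0 for "no child at this level"; every path of
    # length <= depth then maps to a distinct integer.
    if len(path) > depth:
        raise ValueError("tree nested too deep")
    ident = 0
    for level in range(depth):
        digit = path[level] + 1 if level < len(path) else 0
        if digit > branching:
            raise ValueError("too many children at one level")
        ident = ident * (branching + 1) + digit
    return ident

With branching 2**32 and depth 600, these ids span roughly 600 * 32 = 19,200 bits -- which, notably, still fits within MT19937's 19,937-bit state.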
To be honest, I'd even be satisfied with taking the treepath breadcrumbs and hashing them with a suitable cryptographic hash to get a stream ID. For, say, PCG64, which has 2**127 settable streams, do sha256(treepath.asbytes()) % (2**127). I'm okay with spray-and-pray if it's using well-analyzed cryptographic primitives and enough settable streams. Subject to actual testing, of course.

And that last point concerns me. PRNG quality testing is a dodgy subject as it is, and the state of the art for testing parallel streams is even worse. Making it more convenient to create new streams at a low library level makes me leery. I do think this is something that needs to be considered and designed at a high level, irrespective of the available APIs.

-- Robert Kern
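A sketch of that hashing scheme (`stream_id` is a hypothetical helper; the email's `treepath.asbytes()` is rendered here with NumPy's tobytes, and 2**127 is PCG64's settable-stream count):

import hashlib
import numpy as np

def stream_id(path, nstreams=2**127):
    # Serialize the breadcrumbs and hash them; reducing the 256-bit
    # digest modulo the stream count spreads tree paths roughly
    # uniformly over the available streams.
    breadcrumbs = np.asarray(path, dtype=np.uint64).tobytes()
    digest = hashlib.sha256(breadcrumbs).digest()
    return int.from_bytes(digest, 'big') % nstreams

Being spray-and-pray, this gives only birthday-bound guarantees: collisions become likely only after on the order of 2**63 of the 2**127 streams have been drawn.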
On Sun, May 22, 2016 at 2:35 AM, Robert Kern <robert.kern@gmail.com> wrote:
Well, I mean, engineers want lots of things. I suspect that most engineers *really* just want to call `numpy.random.seed(8675309)` at the start and never explicitly pass around separate streams. There's an upside to that in terms of code simplicity. There are also significant limitations and constraints. Ultimately, the upside is usually outweighed by the limitations compared to the alternative of passing around RandomState objects, so best practice is to pass around RandomState objects.
Could we do something like the logging module, and have numpy.random "manage" a bunch of stream objects for you -- so you could get the default single stream easily, and also get access to specific streams without needing to pass around the objects? That would allow folks to start off writing code the "easy" way -- then make it easier to refactor when they realize they need multiple independent streams.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
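For concreteness, such a logging-style registry might look like the following sketch (`get_stream` is purely hypothetical -- not an existing or proposed numpy function):

import numpy as np

_streams = {}  # module-level registry of named streams

def get_stream(name=None, seed=None):
    # Like logging.getLogger: return the stream registered under
    # `name`, creating and seeding it on first access; None names
    # the default shared stream.
    if name not in _streams:
        _streams[name] = np.random.RandomState(seed)
    return _streams[name]

rng = get_stream('mylib.bootstrap', seed=1234)
sample = rng.uniform(size=10)

Note that the registry itself is module-level state, which is exactly the objection raised in the reply below.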
On Mon, May 23, 2016 at 5:41 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Could we do something like the logging module, and have numpy.random "manage" a bunch of stream objects for you -- so you could get the default single stream easily, and also get access to specific streams without needing to pass around the objects?
No, I don't think so. The logging module's namespacing doesn't really have an equivalent use case for PRNGs. We would just be making a much more complicated global state to manage. -- Robert Kern
Stephan Hoyer <shoyer@gmail.com> wrote:
I have recently encountered several use cases for randomly generating random number seeds:
1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].
What about making numpy.random a finite state machine, and keeping a stack of RandomState seeds? That is, something similar to what OpenGL does for its matrices? Then we get two functions, numpy.random.push_seed and numpy.random.pop_seed. Sturla
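For concreteness, a minimal sketch of what push_seed/pop_seed could look like if layered on top of today's global stream (hypothetical functions; note that it saves and restores full generator states rather than bare seeds, since a seed alone cannot restore a stream's position):

import numpy as np

_state_stack = []

def push_seed(seed):
    # Save the current global state, then re-seed the global stream.
    _state_stack.append(np.random.get_state())
    np.random.seed(seed)

def pop_seed():
    # Restore the state saved by the matching push_seed.
    np.random.set_state(_state_stack.pop())

push_seed(8675309)
x = np.random.random(3)  # reproducible draws under the pushed seed
pop_seed()               # the outer stream resumes where it left off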
On Tue, May 17, 2016 at 2:40 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
Stephan Hoyer <shoyer@gmail.com> wrote:

I have recently encountered several use cases for randomly generating random number seeds:

1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].

2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].
What about making numpy.random a finite state machine, and keeping a stack of RandomState seeds? That is, something similar to what OpenGL does for its matrices? Then we get two functions, numpy.random.push_seed and numpy.random.pop_seed.
I don't think that addresses the issues brought up here. It's just more global state to worry about. -- Robert Kern
On Tue, May 17, 2016 at 9:40 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
Stephan Hoyer <shoyer@gmail.com> wrote:
I have recently encountered several use cases for randomly generating random number seeds:
1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].
What about making numpy.random a finite state machine, and keeping a stack of RandomState seeds? That is, something similar to what OpenGL does for its matrices? Then we get two functions, numpy.random.push_seed and numpy.random.pop_seed.
I don't like the idea of adding this kind of internal state. Having it built into the module means that it is shared by all callers -- libraries, user code, etc. That's not the right choice when a stack of seeds could easily be built around the RandomState object if that is really what someone needs.

Eric
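A sketch of that caller-owned alternative -- the same stack idea, but built around RandomState objects with no hidden module state (`SeedStack` is an illustrative name, not an existing class):

import numpy as np

class SeedStack:
    def __init__(self, seed=None):
        self._stack = [np.random.RandomState(seed)]

    @property
    def current(self):
        # The stream on top of the stack.
        return self._stack[-1]

    def push(self, seed):
        self._stack.append(np.random.RandomState(seed))

    def pop(self):
        return self._stack.pop()

rngs = SeedStack(0)
rngs.push(8675309)
sample = rngs.current.random_sample(3)
rngs.pop()  # back to the stream seeded with 0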
participants (7)

- Chris Barker
- Eric Moore
- josef.pktd@gmail.com
- Nathaniel Smith
- Robert Kern
- Stephan Hoyer
- Sturla Molden