[Numpy-discussion] Proposal: numpy.random.random_seed

Nathaniel Smith njs at pobox.com
Wed May 18 14:56:17 EDT 2016


On Wed, May 18, 2016 at 5:07 AM, Robert Kern <robert.kern at gmail.com> wrote:
> On Wed, May 18, 2016 at 1:14 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> On Tue, May 17, 2016 at 10:41 AM, Robert Kern <robert.kern at gmail.com>
>> wrote:
>> > On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> >>
>> >> On May 17, 2016 1:50 AM, "Robert Kern" <robert.kern at gmail.com> wrote:
>> >> >
>> >> [...]
>> >> > What you want is a function that returns many RandomState objects
>> >> > that are hopefully spread around the MT19937 space enough that they
>> >> > are essentially independent (in the absence of true jumpahead). The
>> >> > better implementation of such a function would look something like
>> >> > this:
>> >> >
>> >> > def spread_out_prngs(n, root_prng=None):
>> >> >     if root_prng is None:
>> >> >         root_prng = np.random
>> >> >     elif not isinstance(root_prng, np.random.RandomState):
>> >> >         root_prng = np.random.RandomState(root_prng)
>> >> >     sprouted_prngs = []
>> >> >     for i in range(n):
>> >> >         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>> >> >         sprouted_prngs.append(np.random.RandomState(seed_array))
>> >> >     return sprouted_prngs
>> >>
>> >> Maybe a nice way to encapsulate this in the RandomState interface
>> >> would be a method RandomState.random_state() that generates and
>> >> returns a new child RandomState.
>> >
>> > I disagree. This is a workaround in the absence of proper jumpahead or
>> > guaranteed-independent streams. I would not encourage it.
>> >
>> >> > Internally, this generates seed arrays of about the size of the
>> >> > MT19937 state to make sure that you can access more of the state
>> >> > space. That will at least make the chance of collision tiny. And it
>> >> > can be easily rewritten to take advantage of one of the newer PRNGs
>> >> > that have true independent streams:
>> >> >
>> >> >   https://github.com/bashtage/ng-numpy-randomstate
>> >>
>> >> ... But unfortunately I'm not sure how to make my interface
>> >> suggestion above work on top of one of these RNGs, because for
>> >> RandomState.random_state you really want a tree of independent RNGs
>> >> and the fancy new PRNGs only provide a single flat namespace :-/. And
>> >> even more annoyingly, the tree API is actually a nicer API, because
>> >> with a flat namespace you have to know up front about all possible
>> >> RNGs your code will use, which is an unfortunate global coupling that
>> >> makes it difficult to compose programs out of independent pieces,
>> >> while the RandomState.random_state approach composes beautifully.
>> >> Maybe there's some clever way to allocate a 64-bit namespace to make
>> >> it look tree-like? I'm not sure 64 bits is really enough...
>> >
>> > MT19937 doesn't have a "tree" any more than the others. It's the same
>> > flat state space. You are just getting the illusion of a tree by
>> > hoping that you never collide. You ought to think about precisely the
>> > same global coupling issues with MT19937 as you do with
>> > guaranteed-independent streams. Hope-and-prayer isn't really a
>> > substitute for properly engineering your problem. It's just a moral
>> > hazard to promote this method to the main API.
>>
>> Nonsense.
>>
>> If your definition of "hope and prayer" includes assuming that we
>> won't encounter a random collision in a 2**19937 state space, then
>> literally all engineering is hope-and-prayer. A collision could
>> happen, but if it does it's overwhelmingly more likely to happen
>> because of a flaw in the mathematical analysis, or a bug in the
>> implementation, or because random quantum fluctuations caused you and
>> your program to suddenly be transported to a parallel world where 1 +
>> 1 = 1, than that you just got unlucky with your random state. And all
>> of these hazards apply equally to both MT19937 and more modern PRNGs.
>
> Granted.
>
>> ...anyway, the real reason I'm a bit grumpy is because there are solid
>> engineering reasons why users *want* this API,
>
> I remain unconvinced on this mark. Grumpily.

Sorry for getting grumpy :-). The engineering reasons seem pretty
obvious to me though? If you have any use case for independent streams
at all, and you're writing code that's intended to live inside a
library's abstraction barrier, then you need some way to choose your
streams to avoid colliding with arbitrary other code that the end-user
might assemble alongside yours as part of their final program. So
AFAICT you have two options: either you need a "tree-style" API for
allocating these streams, or else you need to add some explicit API to
your library that lets the end-user control in detail which streams
you use. Both are possible, but the latter is obviously undesirable
if you can avoid it, since it breaks the abstraction barrier, making
your library more complicated to use and harder to evolve.
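
For concreteness, here is what a tree-style spawn could look like as a
free function over RandomState. This is only a sketch of the proposed
interface, not an actual numpy API; the `dtype` argument to randint
needs numpy >= 1.11:

```python
import numpy as np

def random_state(parent):
    # Sketch of the proposed RandomState.random_state(): spawn a child
    # generator seeded from the parent's own stream, so each library can
    # derive streams behind its abstraction barrier without any global
    # coordination.  Illustrative only -- not an actual numpy API.
    seed = parent.randint(0, 1 << 32, size=624, dtype=np.uint32)
    return np.random.RandomState(seed)

# Two libraries each sprout their own stream from one user-provided root,
# without knowing anything about each other:
root = np.random.RandomState(12345)
lib_a_stream = random_state(root)
lib_b_stream = random_state(root)
```

The point is the composition: the end-user seeds one root, and every
library downstream can sprout what it needs without an up-front global
allocation of stream IDs.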

>> so whether or not it
>> turns out to be possible I think we should at least be allowed to have
>> a discussion about whether there's some way to give it to them.
>
> I'm not shutting down discussion of the option. I *implemented* the option.
> I think that discussing whether it should be part of the main API is
> premature. There probably ought to be a paper or three out there supporting
> its safety and utility first. Let the utility function version flourish
> first.

OK -- I guess this particularly makes sense given how
extra-tightly-constrained we currently are in fixing mistakes in
np.random. But I feel like in the end the right place for this really
is inside the RandomState interface, because the person implementing
RandomState is the one best placed to understand (a) the gnarly
technical details here, and (b) how those change depending on the
particular PRNG in use. I don't want to end up with a bunch of
subtly-buggy utility functions in non-specialist libraries like dask
-- so we should be trying to help downstream users figure out how to
actually get this into np.random?

>> It's
>> not even 100% out of the question that we conclude that existing PRNGs
>> are buggy because they don't take this use case into account -- it
>> would be far from the first time that numpy found itself going beyond
>> the limits of older numerical tools that weren't designed to build the
>> kind of large composable systems that numpy gets used for.
>>
>> MT19937's state space is large enough that you could explicitly encode
>> a "tree seed" into it, even if you don't trust the laws of probability
>> -- e.g., you start with a RandomState with id [], then its children
>> have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
>> 1], ..., [1, 0], ..., and you write these into the state (probably
>> after sending them through some bit-mixing permutation), to guarantee
>> non-collision.
>
> Sure. Not entirely sure if that can be done without preallocating the
> branching factor or depth, but I'm sure there's some fancy combinatoric
> representation of an unbounded tree that could be exploited here. It seems
> likely to me that such a thing could be done with the stream IDs of PRNGs
> that support that.

I'm pretty sure you do have to preallocate both the branching factor
and depth, since getting the collision-free guarantee requires that
across the universe of possible tree addresses, each state id gets
used at most once -- and there are finitely many state ids. But as a
practical matter, saying "you can only sprout up to 2**32 new states
out of any given state, and you can only nest them 600 deep, and
exceeding these bounds is an error" would still be "composable enough"
for ~all practical purposes.
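
One way to see that preallocating a branching factor and depth suffices:
bijective base-B numeration maps every tree address to a distinct
integer, which could then be fed through the usual seed expansion. A
pure-Python sketch (the helper and its name are hypothetical):

```python
def tree_seed(address, branching=2**32, depth=600):
    # Map a tree address like [3, 0, 7] to a unique nonnegative integer
    # via bijective base-`branching` numeration: each level contributes
    # one digit in 1..branching, so distinct addresses can never collide.
    # With branching=2**32 and depth=600 the largest seed is below
    # 2**19200, which still fits inside MT19937's 19937-bit state.
    if len(address) > depth:
        raise ValueError("tree nested too deep")
    seed = 0
    for node in address:
        if not 0 <= node < branching:
            raise ValueError("branching factor exceeded")
        seed = seed * branching + node + 1  # digits 1..branching, not 0-based
    return seed
```

The `+ 1` is what distinguishes, e.g., address [0] from [0, 0]; with
plain base-B digits the two would collide at zero.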

>> Or if you do trust the laws of probability, then the
>> randomly-sample-a-PRNG approach is not 100% out of the question even
>> for more modern PRNGs.
>
> Tread carefully here. PRNGs are not RNGs. Using the root PRNG to generate N
> new seeds for the same PRNG does not necessarily generate N good,
> uncorrelated streams with high probability. Overlap is not the only factor
> in play. Long-term correlations happen even in the case where the N streams
> do not overlap, and the structure of the root PRNG could well generate N
> such correlated seeds.
>
> http://www.adv-radio-sci.net/12/75/2014/
>
> See section 4.4. It does look like one can get good MT19937 streams with
> sequential integer seeds, when expanded by the standard sub-PRNG that we use
> for that purpose (yay!). However, if you use a different (but otherwise
> good-quality and general purpose) sub-PRNG to expand the key, you get
> increasing numbers of failures as you increase the number of parallel
> streams. It is not explored in the paper exactly why this is the case, but I
> suspect that the similarity of the second sub-PRNG algorithm to that of the
> MT itself is part of the problem.
>
> It is *likely* that the scheme I implemented will be okay (our array-seed
> expansion algorithm is similar to the integer-seed expansion and *probably*
> insulates the sprouted MTs from their MT generating source), but that
> remains to be demonstrated.

Right... I think the way to think about this is to split it into two
pieces. First, there's the issue that correlated streams exist at all.
That they do for MT is annoying and closely related to why we prefer
other PRNGs these days. When doing the analysis of the collision
probability of randomly sprouted generators, this effectively reduces
the size of the space and increases the risk of collision. For MT this
by itself isn't a big deal because the space is so ludicrously large,
but for the more modern PRNGs with smaller state spaces you definitely
want these correlated streams not to exist at all. Fortunately I
believe this is a design criterion that modern PRNGs take into
account, but yeah, one needs to check this. (This kind of issue is
also why I have a fondness for PRNGs designed on the "get a bigger
hammer" principle, like AES-CTR :-).)
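
To illustrate the "bigger hammer" idea without pulling in a crypto
library, here is a toy counter-mode generator in the AES-CTR spirit,
with SHA-256 standing in for AES so the sketch stays stdlib-only (a
sketch of the design principle, not production code):

```python
import hashlib
import struct

class HashCTR:
    # Counter-mode generator in the AES-CTR spirit.  Block i of stream
    # `stream_id` is hash(key, stream_id, i), so distinct stream ids give
    # non-overlapping, uncorrelated streams *by construction* -- no
    # seed-collision or overlap analysis needed.
    def __init__(self, key, stream_id=0):
        self.key = key
        self.stream_id = stream_id
        self.counter = 0

    def next_block(self):
        msg = struct.pack("<QQQ", self.key, self.stream_id, self.counter)
        self.counter += 1
        return hashlib.sha256(msg).digest()  # 32 pseudorandom bytes
```

Two generators sharing a key but holding different stream ids can be
handed to independent libraries with no coordination beyond choosing
distinct ids.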

Second, there's the issue that that paper talks about, where *if*
correlated streams exist, then the structure of your PRNG might be
such that when sampling a new seed from an old PRNG, you're
unreasonably likely to hit one of them. (The most obvious example of
this would be if you use an archaic PRNG design that works by
outputting its internal state and then mixing it -- if you use the
output of this PRNG to seed a new PRNG then the two will be identical
modulo some lag!) If your sprouting procedure is subject to this kind
of phenomenon, then it totally invalidates the use of simple
probability theory to analyze the chance of collisions.
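
The "identical modulo some lag" failure is easy to demonstrate with a
toy generator that outputs its raw state (a deliberately archaic design,
for illustration only):

```python
class StateOutputPRNG:
    # Deliberately archaic design: the output *is* the internal state
    # (one 64-bit LCG step, constants from Knuth's MMIX).  Seeding a
    # child from the parent's output makes the child replay the parent's
    # stream exactly, one step behind.
    MULT = 6364136223846793005
    INC = 1442695040888963407
    MASK = (1 << 64) - 1

    def __init__(self, seed):
        self.state = seed & self.MASK

    def next(self):
        self.state = (self.state * self.MULT + self.INC) & self.MASK
        return self.state

parent = StateOutputPRNG(42)
child = StateOutputPRNG(parent.next())  # "fresh" child seeded from parent
# From here on, every output of child coincides with the parent's.
```

With such a design the child is not merely correlated with the parent;
it is the parent's stream verbatim, which is exactly the kind of
structure that invalidates the birthday-style collision analysis.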

But, I ignored this in my analysis because it's definitely solvable
:-). All we need is some deterministic function that we can apply to
the output of our original PRNG that will give us back something
"effectively random" to use as our sprouted seed, totally destroying
all correlations. And this problem of destroying correlations is
well-studied in the crypto world (basically destroying these
correlations *is* the entire problem of cryptographic primitive
design), so even if we have doubts about the specific initialization
functions usually used with MT, we have available to us lots of
primitives that we're *very* confident will do a *very* thorough job,
and sprouting isn't a performance-sensitive operation, so it doesn't
matter if it takes a little longer using some crypto hash instead of
the classic MT seeding algorithm. And then the simple birthday
calculations are appropriate again.
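
A hash-hardened variant of the sprouting function could look like the
following sketch (the helper and its name are hypothetical, not a numpy
API; it assumes numpy >= 1.11 for the randint `dtype` argument):

```python
import hashlib
import numpy as np

def sprout(parent, n):
    # Hash-hardened sprouting: draw raw words from the parent, then push
    # them through SHA-256 so that any structural correlation with the
    # parent's MT19937 state is destroyed before the digest is used as a
    # child's seed array.  Hypothetical helper, for illustration only.
    children = []
    for _ in range(n):
        raw = parent.randint(0, 1 << 32, size=8, dtype=np.uint32).tobytes()
        digest = hashlib.sha256(raw).digest()
        seed = np.frombuffer(digest, dtype=np.uint32)  # 8 x uint32 words
        children.append(np.random.RandomState(seed))
    return children
```

Because the hash output is "effectively random" regardless of the
structure of the parent's stream, the simple birthday-collision
calculation over the seed space applies.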

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


