[scikit-learn] NEP: Random Number Generator Policy

Sat Jun 16 20:29:33 EDT 2018

On 6/16/18 05:54, josef.pktd at gmail.com wrote:
> On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern <robert.kern at gmail.com> wrote:
>> I have made a significant revision. In this version, downstream projects
>> like scikit-learn should experience significantly less forced churn.
>>
>> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst
>>
>> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
>>
>> tl;dr RandomState lives! But its distributions are forever frozen. So maybe
>> "undead" is more apt. Anyways, RandomState will continue to provide the same
>> stream-compatibility that it always has. But it will be internally
>> refactored to use the same core uniform PRNG objects that the new
>> RandomGenerator distributions class will use underneath (defaulting to the
>> current Mersenne Twister, of course). The distribution methods on
>> RandomGenerator will be allowed to evolve with numpy versions and get
>> better/faster implementations.
>>
>> Your code can mix the usage of RandomState and RandomGenerator as needed,
>> but they can be made to share the same underlying RNG algorithm's state.
> 
> 
> Sounds good to me, and I think handles all our concerns.
> 
> I also think that the issues behind the np.random.* section about the
> global state and seed can be revisited for possible deprecation of
> convenience features.
> 
> One clarifying question, mainly to see IIUC
> 
> in this quote
> """
> Calling numpy.random.seed() thereafter SHOULD just pass the given seed
> to the current basic RNG object and not attempt to reset the basic RNG
> to the Mersenne Twister. The global RandomState instance MUST be
> accessible by the name numpy.random.mtrand._rand
> """
> 
> "the current basic RNG object" refers to the global object. AFAIU, it
> is possible to change it numpy.random.mtrand._rand. Is it?

numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is 
now) a RandomState instance. "the current basic RNG object" would be the basic 
RNG that that global RandomState instance is currently using.

It is not possible (now or in the glorious NEP future) to assign a new instance 
to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually 
just simple aliases to the methods on that object when the module is first 
built. Re-assigning _rand wouldn't reassign those aliases. 
numpy.random.standard_normal(), for instance, would still be the 
.standard_normal() method on the RandomState instance that _rand initially 
pointed to.

Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is 
to call its methods (i.e. the numpy.random.* convenience functions) to modify 
its internal state. That's not changing.

The only thing that will change will be that there will be a new numpy.random.* 
function to call that will let you give the global RandomState a new basic RNG 
object that it will swap in internally. Let's call it 
np.random.swap_global_basic_rng(). If you don't use that function, you won't 
have a problem. I intend to make this new function *very* explicit about what it 
is doing, and document the crap out of it so it won't be misused like 
np.random.seed() is.

> I never tried that so I didn't know we can change the global
> RandomState, and thought we will have to replace np.random.seed usage > with a specific RandomState(seed) instance

I did a quick review of np.random.seed() usage in statsmodels, and I think you 
are mostly fine. It looks like you mostly use it in unit tests and at the top of 
examples. The only possible problem that I can see that you might have with the 
swap_global_basic_rng() is if some other package that you rely on calls it in 
its library code. Then subsequent statsmodels unit tests might fail because when 
they call np.random.seed(), it won't be reseeding a Mersenne Twister but another 
basic RNG.

However, I intend to make that a weird and unnatural thing to do. It's already 
unlikely to happen as it's a niche requirement that one mostly would need at the 
start of a whole program, not buried down inside library code. But we will also 
document that function to discourage such usage, and probably have unconditional 
noisy warnings that users would have to explicitly silence.

If one of your dependencies did that, you'd be well within your rights to tell 
them that they are misusing numpy and causing breakage in statsmodels.

> In loose analogy:
> 
> Matplotlib has a "global" current figure and axis, gca, gcf.
> In statsmodels we avoid any access to and usage of it and only work
> with individual figure/axis instances that can be provided by the
> user. (except for maybe some documentation examples and maybe some
> "legacy" code.)
> ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48
> )
> 
> AFAICS, essentially, statsmodels will need a similar policy for
> RandomState/RandomGenerator and give up the usage of the global random
> instance.

I mean, you certainly *should* (outside of unit tests) for very similar reasons 
why you avoid the global state in matplotlib, but this NEP won't force you to. 
You should do so anyways under the status quo, too. For any of your functions 
that call np.random.* functions internally, it's hard to use them in threaded 
applications, for instance, because it is relying on that global state.

scikit-learn's check_random_state() is a good pattern to follow.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco