[scikit-learn] NEP: Random Number Generator Policy
Robert Kern
robert.kern at gmail.com
Sat Jun 16 20:29:33 EDT 2018
On 6/16/18 05:54, josef.pktd at gmail.com wrote:
> On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern <robert.kern at gmail.com> wrote:
>> I have made a significant revision. In this version, downstream projects
>> like scikit-learn should experience significantly less forced churn.
>>
>> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst
>>
>> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
>>
>> tl;dr RandomState lives! But its distributions are forever frozen. So maybe
>> "undead" is more apt. Anyways, RandomState will continue to provide the same
>> stream-compatibility that it always has. But it will be internally
>> refactored to use the same core uniform PRNG objects that the new
>> RandomGenerator distributions class will use underneath (defaulting to the
>> current Mersenne Twister, of course). The distribution methods on
>> RandomGenerator will be allowed to evolve with numpy versions and get
>> better/faster implementations.
>>
>> Your code can mix the usage of RandomState and RandomGenerator as needed,
>> but they can be made to share the same underlying RNG algorithm's state.
>
>
> Sounds good to me, and I think handles all our concerns.
>
> I also think that the issues behind the np.random.* section about the
> global state and seed can be revisited for possible deprecation of
> convenience features.
>
> One clarifying question, mainly to see IIUC
>
> in this quote
> """
> Calling numpy.random.seed() thereafter SHOULD just pass the given seed
> to the current basic RNG object and not attempt to reset the basic RNG
> to the Mersenne Twister. The global RandomState instance MUST be
> accessible by the name numpy.random.mtrand._rand
> """
>
> "the current basic RNG object" refers to the global object. AFAIU, it
> is possible to change it numpy.random.mtrand._rand. Is it?
numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is
now) a RandomState instance. "the current basic RNG object" would be the basic
RNG that the global RandomState instance is currently using.

It is not possible (now or in the glorious NEP future) to assign a new instance
to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually
just simple aliases to the methods on that object when the module is first
built. Re-assigning _rand wouldn't reassign those aliases.
numpy.random.standard_normal(), for instance, would still be the
.standard_normal() method on the RandomState instance that _rand initially
pointed to.
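The aliasing behaviour described above can be sketched without touching numpy at all. This is only a toy illustration under stated assumptions: `fake_module` is a hypothetical stand-in for numpy.random.mtrand, and the stdlib `random.Random` stands in for RandomState. It shows that a module-level alias bound once keeps pointing at the original instance even after the `_rand` name is re-assigned.

```python
import random
import types

# Hypothetical stand-in for a module like numpy.random.mtrand (not numpy itself).
fake_module = types.SimpleNamespace()
fake_module._rand = random.Random(0)

# The convenience function is bound once, when the "module" is built.
fake_module.random = fake_module._rand.random

original = fake_module._rand

# Re-assigning the private name does NOT rebind the alias created above.
fake_module._rand = random.Random(12345)

# The alias is still a bound method of the instance _rand initially pointed to.
alias_owner = fake_module.random.__self__
print(alias_owner is original)           # the alias still uses the old instance
print(alias_owner is fake_module._rand)  # and ignores the newly assigned one
```

So the only way to affect what the alias returns is to mutate the state of the original object, which is the point made above about numpy.random.mtrand._rand.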
Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is
to call its methods (i.e. the numpy.random.* convenience functions) to modify
its internal state. That's not changing.

The only thing that will change is that there will be a new numpy.random.*
function to call that will let you give the global RandomState a new basic RNG
object that it will swap in internally. Let's call it
np.random.swap_global_basic_rng(). If you don't use that function, you won't
have a problem. I intend to make this new function *very* explicit about what it
is doing, and document the crap out of it so it won't be misused like
np.random.seed() is.
> I never tried that so I didn't know we can change the global
> RandomState, and thought we will have to replace np.random.seed usage
> with a specific RandomState(seed) instance
I did a quick review of np.random.seed() usage in statsmodels, and I think you
are mostly fine. It looks like you mostly use it in unit tests and at the top of
examples. The only possible problem that I can see that you might have with the
swap_global_basic_rng() is if some other package that you rely on calls it in
its library code. Then subsequent statsmodels unit tests might fail because when
they call np.random.seed(), it won't be reseeding a Mersenne Twister but another
basic RNG.

However, I intend to make that a weird and unnatural thing to do. It's already
unlikely to happen as it's a niche requirement that one mostly would need at the
start of a whole program, not buried down inside library code. But we will also
document that function to discourage such usage, and probably have unconditional
noisy warnings that users would have to explicitly silence.

If one of your dependencies did that, you'd be well within your rights to tell
them that they are misusing numpy and causing breakage in statsmodels.
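To make the failure mode above concrete, here is a toy model, not numpy code. Every name in it (`ToyMT`, `ToyLCG`, `ToyRandomState`, `swap_basic_rng`) is hypothetical and exists only for illustration; the real function does not exist yet and its name and signature are not settled. The sketch assumes only what is described above: seed() passes the seed to whatever basic RNG is current, and a swap replaces that basic RNG in place.

```python
import random


class ToyMT:
    """Toy 'basic RNG' backed by the stdlib Mersenne Twister."""

    def __init__(self, seed=None):
        self._r = random.Random(seed)

    def seed(self, seed):
        self._r.seed(seed)

    def random(self):
        return self._r.random()


class ToyLCG:
    """Toy 'basic RNG': a small linear congruential generator."""

    def __init__(self, seed=0):
        self.seed(seed)

    def seed(self, seed):
        self._state = seed % 2**32

    def random(self):
        self._state = (1664525 * self._state + 1013904223) % 2**32
        return self._state / 2**32


class ToyRandomState:
    """Toy stand-in for the global RandomState; it owns one basic RNG."""

    def __init__(self, basic_rng):
        self._basic_rng = basic_rng

    def seed(self, seed):
        # Seeds the *current* basic RNG; does not reset to Mersenne Twister.
        self._basic_rng.seed(seed)

    def random(self):
        return self._basic_rng.random()

    def swap_basic_rng(self, new_basic_rng):
        # Hypothetical analogue of the proposed swap function.
        self._basic_rng = new_basic_rng


global_state = ToyRandomState(ToyMT())

# What a unit test expects: seed(0) reseeds a Mersenne Twister.
global_state.seed(0)
seeded_before = global_state.random()

# Some dependency swaps the basic RNG inside its library code.
global_state.swap_basic_rng(ToyLCG())

# The same test now reseeds a different algorithm and sees a different stream.
global_state.seed(0)
seeded_after = global_state.random()

print(seeded_before == seeded_after)
```

The test's hard-coded expectations would only hold for the first stream, which is why the swap belongs at the top of a program rather than in library code.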
> In loose analogy:
>
> Matplotlib has a "global" current figure and axis, gca, gcf.
> In statsmodels we avoid any access to and usage of it and only work
> with individual figure/axis instances that can be provided by the
> user. (except for maybe some documentation examples and maybe some
> "legacy" code.)
> ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48
> )
>
> AFAICS, essentially, statsmodels will need a similar policy for
> RandomState/RandomGenerator and give up the usage of the global random
> instance.
I mean, you certainly *should* (outside of unit tests) for very similar reasons
why you avoid the global state in matplotlib, but this NEP won't force you to.
You should do so anyways under the status quo, too. Any of your functions
that call np.random.* functions internally are hard to use in threaded
applications, for instance, because they rely on that global state.
scikit-learn's check_random_state() is a good pattern to follow.
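For readers who have not seen that pattern, here is a simplified sketch modeled on the idea behind scikit-learn's helper. It is not a copy of the scikit-learn source, and `bootstrap_means` is a hypothetical function added only to show how a library routine would accept a `random_state` argument instead of calling np.random.* directly.

```python
import numbers

import numpy as np


def check_random_state(seed=None):
    """Turn None, an int, or a RandomState into a RandomState instance."""
    if seed is None:
        # Fall back to the global instance only when the caller asks for nothing.
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState instance")


def bootstrap_means(data, n_resamples=100, random_state=None):
    """Hypothetical library function: bootstrap estimates of the mean."""
    rng = check_random_state(random_state)
    data = np.asarray(data, dtype=float)
    # Draw resampling indices from the caller-supplied generator, not the global one.
    idx = rng.randint(0, len(data), size=(n_resamples, len(data)))
    return data[idx].mean(axis=1)


# Two calls with the same integer seed are reproducible and leave global state alone.
a = bootstrap_means([1.0, 2.0, 3.0, 4.0], n_resamples=5, random_state=0)
b = bootstrap_means([1.0, 2.0, 3.0, 4.0], n_resamples=5, random_state=0)
print(np.array_equal(a, b))
```

A caller in a threaded application can then hand each thread its own RandomState instance, and nothing touches the shared global unless `random_state=None` is passed.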
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco