NEP: Random Number Generator Policy
https://mail.python.org/pipermail/numpy-discussion/2018-June/078126.html

Hi, sklearners! I have a NEP out for discussion that proposes a change in numpy.random's stream-compatibility policy. As scikit-learn is a well-disciplined consumer of reproducible streams, I would appreciate your input on the numpy-discussion thread linked above.

The very short form is that there is a new PRNG subsystem being developed with better core PRNGs (among other things, providing nice features like independent streams for parallel computations), and we would like to relax our strict stream-compatibility policy for the non-uniform distributions in this new subsystem so that we can improve our algorithms. The core uniform numbers would still be strictly stream-compatible across numpy versions. But we would like to be able to upgrade our non-uniform algorithms, for example, to make normal variates faster to generate.

RandomState would be frozen and subject to a long deprecation cycle for a period of strict backwards compatibility. There would be some non-deprecated provision to get strictly-compatible streams for a subset of distributions for the limited purpose of generating test data for unit tests.

Please read the NEP and the thread through. I do propose at least one alternative in the thread and would like some feedback on it. I would also appreciate it if we could consolidate the discussion on the numpy-discussion thread and not have a split-off conversation here too.

Thank you very much! I appreciate your attention.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
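For concreteness, a sketch of the parallel-streams feature mentioned above, written against the API as it eventually shipped in NumPy 1.17 (SeedSequence and default_rng; these exact names postdate the NEP draft, so treat them as an assumption here):

    from numpy.random import SeedSequence, default_rng

    # One parent seed, spawned into statistically independent children,
    # e.g. one stream per worker process:
    ss = SeedSequence(20180602)
    child_rngs = [default_rng(child) for child in ss.spawn(4)]
    draws = [rng.standard_normal(3) for rng in child_rngs]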
I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn.

https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-...
https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html

tl;dr RandomState lives! But its distributions are forever frozen. So maybe "undead" is more apt. Anyways, RandomState will continue to provide the same stream-compatibility that it always has. But it will be internally refactored to use the same core uniform PRNG objects that the new RandomGenerator distributions class will use underneath (defaulting to the current Mersenne Twister, of course). The distribution methods on RandomGenerator will be allowed to evolve with numpy versions and get better/faster implementations.

Your code can mix the usage of RandomState and RandomGenerator as needed, but they can be made to share the same underlying RNG algorithm's state.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
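A minimal sketch of that sharing, written with the names that eventually shipped (Generator and MT19937 rather than the NEP's provisional RandomGenerator, so the exact spellings are an assumption relative to the draft):

    from numpy.random import Generator, MT19937, RandomState

    mt = MT19937(12345)        # one core PRNG ("basic RNG") instance
    legacy = RandomState(mt)   # frozen, stream-compatible distributions
    modern = Generator(mt)     # distributions free to improve over time

    # Both front ends advance the *same* underlying Mersenne Twister state:
    a = legacy.standard_normal(3)   # consumes bits from mt
    b = modern.standard_normal(3)   # continues from where legacy left off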
On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern <robert.kern@gmail.com> wrote:
I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn.
https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-...
https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
tl;dr RandomState lives! But its distributions are forever frozen. So maybe "undead" is more apt. Anyways, RandomState will continue to provide the same stream-compatibility that it always has. But it will be internally refactored to use the same core uniform PRNG objects that the new RandomGenerator distributions class will use underneath (defaulting to the current Mersenne Twister, of course). The distribution methods on RandomGenerator will be allowed to evolve with numpy versions and get better/faster implementations.
Your code can mix the usage of RandomState and RandomGenerator as needed, but they can be made to share the same underlying RNG algorithm's state.
Sounds good to me, and I think it handles all our concerns.

I also think that the issues behind the np.random.* section (the global state and seed) can be revisited for possible deprecation of the convenience features.

One clarifying question, mainly to check that I understand correctly.

In this quote: """ Calling numpy.random.seed() thereafter SHOULD just pass the given seed to the current basic RNG object and not attempt to reset the basic RNG to the Mersenne Twister. The global RandomState instance MUST be accessible by the name numpy.random.mtrand._rand """

"the current basic RNG object" refers to the global object. AFAIU, it is possible to change it, numpy.random.mtrand._rand. Is it?

I never tried that, so I didn't know we could change the global RandomState, and thought we would have to replace np.random.seed usage with a specific RandomState(seed) instance.

In loose analogy: Matplotlib has a "global" current figure and axis, gca, gcf. In statsmodels we avoid any access to and usage of it and only work with individual figure/axis instances that can be provided by the user (except for maybe some documentation examples and maybe some "legacy" code). ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/... )

AFAICS, essentially, statsmodels will need a similar policy for RandomState/RandomGenerator and give up the usage of the global random instance.

Josef
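The replacement described here is mechanical; a small sketch using only the long-standing numpy API:

    import numpy as np

    # Seeding the hidden global stream:
    np.random.seed(0)
    x = np.random.normal(size=5)

    # Equivalent, but with an explicit, passable instance and no global state:
    rng = np.random.RandomState(0)
    y = rng.normal(size=5)

    assert (x == y).all()   # same Mersenne Twister stream either way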
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 6/16/18 05:54, josef.pktd@gmail.com wrote:
On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern <robert.kern@gmail.com> wrote:
I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn.
https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-...
https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
tl;dr RandomState lives! But its distributions are forever frozen. So maybe "undead" is more apt. Anyways, RandomState will continue to provide the same stream-compatibility that it always has. But it will be internally refactored to use the same core uniform PRNG objects that the new RandomGenerator distributions class will use underneath (defaulting to the current Mersenne Twister, of course). The distribution methods on RandomGenerator will be allowed to evolve with numpy versions and get better/faster implementations.
Your code can mix the usage of RandomState and RandomGenerator as needed, but they can be made to share the same underlying RNG algorithm's state.
Sounds good to me, and I think it handles all our concerns.
I also think that the issues behind the np.random.* section (the global state and seed) can be revisited for possible deprecation of the convenience features.
One clarifying question, mainly to check that I understand correctly.
In this quote: """ Calling numpy.random.seed() thereafter SHOULD just pass the given seed to the current basic RNG object and not attempt to reset the basic RNG to the Mersenne Twister. The global RandomState instance MUST be accessible by the name numpy.random.mtrand._rand """
"the current basic RNG object" refers to the global object. AFAIU, it is possible to change it numpy.random.mtrand._rand. Is it?
numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is now) a RandomState instance. "The current basic RNG object" would be the basic RNG that that global RandomState instance is currently using.

It is not possible (now or in the glorious NEP future) to assign a new instance to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually just simple aliases to the methods on that object when the module is first built. Re-assigning _rand wouldn't reassign those aliases. numpy.random.standard_normal(), for instance, would still be the .standard_normal() method on the RandomState instance that _rand initially pointed to.

Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* convenience functions) to modify its internal state. That's not changing.

The only thing that will change will be that there will be a new numpy.random.* function to call that will let you give the global RandomState a new basic RNG object that it will swap in internally. Let's call it np.random.swap_global_basic_rng(). If you don't use that function, you won't have a problem. I intend to make this new function *very* explicit about what it is doing, and document the crap out of it so it won't be misused like np.random.seed() is.
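A quick sketch of that aliasing behavior with the existing API; the assert only illustrates that the module-level function and the hidden global instance share one stream:

    import numpy as np

    # numpy.random.* functions were captured as bound methods of the hidden
    # global RandomState when the module was first built:
    alias = np.random.standard_normal

    np.random.seed(42)
    a = alias(3)

    np.random.seed(42)
    b = np.random.standard_normal(3)

    # Both draw from the same global stream, so the draws match:
    assert (a == b).all()

    # Rebinding the name numpy.random.mtrand._rand would NOT redirect
    # these aliases; they keep pointing at the original instance.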
I never tried that, so I didn't know we could change the global RandomState, and thought we would have to replace np.random.seed usage with a specific RandomState(seed) instance.
I did a quick review of np.random.seed() usage in statsmodels, and I think you are mostly fine. It looks like you mostly use it in unit tests and at the top of examples. The only possible problem that I can see that you might have with the swap_global_basic_rng() is if some other package that you rely on calls it in its library code. Then subsequent statsmodels unit tests might fail because when they call np.random.seed(), it won't be reseeding a Mersenne Twister but another basic RNG.

However, I intend to make that a weird and unnatural thing to do. It's already unlikely to happen, as it's a niche requirement that one mostly would need at the start of a whole program, not buried down inside library code. But we will also document that function to discourage such usage, and probably have unconditional noisy warnings that users would have to explicitly silence.

If one of your dependencies did that, you'd be well within your rights to tell them that they are misusing numpy and causing breakage in statsmodels.
In loose analogy:
Matplotlib has a "global" current figure and axis, gca, gcf. In statsmodels we avoid any access to and usage of it and only work with individual figure/axis instances that can be provided by the user (except for maybe some documentation examples and maybe some "legacy" code). ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/... )
AFAICS, essentially, statsmodels will need a similar policy for RandomState/RandomGenerator and give up the usage of the global random instance.
I mean, you certainly *should* (outside of unit tests), for very similar reasons why you avoid the global state in matplotlib, but this NEP won't force you to. You should do so anyways under the status quo, too. For any of your functions that call np.random.* functions internally, it's hard to use them in threaded applications, for instance, because they rely on that global state.

scikit-learn's check_random_state() is a good pattern to follow.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
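For reference, the pattern looks roughly like this in library code (bootstrap_mean is a made-up function for illustration; check_random_state is the real scikit-learn helper):

    import numpy as np
    from sklearn.utils import check_random_state

    def bootstrap_mean(x, n_boot=1000, random_state=None):
        # None -> the global RandomState; int -> a fresh seeded RandomState;
        # a RandomState instance is passed through unchanged.
        rng = check_random_state(random_state)
        idx = rng.randint(0, len(x), size=(n_boot, len(x)))
        return np.asarray(x)[idx].mean(axis=1)

    means = bootstrap_mean([1.0, 2.0, 3.0, 4.0], random_state=0)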
On Sat, Jun 16, 2018 at 8:29 PM, Robert Kern <robert.kern@gmail.com> wrote:
On 6/16/18 05:54, josef.pktd@gmail.com wrote:
On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern <robert.kern@gmail.com> wrote:
I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn.
https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-...
https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
tl;dr RandomState lives! But its distributions are forever frozen. So maybe "undead" is more apt. Anyways, RandomState will continue to provide the same stream-compatibility that it always has. But it will be internally refactored to use the same core uniform PRNG objects that the new RandomGenerator distributions class will use underneath (defaulting to the current Mersenne Twister, of course). The distribution methods on RandomGenerator will be allowed to evolve with numpy versions and get better/faster implementations.
Your code can mix the usage of RandomState and RandomGenerator as needed, but they can be made to share the same underlying RNG algorithm's state.
Sounds good to me, and I think it handles all our concerns.
I also think that the issues behind the np.random.* section (the global state and seed) can be revisited for possible deprecation of the convenience features.
One clarifying question, mainly to check that I understand correctly.
In this quote: """ Calling numpy.random.seed() thereafter SHOULD just pass the given seed to the current basic RNG object and not attempt to reset the basic RNG to the Mersenne Twister. The global RandomState instance MUST be accessible by the name numpy.random.mtrand._rand """
"the current basic RNG object" refers to the global object. AFAIU, it is possible to change it numpy.random.mtrand._rand. Is it?
numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is now) a RandomState instance. "the current basic RNG object" would be the basic RNG that that global RandomState instance is currently using.
It is not possible (now or in the glorious NEP future) to assign a new instance to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually just simple aliases to the methods on that object when the module is first built. Re-assigning _rand wouldn't reassign those aliases. numpy.random.standard_normal(), for instance, would still be the .standard_normal() method on the RandomState instance that _rand initially pointed to.
Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* convenience functions) to modify its internal state. That's not changing.
The only thing that will change will be that there will be a new numpy.random.* function to call that will let you give the global RandomState a new basic RNG object that it will swap in internally. Let's call it np.random.swap_global_basic_rng(). If you don't use that function, you won't have a problem. I intend to make this new function *very* explicit about what it is doing, and document the crap out of it so it won't be misused like np.random.seed() is.
I didn't catch that part. Now it's clear.
I never tried that, so I didn't know we could change the global RandomState, and thought we would have to replace np.random.seed usage with a specific RandomState(seed) instance.
I did a quick review of np.random.seed() usage in statsmodels, and I think you are mostly fine. It looks like you mostly use it in unit tests and at the top of examples. The only possible problem that I can see that you might have with the swap_global_basic_rng() is if some other package that you rely on calls it in its library code. Then subsequent statsmodels unit tests might fail because when they call np.random.seed(), it won't be reseeding a Mersenne Twister but another basic RNG.
However, I intend to make that a weird and unnatural thing to do. It's already unlikely to happen as it's a niche requirement that one mostly would need at the start of a whole program, not buried down inside library code. But we will also document that function to discourage such usage, and probably have unconditional noisy warnings that users would have to explicitly silence.
If one of your dependencies did that, you'd be well within your rights to tell them that they are misusing numpy and causing breakage in statsmodels.
In loose analogy:
Matplotlib has a "global" current figure and axis, gca, gcf. In statsmodels we avoid any access to and usage of it and only work with individual figure/axis instances that can be provided by the user. (except for maybe some documentation examples and maybe some "legacy" code.) ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/... )
AFAICS, essentially, statsmodels will need a similar policy for RandomState/RandomGenerator and give up the usage of the global random instance.
I mean, you certainly *should* (outside of unit tests), for very similar reasons why you avoid the global state in matplotlib, but this NEP won't force you to. You should do so anyways under the status quo, too. For any of your functions that call np.random.* functions internally, it's hard to use them in threaded applications, for instance, because they rely on that global state.
scikit-learn's check_random_state() is a good pattern to follow.
Thanks for the clarification.

I just realized that I had replied to the scikit-learn mailing list. I had thought this was numpy-discussion. Sorry about that.

Josef
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 6/16/18 00:59, Robert Kern wrote:
I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn.
https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-...
https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
The screaming has died down on numpy-discussion, and it seems like everyone who has participated over there has more or less come to consensus about accepting this NEP.

However, I'd really appreciate it if I could get some kind of feedback from a scikit-learn dev, whether it's "I don't care" or "I need a couple of days to get around to reading the NEP" or just "+1" or "-1000; this is awful!"

I'm not picky.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
I don't think I have the bandwidth, but I agree :-/ Not sure if any of the other core devs do. I can try to read it next week, but that's probably too late?

On 06/19/2018 02:34 AM, Robert Kern wrote:
On 6/16/18 00:59, Robert Kern wrote:
I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn.
https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-...
https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html
The screaming has died down on numpy-discussion, and it seems like everyone who has participated over there has more or less come to consensus about accepting this NEP. However, I'd really appreciate it if I could get some kind of feedback from a scikit-learn dev, whether it's "I don't care" or "I need a couple of days to get around to reading the NEP" or just "+1" or "-1000; this is awful!"
I'm not picky.
On 6/19/18 08:12, Andreas Mueller wrote:
I don't think I have the bandwidth, but I agree :-/ Not sure if any of the other core devs do. I can try to read it next week, but that's probably too late?
We're not on a deadline. If you're interested in reading the NEP and providing feedback/consent, I'm happy to hold off on formally accepting the NEP until then. Thanks!

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 6/19/18 15:19, Robert Kern wrote:
On 6/19/18 08:12, Andreas Mueller wrote:
I don't think I have the bandwidth but I agree :-/ Not sure if any of the other core devs do. I can try to read it next week but that's probably too late?
We're not on a deadline. If you're interested in reading the NEP and providing feedback/consent, I'm happy to hold off on formally accepting the NEP until then.
I just made a deadline. :-)

I formally proposed acceptance of the NEP. In 7 days, if no one objects, it will be formally marked as Accepted.

https://mail.python.org/pipermail/numpy-discussion/2018-July/078380.html

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Mon, Jun 18, 2018 at 11:34:38PM -0700, Robert Kern wrote:
However, I'd really appreciate it if I could get some kind of feedback from a scikit-learn dev,
I didn't read the NEP, only your summary. That said, it seems quite reasonably aligned with our practice, and hence shouldn't pose a problem. Ideally, I believe that in the long run it should enable us to have cleaner / more robust code, but I suspect that it will take a while before we get there.

Gaël
participants (4)
- Andreas Mueller
- Gael Varoquaux
- josef.pktd@gmail.com
- Robert Kern