Moving NumPy's PRNG Forward
On Fri, Jan 19, 2018 at 9:55 AM, Robert Kern
tl;dr: I think that our stream-compatibility policy is holding us back, and I think we can come up with a way forward with a new policy that will allow us to innovate without seriously compromising our reliability.
To recap, our current policy for numpy.random is that we guarantee that the stream of random numbers from any of the methods of a seeded `RandomState` does not change from version to version (at least on the same hardware, OS, compiler, etc.), except in the case where we are fixing correctness bugs. That is, for code like this:
prng = np.random.RandomState(12345)
x = prng.normal(10.0, 3.0, size=100)
`x` will be the same exact floats regardless of what version of numpy was installed.
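The guarantee is easy to state as a runnable check (within a single numpy version; the cross-version claim is the policy itself):

```python
import numpy as np

# Two identically seeded RandomState objects yield bit-for-bit equal draws.
a = np.random.RandomState(12345).normal(10.0, 3.0, size=100)
b = np.random.RandomState(12345).normal(10.0, 3.0, size=100)
```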
There seems to be a lot of pent-up motivation to improve on the random number generation, in particular the distributions, that has been blocked by our policy. I think we've lost a few potential first-time contributors that have run up against this wall. We have been pondering ways to allow for adding new core PRNGs and improve the distribution methods while maintaining stream-compatibility for existing code. Kevin Sheppard, in particular, has been working hard to implement new core PRNGs with a common API.
https://github.com/bashtage/ng-numpy-randomstate
Kevin has also been working to implement the several proposals that have been made to select different versions of distribution implementations. In particular, one idea is to pass something to the RandomState constructor to select a specific version of distributions (or switch out the core PRNG). Note that to satisfy the policy, the simplest method of seeding a RandomState will always give you the oldest version: what we have now.
Kevin has recently come to the conclusion that it's not technically feasible to add the version-selection at all if we keep the stream-compatibility policy.
https://github.com/numpy/numpy/pull/10124#issuecomment-350876221
I would argue that our current policy isn't providing the value that it claims to. In the first place, there are substantial holes in the reproducibility of streams across platforms. All of the integers (internal and external) are C `long`s, so integer overflows can cause streams to diverge between Windows and Linux if you use any of the rejection algorithms involving integers. Plain old floating-point arithmetic differences between platforms can cause similar issues (though more rarely). Our policy of fixing correctness bugs already interferes with strict reproducibility. And changes to non-random routines can interfere with the ability to reproduce the results of the whole program, independent of the PRNG stream. The multivariate normal implementation is even more vulnerable, as it uses `np.linalg` routines that may be affected by which LAPACK library numpy is built against, to say nothing of changes that we might make to those routines in the normal course of development.
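The `long`-width difference behind the overflow issue is easy to observe (a minimal check, not numpy code):

```python
import ctypes

# One concrete hole: C `long` is 32 bits on 64-bit Windows (LLP64) but
# 64 bits on 64-bit Linux/macOS (LP64), so integer overflow happens at
# different points in the rejection samplers on the two platforms.
long_bits = ctypes.sizeof(ctypes.c_long) * 8
```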
At the time I established the policy (2008-9), there was significantly less tooling around for pinning versions of software. The PyPI/pip/setuptools ecosystem was in its infancy, VMs were slow cumbersome beasts mostly used to run Windows programs unavailable on Linux, and containerization a la Docker was merely a dream. A lot of resources have been put into reproducible research since then that pins the whole stack from OS libraries on up. The need to have stream-compatibility across numpy versions for the purpose of reproducible research is much diminished.
I think that we can relax the strict stream-compatibility policy to allow innovation without giving up much practically-usable stability. Let's compare with Python's policy:
https://docs.python.org/3.6/library/random.html#notes-on-reproducibility
""" Most of the random module’s algorithms and seeding functions are subject to change across Python versions, but two aspects are guaranteed not to change:
* If a new seeding method is added, then a backward compatible seeder will be offered.
* The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
"""
I propose that we adopt a similar policy. This would immediately resolve many of the issues blocking innovation in the random distributions. Improvements to the distributions could be made at the same rhythm as normal features. No version-selection API would be required as you select the version by installing the desired version of numpy. By default, everyone gets the latest, best versions of the sampling algorithms. Selecting a different core PRNG could be easily achieved as ng-numpy-randomstate does it, by instantiating different classes. The different incompatible ways to initialize different core PRNGs (with unique features like selectable streams and the like) are transparently handled: different classes have different constructors. There is no need to jam all options for all core PRNGs into a single constructor.
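A minimal sketch of this class-per-PRNG pattern (all class names hypothetical, not taken from ng-numpy-randomstate): each core PRNG gets its own constructor with its own options, and the distribution layer only needs a common bit-stream interface.

```python
import random  # stdlib Random stands in for real core PRNGs in this toy


class ToyMT:
    """Hypothetical core PRNG: plain seed, no extra options."""
    def __init__(self, seed):
        self._r = random.Random(seed)

    def next_uint32(self):
        return self._r.getrandbits(32)


class ToyPCG:
    """Hypothetical core PRNG with a PCG-style selectable stream."""
    def __init__(self, seed, stream=0):
        self._r = random.Random(seed * 1000003 + stream)

    def next_uint32(self):
        return self._r.getrandbits(32)


class Distributions:
    """Distribution methods written against the common interface."""
    def __init__(self, core):
        self.core = core

    def random(self):
        return self.core.next_uint32() / 2**32


# Different constructors for different PRNGs, no options jammed together:
u = Distributions(ToyPCG(123, stream=7)).random()
```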
I would add a few more of the simpler distribution methods to the list that *is* guaranteed to remain stream-compatible, probably `randint()` and `bytes()` and maybe a couple of others. I would appreciate input on the matter.
The current API should remain available and working, but not necessarily with the same algorithms. That is, for code like the following:
prng = np.random.RandomState(12345)
x = prng.normal(10.0, 3.0, size=100)
`x` is still guaranteed to be 100 normal variates with the appropriate mean and standard deviation, but they might be computed by the ziggurat method from PCG-generated bytes (or whatever new default core PRNG we have).
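Under the relaxed policy, what survives is a property-based check rather than a bit-for-bit one; a sketch of the distinction:

```python
import numpy as np

# Only the distribution is promised: normal variates with mean ~10 and
# sd ~3, however they are computed internally (Box-Muller, ziggurat, ...).
prng = np.random.RandomState(12345)
x = prng.normal(10.0, 3.0, size=100_000)
mean_ok = abs(x.mean() - 10.0) < 0.1
sd_ok = abs(x.std() - 3.0) < 0.1
```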
As an alternative, we may also want to leave `np.random.RandomState` entirely fixed in place as deprecated legacy code that is never updated. This would allow current unit tests that depend on the stream-compatibility that we previously promised to still pass until they decide to update. Development would move to a different class hierarchy with new names.
I am personally not at all interested in preserving any stream compatibility for the `numpy.random.*` aliases or letting the user swap out the core PRNG for the global PRNG that underlies them. `np.random.seed()` should be discouraged (if not outright deprecated) in favor of explicitly passing around instances.
In any case, we have a lot of different options to discuss if we decide to relax our stream-compatibility policy. At the moment, I'm not pushing for any particular changes to the code, just the policy in order to enable a more wide-ranging field of options that we have been able to work with so far.
I'm not sure I fully understand. Is the proposal to drop stream-backward compatibility completely for the future, or just a one-time change?
No version-selection API would be required as you select the version by installing the desired version of numpy.
That's not useful if we want to have unit tests that run in the same way across numpy versions. There are many unit tests that rely on fixed streams and have hard-coded results that depend on specific numbers (up to floating-point numerical noise). Giving up stream compatibility would essentially kill using np.random for these unit tests.

Similarly, reproducibility by another user, e.g. in notebooks, would break without stream compatibility across numpy versions.

One possibility is to keep the current stream-compatible np.random version and maintain it in the future for those use cases, and add a new "high-performance" version with the new features.

Josef
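The unit-test pattern Josef describes can be sketched as follows; comparing two identically seeded generators stays valid within any single numpy version, unlike literal values pasted in from one version:

```python
import numpy as np

# A seed-pinned sampler of the kind fixed-stream tests rely on.
def sampler(seed):
    prng = np.random.RandomState(seed)
    return prng.randint(0, 100, size=10)

result = sampler(7)
rerun = sampler(7)  # must reproduce result exactly on this numpy version
```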
Thanks.
-- Robert Kern
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sat, Jan 20, 2018 at 2:27 AM,
I'm not sure I fully understand. Is the proposal to drop stream-backward compatibility completely for the future, or just a one-time change?

For all future.

No version-selection API would be required as you select the version by installing the desired version of numpy.

That's not useful if we want to have unit tests that run in the same way across numpy versions. There are many unit tests that rely on fixed streams and have hard-coded results that depend on specific numbers (up to floating-point numerical noise). Giving up stream compatibility would essentially kill using np.random for these unit tests.

This is a use case that I am sensitive to. However, it should be noted that relying on the exact stream for unit tests makes you vulnerable to platform differences. That's something that we've never guaranteed (because we can't). That said, some of the simpler distributions are more robust to such things, and those are fairly typical in unit tests. As I mentioned, I am open to a small set of methods that we do guarantee stream-compatibility for. I think that unit tests are the obvious use case that should determine what that set is. Unit tests rarely need `noncentral_chisquare()`, for instance.

I'd also be willing to make the API a little clunkier in order to maintain the stable set of methods. For example, two methods that are common in unit testing are `normal()` and `choice()`, but those have been the target of the most attempted innovation. I'd be willing to leave them alone while providing other methods that do the same thing but are allowed to innovate.
Similar, reproducibility from another user, e.g. in notebooks, would break without stream compatibility across numpy versions.
That is the reproducible-research use case that I discussed already. I argued that the stability that our policy actually provides is rather more muted than what it seems on its face.
One possibility is to keep the current stream-compatible np.random version and maintain it in future for those usecases, and add a new "high-performance" version with the new features.
That is one of the alternatives I raised. -- Robert Kern
On Sat, Jan 20, 2018 at 7:34 AM, Robert Kern
On Sat, Jan 20, 2018 at 2:27 AM,
wrote: I'm not sure I fully understand. Is the proposal to drop stream-backward compatibility completely for the future, or just a one-time change?
For all future.
To color this a little, while we'll have a burst of activity for the first round to fix all of the mistakes I baked in early, I do not expect the pace of change after that to be very large. While we are going to relax the strictness of the policy, we should carefully weigh the benefit of a change against the pain of changing the stream, and explore ways to implement the improvement that would retain the stream. We should take more care making such changes in `np.random` than in other parts of numpy. When we get to drafting the NEP, I'd be happy to include language to this effect. -- Robert Kern
On Fri, Jan 19, 2018 at 6:57 AM Robert Kern
As an alternative, we may also want to leave `np.random.RandomState` entirely fixed in place as deprecated legacy code that is never updated. This would allow current unit tests that depend on the stream-compatibility that we previously promised to still pass until they decide to update. Development would move to a different class hierarchy with new names.
I like this alternative, but I would hesitate to call it "deprecated". Users who care about exact reproducibility across NumPy versions (e.g., for testing) are probably less concerned about performance, and could continue to use it. New random number generator classes could implement their own guarantees about compatibility across their methods.

I am personally not at all interested in preserving any stream compatibility for the `numpy.random.*` aliases or letting the user swap out the core PRNG for the global PRNG that underlies them. `np.random.seed()` should be discouraged (if not outright deprecated) in favor of explicitly passing around instances.
I agree that np.random.seed() should be discouraged, but it feels very late in NumPy's development to remove it. If we do alter the random number streams for numpy.random.*, it seems that we should probably issue a warning (at least for several major versions) whenever numpy.random.seed() is called. This could get pretty noisy. I guess that's all the more incentive to switch to random state objects.
On Sat, Jan 20, 2018 at 2:57 AM, Stephan Hoyer
On Fri, Jan 19, 2018 at 6:57 AM Robert Kern
wrote: As an alternative, we may also want to leave `np.random.RandomState`
entirely fixed in place as deprecated legacy code that is never updated. This would allow current unit tests that depend on the stream-compatibility that we previously promised to still pass until they decide to update. Development would move to a different class hierarchy with new names.
I like this alternative, but I would hesitate to call it "deprecated".
New random number generator classes could implement their own guarantees about compatibility across their methods.
Users who care about exact reproducibility across NumPy versions (e.g., for testing) are probably less concerned about performance, and could continue to use it.

I would be careful about that because quite a few of the methods are not stable across platforms, even on the same numpy version. If you want to declare that some part of the np.random API is stable for such purposes, we need to curate a subset of the methods anyways. As a one-off thing, this alternative proposes to declare that all of `np.random.RandomState` is stable across versions, but we can't guarantee that all of it is unconditionally stable for exact reproducibility. We can make a guarantee for a smaller subset of methods, though. To your point, though, if we freeze the current `RandomState`, we can make that guarantee for a larger subset of the methods than we would for the new API. So I guess I talked myself around to your view, but I would be a bit more cautious in how we advertise the stability of the frozen `RandomState` API.

I am personally not at all interested in preserving any stream compatibility for the `numpy.random.*` aliases or letting the user swap out the core PRNG for the global PRNG that underlies them. `np.random.seed()` should be discouraged (if not outright deprecated) in favor of explicitly passing around instances.
I agree that np.random.seed() should be discouraged, but it feels very
late in NumPy's development to remove it.
If we do alter the random number streams for numpy.random.*, it seems that we should probably issue a warning (at least for several major versions) whenever numpy.random.seed() is called. This could get pretty noisy. I guess that's all the more incentive to switch to random state objects.

True. I like that. The reason I think that it might be worth an exception is that it has been a moral hazard. People aren't just writing correct but improvable code (relying on `np.random.*` methods but seeding exactly once at the start of their single-threaded simulation); they've been writing incorrect and easily broken code. For example:

    np.random.seed(seed)
    np.random.shuffle(x_train)
    np.random.seed(seed)
    np.random.shuffle(labels_train)

-- Robert Kern
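A hedged sketch of the safer pattern (variable names invented): shuffle features and labels together with one shared permutation from a single explicit RandomState, instead of re-seeding the global PRNG before each shuffle.

```python
import numpy as np

# One explicit generator, one permutation applied to both arrays, so the
# pairing between samples and labels cannot silently drift apart.
prng = np.random.RandomState(42)
x_train = np.arange(10)
labels_train = np.arange(10)
perm = prng.permutation(len(x_train))
x_train, labels_train = x_train[perm], labels_train[perm]
```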
On Fri, Jan 19, 2018 at 6:55 AM, Robert Kern
There seems to be a lot of pent-up motivation to improve on the random number generation, in particular the distributions, that has been blocked by our policy. I think we've lost a few potential first-time contributors that have run up against this wall. We have been pondering ways to allow for adding new core PRNGs and improve the distribution methods while maintaining stream-compatibility for existing code. Kevin Sheppard, in particular, has been working hard to implement new core PRNGs with a common API.
https://github.com/bashtage/ng-numpy-randomstate
Kevin has also been working to implement the several proposals that have been made to select different versions of distribution implementations. In particular, one idea is to pass something to the RandomState constructor to select a specific version of distributions (or switch out the core PRNG). Note that to satisfy the policy, the simplest method of seeding a RandomState will always give you the oldest version: what we have now.
Kevin has recently come to the conclusion that it's not technically feasible to add the version-selection at all if we keep the stream-compatibility policy.
https://github.com/numpy/numpy/pull/10124#issuecomment-350876221
I would argue that our current policy isn't providing the value that it claims to.
I agree that relaxing our policy would be better than the status quo. Before making any decisions, though, I'd like to make sure we understand the alternatives and their trade-offs. Specifically, I think the main alternative would be the following approach to versioning:

1) make RandomState's state be a tuple (underlying RNG algorithm, underlying RNG state, distribution version)

2) zero-argument initialization/seeding, like RandomState() or rstate.seed(), sets the state to: (our recommended RNG algorithm, os.urandom(...), version=LATEST_VERSION)

3) for backcompat, single-argument seeding like RandomState(123) or rstate.seed(123), sets the state to: (mersenne twister, expand_mt_seed(123), version=0)

4) also allow seeding to explicitly control all the parameters, like RandomState(PCG_XSL_RR(123), version=12) or whatever

5) the distribution functions are implemented like:

    def normal(self, *args, **kwargs):
        if self.version < 3:
            return self._normal_box_muller(*args, **kwargs)
        elif self.version < 8:
            return self._normal_ziggurat_v1(*args, **kwargs)
        else:  # version >= 8
            return self._normal_ziggurat_v2(*args, **kwargs)

Advantages: fully backwards compatible; preserves the compatibility guarantee (such as it is); users who use the default seeding automatically get the highest speed and quality.

Disadvantages: users who specify seeds explicitly get old/slow distributions (but of course that's the point of compatibility); we have to keep the old distribution code around forever (but this is not too hard; it just sits in some function and we never touch it).

Kevin, is this the version that you think is non-viable? Is the above a good description of the advantages/disadvantages?

-n

-- Nathaniel J. Smith -- https://vorpus.org
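The version-dispatch idea above can be made concrete as a runnable toy (class name and version thresholds are invented for illustration, not a proposed API):

```python
LATEST_VERSION = 8  # hypothetical current distribution version

class VersionedState:
    def __init__(self, seed=None, version=None):
        # No seed -> fresh entropy and the latest algorithms;
        # an explicit seed -> the frozen legacy stream (version 0),
        # unless the caller also pins a version explicitly.
        if version is None:
            version = LATEST_VERSION if seed is None else 0
        self.version = version

    def normal_algorithm(self):
        # Stand-in for the dispatch inside normal().
        if self.version < 3:
            return "box-muller"
        elif self.version < 8:
            return "ziggurat-v1"
        else:
            return "ziggurat-v2"

legacy = VersionedState(123)             # backcompat path
fresh = VersionedState()                 # latest algorithms
pinned = VersionedState(123, version=5)  # explicit opt-in
```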
On Fri, Jan 19, 2018 at 6:13 PM Nathaniel Smith
...
I agree that relaxing our policy would be better than the status quo. Before making any decisions, though, I'd like to make sure we understand the alternatives and their trade-offs. Specifically, I think the main alternative would be the following approach to versioning:
1) make RandomState's state be a tuple (underlying RNG algorithm, underlying RNG state, distribution version)
2) zero-argument initialization/seeding, like RandomState() or rstate.seed(), sets the state to: (our recommended RNG algorithm, os.urandom(...), version=LATEST_VERSION)
3) for backcompat, single-argument seeding like RandomState(123) or rstate.seed(123), sets the state to: (mersenne twister, expand_mt_seed(123), version=0)
4) also allow seeding to explicitly control all the parameters, like RandomState(PCG_XSL_RR(123), version=12) or whatever
5) the distribution functions are implemented like:

    def normal(self, *args, **kwargs):
        if self.version < 3:
            return self._normal_box_muller(*args, **kwargs)
        elif self.version < 8:
            return self._normal_ziggurat_v1(*args, **kwargs)
        else:  # version >= 8
            return self._normal_ziggurat_v2(*args, **kwargs)
I like this suggestion, but I suggest modifying it so that zero-argument initialization or one-argument seeding initializes from a global default, which would itself default to backcompat but could be changed. Then my old code would by default produce the same old results, but adding one line at the top switches to the faster code if I want.
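A toy sketch of this global-default variant (all names invented): a module-level default distribution version that seeded construction falls back to, with a one-line switch to opt into the new algorithms.

```python
DEFAULT_VERSION = 0  # backcompat unless the user opts in

def set_default_version(v):
    # The "1 line at the top" that opts a whole script into new code.
    global DEFAULT_VERSION
    DEFAULT_VERSION = v

class State:
    def __init__(self, seed=None, version=None):
        # Seeded or not, fall back to the module-wide default version.
        self.version = DEFAULT_VERSION if version is None else version
        self.seed = seed

old_style = State(12345)   # old results by default
set_default_version(8)
new_style = State(12345)   # same seed, new algorithms
```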
participants (5)
- josef.pktd@gmail.com
- Nathaniel Smith
- Neal Becker
- Robert Kern
- Stephan Hoyer