[Numpy-discussion] Moving NumPy's PRNG Forward

josef.pktd at gmail.com
Fri Jan 19 12:27:10 EST 2018

On Fri, Jan 19, 2018 at 9:55 AM, Robert Kern <robert.kern at gmail.com> wrote:

> tl;dr: I think that our stream-compatibility policy is holding us back,
> and I think we can come up with a way forward with a new policy that will
> allow us to innovate without seriously compromising our reliability.
> To recap, our current policy for numpy.random is that we guarantee that
> the stream of random numbers from any of the methods of a seeded
> `RandomState` does not change from version to version (at least on the same
> hardware, OS, compiler, etc.), except in the case where we are fixing
> correctness bugs. That is, for code like this:
>   prng = np.random.RandomState(12345)
>   x = prng.normal(10.0, 3.0, size=100)
> `x` will be the same exact floats regardless of what version of numpy was
> installed.
> There seems to be a lot of pent-up motivation to improve on the random
> number generation, in particular the distributions, that has been blocked
> by our policy. I think we've lost a few potential first-time contributors
> that have run up against this wall. We have been pondering ways to allow
> for adding new core PRNGs and improving the distribution methods while
> maintaining stream-compatibility for existing code. Kevin Sheppard, in
> particular, has been working hard to implement new core PRNGs with a common
> API.
>   https://github.com/bashtage/ng-numpy-randomstate
> Kevin has also been working to implement the several proposals that have
> been made to select different versions of distribution implementations. In
> particular, one idea is to pass something to the RandomState constructor to
> select a specific version of distributions (or switch out the core PRNG).
> Note that to satisfy the policy, the simplest method of seeding a
> RandomState will always give you the oldest version: what we have now.
> Kevin has recently come to the conclusion that it's not technically
> feasible to add the version-selection at all if we keep the
> stream-compatibility policy.
>   https://github.com/numpy/numpy/pull/10124#issuecomment-350876221
> I would argue that our current policy isn't providing the value that it
> claims to. In the first place, there are substantial holes in the
> reproducibility of streams across platforms. All of the integers (internal
> and external) are C `long`s, so integer overflows can cause variable
> streams if you use any of the rejection algorithms involving integers
> across Windows and Linux. Plain-old floating point arithmetic differences
> between platforms can cause similar issues (though rarer). Our policy of
> fixing bugs interferes with strict reproducibility. And our changes to
> non-random routines can interfere with the ability to reproduce the results
> of the whole software, independent of the PRNG stream. The multivariate
> normal implementation is even more vulnerable, as it uses `np.linalg`
> routines that may be affected by which LAPACK library numpy is built
> against, not to mention changes that we might make to them in the normal course
> of development.
> At the time I established the policy (2008-9), there was significantly
> less tooling around for pinning versions of software. The
> PyPI/pip/setuptools ecosystem was in its infancy, VMs were slow, cumbersome
> beasts mostly used to run Windows programs unavailable on Linux, and
> containerization a la Docker was merely a dream. A lot of resources have
> been put into reproducible research since then that pins the whole stack
> from OS libraries on up. The need to have stream-compatibility across numpy
> versions for the purpose of reproducible research is much diminished.
> I think that we can relax the strict stream-compatibility policy to allow
> innovation without giving up much practically-usable stability. Let's
> compare with Python's policy:
>   https://docs.python.org/3.6/library/random.html#notes-on-reproducibility
> """
> Most of the random module’s algorithms and seeding functions are subject
> to change across Python versions, but two aspects are guaranteed not to
> change:
> * If a new seeding method is added, then a backward compatible seeder will
> be offered.
> * The generator’s random() method will continue to produce the same
> sequence when the compatible seeder is given the same seed.
> """
> I propose that we adopt a similar policy. This would immediately resolve
> many of the issues blocking innovation in the random distributions.
> Improvements to the distributions could be made at the same rhythm as
> normal features. No version-selection API would be required as you select
> the version by installing the desired version of numpy. By default,
> everyone gets the latest, best versions of the sampling algorithms.
> Selecting a different core PRNG could be easily achieved as
> ng-numpy-randomstate does it, by instantiating different classes. The
> different incompatible ways to initialize different core PRNGs (with unique
> features like selectable streams and the like) are transparently handled:
> different classes have different constructors. There is no need to jam all
> options for all core PRNGs into a single constructor.
> I would add a few more of the simpler distribution methods to the list
> that *is* guaranteed to remain stream-compatible, probably `randint()` and
> `bytes()` and maybe a couple of others. I would appreciate input on the
> matter.
> The current API should remain available and working, but not necessarily
> with the same algorithms. That is, for code like the following:
>   prng = np.random.RandomState(12345)
>   x = prng.normal(10.0, 3.0, size=100)
> `x` is still guaranteed to be 100 normal variates with the appropriate
> mean and standard deviation, but they might be computed by the ziggurat
> method from PCG-generated bytes (or whatever new default core PRNG we have).
> As an alternative, we may also want to leave `np.random.RandomState`
> entirely fixed in place as deprecated legacy code that is never updated.
> This would allow current unit tests that depend on the stream-compatibility
> that we previously promised to still pass until they decide to update.
> Development would move to a different class hierarchy with new names.
> I am personally not at all interested in preserving any stream
> compatibility for the `numpy.random.*` aliases or letting the user swap out
> the core PRNG for the global PRNG that underlies them. `np.random.seed()`
> should be discouraged (if not outright deprecated) in favor of explicitly
> passing around instances.
> In any case, we have a lot of different options to discuss if we decide to
> relax our stream-compatibility policy. At the moment, I'm not pushing for
> any particular changes to the code, just the policy in order to enable a
> more wide-ranging field of options than we have been able to work with so
> far.

I'm not sure I fully understand.
Is the proposal to drop backward stream compatibility completely from now
on, or is it just a one-time change?

> No version-selection API would be required as you select the version by
> installing the desired version of numpy.

That's not useful if we want to have unit tests that run in the same way
across numpy versions.

There are many unit tests that rely on fixed streams and have hard-coded
results that rely on specific numbers (up to floating point, numerical
noise). Giving up stream compatibility would essentially kill using
np.random for these unit tests.
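
A minimal sketch of the kind of test meant here. The function and test names
and the tolerance are hypothetical; the expected values are what a
`RandomState` seeded with 0 produces under the current stream, so the test
only passes while that stream stays fixed:

```python
import numpy as np
from numpy.testing import assert_allclose


def draw_sample(seed=0, size=3):
    """Hypothetical code under test: standard normal draws from a seeded RandomState."""
    prng = np.random.RandomState(seed)
    return prng.standard_normal(size)


def test_sample_matches_hardcoded_values():
    # Values recorded once from RandomState(0) and hard coded into the test.
    # If the algorithm behind standard_normal() changed, this would fail
    # even though the new draws would still be valid normal variates.
    expected = np.array([1.76405235, 0.40015721, 0.97873798])
    assert_allclose(draw_sample(), expected, rtol=1e-6)


test_sample_matches_hardcoded_values()
```

The tolerance covers floating point noise only; it cannot absorb a change
in the sampling algorithm or in the core PRNG.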

Similarly, reproducing another user's results, e.g. in notebooks, would
break without stream compatibility across numpy versions.

One possibility is to keep the current stream-compatible np.random version
and maintain it in the future for those use cases, and add a new
"high-performance" version with the new features.

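For reference, later NumPy releases (1.17 and newer) ship a split of this
kind: `np.random.RandomState` stays as the legacy class, and a separate
`np.random.Generator` takes a core PRNG such as `np.random.PCG64`. The
sketch below assumes such a release is installed; the variable names are
illustrative only:

```python
import numpy as np

seed = 12345

# Legacy path: the existing class, kept for stream-compatible results.
legacy = np.random.RandomState(seed)
x_legacy = legacy.normal(10.0, 3.0, size=100)

# New path: a different class wrapping a different core PRNG. Its
# distribution algorithms are not tied to the legacy stream.
modern = np.random.Generator(np.random.PCG64(seed))
x_modern = modern.normal(10.0, 3.0, size=100)

# Re-seeding the legacy class reproduces the same floats.
x_again = np.random.RandomState(seed).normal(10.0, 3.0, size=100)
legacy_is_reproducible = np.array_equal(x_legacy, x_again)

# The two classes give different streams for the same seed.
streams_differ = not np.array_equal(x_legacy, x_modern)
```

Tests with hard-coded results would keep using the legacy class, while new
code would pick the newer one.
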

> Thanks.
> --
> Robert Kern