[Numpy-discussion] Moving NumPy's PRNG Forward

Robert Kern robert.kern at gmail.com
Fri Jan 19 09:55:57 EST 2018


tl;dr: I think that our stream-compatibility policy is holding us back, and
I think we can come up with a way forward with a new policy that will allow
us to innovate without seriously compromising our reliability.

To recap, our current policy for numpy.random is that we guarantee that the
stream of random numbers from any of the methods of a seeded `RandomState`
does not change from version to version (at least on the same hardware, OS,
compiler, etc.), except in the case where we are fixing correctness bugs.
That is, for code like this:

  prng = np.random.RandomState(12345)
  x = prng.normal(10.0, 3.0, size=100)

`x` will be exactly the same floats regardless of which version of numpy is
installed.

There seems to be a lot of pent-up motivation to improve on the random
number generation, in particular the distributions, that has been blocked
by our policy. I think we've lost a few potential first-time contributors
who have run up against this wall. We have been pondering ways to allow
for adding new core PRNGs and improving the distribution methods while
maintaining stream-compatibility for existing code. Kevin Sheppard, in
particular, has been working hard to implement new core PRNGs with a common
API.

  https://github.com/bashtage/ng-numpy-randomstate

Kevin has also been working to implement the several proposals that have
been made to select different versions of distribution implementations. In
particular, one idea is to pass something to the RandomState constructor to
select a specific version of distributions (or switch out the core PRNG).
Note that to satisfy the policy, the simplest method of seeding a
RandomState will always give you the oldest version: what we have now.

Kevin has recently come to the conclusion that it's not technically
feasible to add the version-selection at all if we keep the
stream-compatibility policy.

  https://github.com/numpy/numpy/pull/10124#issuecomment-350876221

I would argue that our current policy isn't providing the value that it
claims to. In the first place, there are substantial holes in the
reproducibility of streams across platforms. All of the integers (internal
and external) are C `long`s, which are 32 bits wide on Windows but 64 bits
wide on 64-bit Linux, so integer overflows can cause the streams to differ
between those platforms if you use any of the rejection algorithms
involving integers. Plain-old floating point arithmetic differences
between platforms can cause similar issues (though rarer). Our policy of
fixing bugs interferes with strict reproducibility. And our changes to
non-random routines can interfere with the ability to reproduce the results
of the whole software, independent of the PRNG stream. The multivariate
normal implementation is even more vulnerable, as it uses `np.linalg`
routines that may be affected by which LAPACK library numpy is built
against, let alone by changes that we might make to them in the normal
course of development.
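
To make the `long` issue concrete, here is a minimal pure-Python sketch. It
is not numpy's code: `wrap_signed` and `scaled_draw` are hypothetical
helpers, and the sketch assumes that an overflowing C `long` wraps around
in two's complement, which is typical in practice but not guaranteed by
the C standard.

```python
def wrap_signed(value, bits):
    """Reduce an integer to a two's-complement signed value of `bits` width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def scaled_draw(raw, n, long_bits):
    """Hypothetical integer step: multiply a raw draw by n in a C `long`."""
    return wrap_signed(raw * n, long_bits)


raw = 2000000000  # a raw draw that fits comfortably in 31 bits
n = 3             # a modest multiplier

on_linux = scaled_draw(raw, n, 64)    # 64-bit long: no overflow
on_windows = scaled_draw(raw, n, 32)  # 32-bit long: the product wraps

print(on_linux)    # 6000000000
print(on_windows)  # 1705032704
```

Any accept/reject decision made on a value like this can go different ways
on the two platforms, after which the streams no longer line up.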

At the time I established the policy (2008-9), there was significantly less
tooling around for pinning versions of software. The PyPI/pip/setuptools
ecosystem was in its infancy, VMs were slow, cumbersome beasts mostly used
to run Windows programs unavailable on Linux, and containerization a la
Docker was merely a dream. A lot of resources have since been put into
tooling for reproducible research that pins the whole stack from OS
libraries on up. The need to have stream-compatibility across numpy
versions for the purpose of reproducible research is much diminished.

I think that we can relax the strict stream-compatibility policy to allow
innovation without giving up much practically-usable stability. Let's
compare with Python's policy:

  https://docs.python.org/3.6/library/random.html#notes-on-reproducibility

"""
Most of the random module’s algorithms and seeding functions are subject to
change across Python versions, but two aspects are guaranteed not to change:

* If a new seeding method is added, then a backward compatible seeder will
be offered.
* The generator’s random() method will continue to produce the same
sequence when the compatible seeder is given the same seed.
"""

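As a small illustration of what Python's guarantee covers, the sketch
below seeds two independent `random.Random` instances and compares their
`random()` output. Running it can only show reproducibility within one
interpreter; the cross-version part is the documented promise. The helper
`first_draws` is a hypothetical name, and methods such as `gauss()` are
outside the guarantee.

```python
import random


def first_draws(seed, count=5):
    """Return the first `count` values of random() for a freshly seeded generator."""
    generator = random.Random(seed)
    return [generator.random() for _ in range(count)]


a = first_draws(12345)
b = first_draws(12345)

# Same seed, same seeder, same random() sequence.
print(a == b)  # True
```
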
I propose that we adopt a similar policy. This would immediately resolve
many of the issues blocking innovation in the random distributions.
Improvements to the distributions could be made at the same rhythm as
normal features. No version-selection API would be required as you select
the version by installing the desired version of numpy. By default,
everyone gets the latest, best versions of the sampling algorithms.
Selecting a different core PRNG could be easily achieved as
ng-numpy-randomstate does it, by instantiating different classes. The
different incompatible ways to initialize different core PRNGs (with unique
features like selectable streams and the like) are transparently handled:
different classes have different constructors. There is no need to jam all
options for all core PRNGs into a single constructor.
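
A toy sketch of that class-per-PRNG design follows. Everything in it is
hypothetical (`ToyLCG`, `ToyStreamLCG`, `Distributions`, and their
methods); it is not the ng-numpy-randomstate API, and the generators are
simple LCGs chosen only to keep the example short.

```python
MASK64 = (1 << 64) - 1
MULTIPLIER = 6364136223846793005  # a widely used 64-bit LCG multiplier


class ToyLCG:
    """Hypothetical core PRNG: one integer seed, one fixed stream."""

    def __init__(self, seed):
        self._state = seed & MASK64

    def next_uint32(self):
        self._state = (self._state * MULTIPLIER + 1442695040888963407) & MASK64
        return self._state >> 32  # keep the high 32 bits


class ToyStreamLCG:
    """Hypothetical core PRNG whose constructor also takes a stream id."""

    def __init__(self, seed, stream=0):
        self._state = seed & MASK64
        # The stream id selects an odd increment for the LCG.
        self._increment = ((stream << 1) | 1) & MASK64

    def next_uint32(self):
        self._state = (self._state * MULTIPLIER + self._increment) & MASK64
        return self._state >> 32


class Distributions:
    """Hypothetical distribution layer that accepts any core PRNG."""

    def __init__(self, core):
        self._core = core

    def random(self):
        # Uniform float in [0, 1) built from 32 random bits.
        return self._core.next_uint32() / 2.0**32

    def uniform(self, low, high):
        return low + (high - low) * self.random()


plain = Distributions(ToyLCG(12345))
streamed = Distributions(ToyStreamLCG(12345, stream=7))
plain_draws = [plain.uniform(0.0, 10.0) for _ in range(5)]
streamed_draws = [streamed.uniform(0.0, 10.0) for _ in range(5)]
```

The `stream` argument lives only on the class that supports it, and the
distribution code never needs to know which core PRNG it was handed.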

I would add a few more of the simpler distribution methods to the list that
*is* guaranteed to remain stream-compatible, probably `randint()` and
`bytes()` and maybe a couple of others. I would appreciate input on the
matter.

The current API should remain available and working, but not necessarily
with the same algorithms. That is, for code like the following:

  prng = np.random.RandomState(12345)
  x = prng.normal(10.0, 3.0, size=100)

`x` is still guaranteed to be 100 normal variates with the appropriate mean
and standard deviation, but they might be computed by the ziggurat method
from PCG-generated bytes (or whatever new default core PRNG we have).
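
Under such a policy, a test would check statistical properties rather than
exact floats. A rough sketch, with tolerances that are my own assumption
(about five standard errors) rather than anything numpy prescribes:

```python
import numpy as np

prng = np.random.RandomState(12345)
x = prng.normal(10.0, 3.0, size=100)

# Properties that any correct sampler should satisfy, whatever the algorithm.
standard_error = 3.0 / np.sqrt(x.size)  # 0.3 for 100 samples
assert x.shape == (100,)
assert abs(x.mean() - 10.0) < 5 * standard_error
assert abs(x.std(ddof=1) - 3.0) < 1.0
```

A test like this keeps passing when the sampling algorithm changes, at the
cost of a very small chance of a spurious failure for an unlucky seed.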

As an alternative, we may also want to leave `np.random.RandomState`
entirely fixed in place as deprecated legacy code that is never updated.
This would allow current unit tests that depend on the stream-compatibility
that we previously promised to keep passing until their authors decide to
update. Development would move to a different class hierarchy with new
names.

I am personally not at all interested in preserving any stream
compatibility for the `numpy.random.*` aliases or letting the user swap out
the core PRNG for the global PRNG that underlies them. `np.random.seed()`
should be discouraged (if not outright deprecated) in favor of explicitly
passing around instances.
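
For illustration, passing an instance around looks like the sketch below;
`simulate_measurements` is a hypothetical function standing in for user
code.

```python
import numpy as np


def simulate_measurements(prng, n):
    """Draw n noisy measurements from the generator the caller supplies."""
    return prng.normal(10.0, 3.0, size=n)


prng = np.random.RandomState(12345)
first = simulate_measurements(prng, 100)

# Unrelated code that touches the global generator does not affect `prng`.
np.random.seed(0)
np.random.normal(size=10)

# A fresh instance with the same seed replays the same values.
replay = simulate_measurements(np.random.RandomState(12345), 100)
print(np.array_equal(first, replay))  # True
```

The function's randomness is then controlled entirely by its argument, not
by whatever state the global generator happens to be in.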

In any case, we have a lot of different options to discuss if we decide to
relax our stream-compatibility policy. At the moment, I'm not pushing for
any particular changes to the code, just to the policy, in order to enable
a more wide-ranging field of options than we have been able to work with so
far.

Thanks.

-- 
Robert Kern