Re: [Numpy-discussion] Moving NumPy's PRNG Forward

I am a firm believer that the current situation is not sustainable. There are a lot of improvements that can practically be incorporated. While many of these are performance related, there are also improvements in accuracy over some ranges of parameters that cannot be incorporated without changing the stream. I also think that perfect stream reproducibility is a bit of a myth across versions, since it would really require an identical OS, compiler, and possibly CPU for some of the generators that produce floats.
I believe there is a case for separating the random generator from core NumPy. Some points that favor becoming a subproject:
1. It is a pure consumer of the NumPy API. Other parts of the API do not depend on random.
2. A standalone package could be installed alongside many different versions of core NumPy, which would reduce the pressure to freeze the stream.
In terms of what is needed, I think that the underlying PRNG should be swappable. This will provide a simple mechanism to allow certain types of advancement while easily providing backward compatibility. In the current design this is very hard and requires compiling many nearly identical copies of RandomState. In pseudocode, something like:
standard_normal(prng)
where prng is a basic class that retains the PRNG state and has a small set of core random number generators that belong to the underlying PRNG -- probably something like int32, int64, double, and possibly int53. I am not advocating explicitly passing the PRNG as an argument, but having generators which can take any suitable PRNG would add a lot of flexibility in terms of taking advantage of improvements in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024). The "small" core PRNG would have responsibility over state and streams. The remainder of the module would transform the underlying PRNG into the required distributions.
This would also simplify making improvements, since old versions could be retained while improved versions are added to the API. For example:
from numpy.random import standard_normal, prng  # preferred versions
standard_normal(prng)  # ziggurat

from numpy.random.legacy import standard_normal_bm, mt19937  # legacy generators
standard_normal_bm(mt19937)  # Box-Muller
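To make the contract concrete, here is a minimal sketch of the kind of object I have in mind (all names are hypothetical, and the LCG is just a stand-in for a real generator like MT19937 or xoroshiro128):

    import math

    class CorePRNG:
        """Owns the state/stream; exposes only the small set of raw draws."""

        def __init__(self, seed=0):
            self._state = seed & 0xFFFFFFFFFFFFFFFF

        def random_uint64(self):
            # One 64-bit LCG step -- illustrative only, not a proposed generator.
            self._state = (self._state * 6364136223846793005
                           + 1442695040888963407) & 0xFFFFFFFFFFFFFFFF
            return self._state

        def random_double(self):
            # Uniform double in [0, 1) from the top 53 bits.
            return (self.random_uint64() >> 11) / 9007199254740992.0

    def standard_normal(prng):
        # Distribution function: works with any object exposing the raw draws.
        # Box-Muller here for brevity; the preferred version would use a ziggurat.
        u1 = 1.0 - prng.random_double()  # avoid log(0)
        u2 = prng.random_double()
        return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)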
Kevin

On Sat, Jan 27, 2018 at 1:14 AM, Kevin Sheppard kevin.k.sheppard@gmail.com wrote:
I am a firm believer that the current situation is not sustainable.
There are a lot of improvements that can practically be incorporated. While many of these are performance related, there are also improvements in accuracy over some ranges of parameters that cannot be incorporated without changing the stream. I also think that perfect stream reproducibility is a bit of a myth across versions, since it would really require an identical OS, compiler, and possibly CPU for some of the generators that produce floats.
I believe there is a case for separating the random generator from core
NumPy. Some points that favor becoming a subproject:
- It is a pure consumer of the NumPy API. Other parts of the API do not depend on random.
- A standalone package could be installed alongside many different versions of core NumPy, which would reduce the pressure to freeze the stream.
Removing numpy.random (or freezing it as deprecated legacy code while all PRNG development moves elsewhere) is probably a non-starter. It's too widely used for us not to provide something. That said, we can (and ought to) make it much easier for external packages to provide PRNG capabilities (core PRNGs and distributions) that interoperate with the core functionality that numpy provides. I'm also happy to place a high barrier on adding more distributions to numpy.random once that is in place.
Specifically, core uniform PRNGs should have a small common C API that distribution functions can use. This might just be a struct with an opaque `void*` state pointer and two function pointers for drawing a uint64 (whole range) and a double in [0, 1) from the state. It's important to expose our core uniform PRNGs as a C API because there has been a desire to interoperate at that level, using the same PRNG state inside C or Fortran or GPU code. If that's in place, then people can write new, efficient distribution functions in C that use this small C API agnostic to the core PRNG algorithm. It also makes it easy to implement new core PRNGs that the distribution functions provided by numpy.random can use.
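Roughly, I'm picturing something like this (a Cython sketch; all names are placeholders, not a final API):

    from libc.stdint cimport uint64_t
    from libc.math cimport log

    # Core PRNG C API: an opaque state pointer plus two function pointers
    # for the raw draws.
    ctypedef struct prng_t:
        void *state
        uint64_t (*next_uint64)(void *state)  # uint64 over the whole range
        double (*next_double)(void *state)    # double in [0, 1)

    # A distribution function written against this API is agnostic to the
    # core PRNG algorithm, e.g. a standard exponential via inversion:
    cdef double standard_exponential(prng_t *prng):
        return -log(1.0 - prng.next_double(prng.state))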
In terms of what is needed, I think that the underlying PRNG should be
swappable. This will provide a simple mechanism to allow certain types of advancement while easily providing backward compatibility. In the current design this is very hard and requires compiling many nearly identical copies of RandomState. In pseudocode, something like:
standard_normal(prng)
where prng is a basic class that retains the PRNG state and has a small
set of core random number generators that belong to the underlying PRNG -- probably something like int32, int64, double, and possibly int53. I am not advocating explicitly passing the PRNG as an argument, but having generators which can take any suitable PRNG would add a lot of flexibility in terms of taking advantage of improvements in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024). The "small" core PRNG would have responsibility over state and streams. The remainder of the module would transform the underlying PRNG into the required distributions.
(edit: after writing the following verbiage, I realize it can be summed up with more respect to your suggestion: yes, we should do this design, but we don't need to and shouldn't give up on a class with distribution methods.)
Once the core PRNG C API is in place, I don't think we necessarily need to move away from a class structure per se, though it becomes an option. We just separate the core PRNG object from the distribution-providing class. We don't need to make copies of the distribution-providing class just to use a new core PRNG. I'm coming around to Nathaniel's suggestion for the constructor API (though not the distribution-versioning, for reasons I can get into later). We have a couple of core uniform PRNG classes like `MT19937` and `PCG128`. Those have a tiny API, and probably don't have a lot of unnecessary code clones between them. Their constructors can be different depending on the different ways they can be instantiated, depending on the PRNG's features. I'm not sure that they'll have any common methods besides `__getstate__/__setstate__` and probably a `copy()`. They will expose their C API as a Python-opaque attribute. They can have whatever algorithm-dependent methods they need (e.g. to support jumpahead). I might not even expose to Python the uint64 and U(0,1) double sampling methods, but maybe so.
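In skeleton form, something like this (hypothetical; plain Python attributes stand in for the C-level state):

    class PCG128:
        def __init__(self, seed=0, inc=0):
            # Constructor arguments are algorithm-specific.
            self._state = (seed, inc)
            # self.capi would be a Python-opaque capsule wrapping the
            # C-level prng_t struct for distribution functions to consume.

        def __getstate__(self):
            return self._state

        def __setstate__(self, state):
            self._state = state

        def copy(self):
            new = PCG128()
            new.__setstate__(self.__getstate__())
            return new

        # Algorithm-dependent extras (e.g. an advance()/jumpahead method)
        # would go here.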
Then we have a single `Distributions` class that provides all of the distributions that we want to support in numpy.random (i.e. what we currently have on `RandomState` and whatever passes our higher bar in the future). It takes one of the core PRNG instances as an argument to the constructor (nominally, at least; we can design factory functions to make this more convenient).
prng = Distributions(PCG128(seed))
x = prng.normal(mean, std)
If someone wants to write a WELL512 core PRNG, they can just implement that object and pass it to `Distributions()`. The `Distributions` code doesn't need to be copied, nor do we need to muck around with `__new__` tricks in Cython.
Why do this instead of distribution functions? Well, I have a damn lot of code that is expecting an object with at least the broad outlines of the `RandomState` interface. I'm not going to update that code to use functions instead. And if I'm not, no one is. There isn't a good transition path since the PRNG object needs to thread through all of the code, whether it's my code that I'm writing greenfield or library code that I don't control. That said, people writing new distributions outside of numpy.random would be writing functions, not trying to add to the `Distributions` class, but that's fine.
It also allows the possibility for the `Distributions` class to be stateful if we want to do things like caching the next Box-Muller variate and not force that onto the core PRNG state like I currently do. Though I'd rather just drop Box-Muller, and that's not a common pattern outside of Box-Muller. But it's a possibility.
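For illustration, the caching pattern would look something like this (a sketch, not the actual implementation, and assuming the core PRNG exposes a random_double() draw):

    import math

    class Distributions:
        def __init__(self, prng):
            self._prng = prng
            self._spare_normal = None  # cached second Box-Muller variate

        def standard_normal_bm(self):
            # Box-Muller produces normals in pairs; keep the spare on the
            # Distributions instance instead of in the core PRNG state.
            if self._spare_normal is not None:
                z, self._spare_normal = self._spare_normal, None
                return z
            u1 = 1.0 - self._prng.random_double()  # avoid log(0)
            u2 = self._prng.random_double()
            r = math.sqrt(-2.0 * math.log(u1))
            self._spare_normal = r * math.sin(2.0 * math.pi * u2)
            return r * math.cos(2.0 * math.pi * u2)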
-- Robert Kern

Hello,
On Sat, Jan 27, 2018 at 09:28:54AM +0900, Robert Kern wrote:
On Sat, Jan 27, 2018 at 1:14 AM, Kevin Sheppard kevin.k.sheppard@gmail.com wrote:
In terms of what is needed, I think that the underlying PRNG should
be swappable. This will provide a simple mechanism to allow certain types of advancement while easily providing backward compatibility. In the current design this is very hard and requires compiling many nearly identical copies of RandomState. In pseudocode, something like:
standard_normal(prng)
where prng is a basic class that retains the PRNG state and has a
small set of core random number generators that belong to the underlying PRNG -- probably something like int32, int64, double, and possibly int53. I am not advocating explicitly passing the PRNG as an argument, but having generators which can take any suitable PRNG would add a lot of flexibility in terms of taking advantage of improvements in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024). The "small" core PRNG would have responsibility over state and streams. The remainder of the module would transform the underlying PRNG into the required distributions.

(edit: after writing the following verbiage, I realize it can be summed up with more respect to your suggestion: yes, we should do this design, but we don't need to and shouldn't give up on a class with distribution methods.)

Once the core PRNG C API is in place, I don't think we necessarily need to move away from a class structure per se, though it becomes an option.
(Sorry for cutting so much, I have a short question)
My typical use case for the C API of NumPy's random features is that I start coding in pure Python and then switch to Cython. I have at least twice in the past resorted to including "randomkit.h" and using it directly. My most recent project actually implements a Python/Cython interface for RNGs, see http://threefry.readthedocs.io/using_from_cython.html
The goal is to use exactly the same backend in Python and Cython, with a cimport and a few cdefs being the only changes needed for a first port to Cython.
Is this type of feature under discussion or planned for the future of numpy.random?
Regards,
Pierre

On Tue, Jan 30, 2018 at 5:39 AM, Pierre de Buyl <pierre.debuyl@chem.kuleuven.be> wrote:
Hello,
On Sat, Jan 27, 2018 at 09:28:54AM +0900, Robert Kern wrote:
On Sat, Jan 27, 2018 at 1:14 AM, Kevin Sheppard kevin.k.sheppard@gmail.com wrote:
In terms of what is needed, I think that the underlying PRNG should
be swappable. This will provide a simple mechanism to allow certain types of advancement while easily providing backward compatibility. In the current design this is very hard and requires compiling many nearly identical copies of RandomState. In pseudocode, something like:
standard_normal(prng)
where prng is a basic class that retains the PRNG state and has a
small set of core random number generators that belong to the underlying PRNG -- probably something like int32, int64, double, and possibly int53. I am not advocating explicitly passing the PRNG as an argument, but having generators which can take any suitable PRNG would add a lot of flexibility in terms of taking advantage of improvements in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024). The "small" core PRNG would have responsibility over state and streams. The remainder of the module would transform the underlying PRNG into the required distributions.

(edit: after writing the following verbiage, I realize it can be summed up with more respect to your suggestion: yes, we should do this design, but we don't need to and shouldn't give up on a class with distribution methods.)

Once the core PRNG C API is in place, I don't think we necessarily need to move away from a class structure per se, though it becomes an option.
(Sorry for cutting so much, I have a short question)
My typical use case for the C API of NumPy's random features is that I start coding in pure Python and then switch to Cython. I have at least twice in the past resorted to including "randomkit.h" and using it directly. My most recent project actually implements a Python/Cython interface for RNGs, see http://threefry.readthedocs.io/using_from_cython.html

The goal is to use exactly the same backend in Python and Cython, with a cimport and a few cdefs being the only changes needed for a first port to Cython.
Is this type of feature under discussion or planned for the future of numpy.random?
Sort of, but not really. For sure, once we've made the decisions that let us move forward to a new design, we'll have Cython implementations that can be used natively from Cython as well as Python without code changes. *But* it's not going to be an automatic speedup like your threefry library allows. You designed that API such that each of the methods returns a single scalar, so all you need to do is declare your functions `cpdef` and provide a `.pxd`. Our methods return either a scalar or an array depending on the arguments, so the methods will be declared to return `object`, and you will pay the overhead costs for checking the arguments and such. We're not going to change that Python API; we're only considering dropping stream-compatibility, not source-compatibility.
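For reference, that pattern looks something like this (hypothetical names, not your actual threefry code):

    # In rng.pxd:
    from libc.stdint cimport uint64_t

    cdef class ScalarRNG:
        cdef uint64_t state
        cpdef double random_double(self)

    # In rng.pyx:
    cdef class ScalarRNG:
        cpdef double random_double(self):
            # xorshift64 step (illustrative only; state must be nonzero),
            # then scale the top 53 bits to [0, 1).
            self.state ^= self.state << 13
            self.state ^= self.state >> 7
            self.state ^= self.state << 17
            return (self.state >> 11) * (1.0 / 9007199254740992.0)

    # A Cython caller cimports ScalarRNG and random_double() compiles down
    # to a direct C call; a Python caller imports the same class and goes
    # through the Python wrapper -- same backend either way.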
I would like to make sure that we do expose a C/Cython API to the distribution functions (i.e. functions that draw only a single variate and return a scalar), but it's not likely to look exactly like the Python API. There might be clever tricks that we can do to minimize the amount of change one needs to make, though, if you are only drawing single variates at a time (e.g. in an agent-based simulation) and you want to make it go faster by moving to Cython. For example, maybe we collect all of the single-variate C-implemented methods into a single object that sits as an attribute on the `Distributions` object.
cdef class DistributionsCAPI:
    cdef double normal(self, double loc, double scale)
    cdef double uniform(self, double low, double high)
import numpy as np

cdef class Distributions:
    cdef DistributionsCAPI capi

    cpdef object normal(self, loc, scale, size=None):
        if size is None and np.isscalar(loc) and np.isscalar(scale):
            return self.capi.normal(loc, scale)
        else:
            ...  # make and return an array of variates
prng = Distributions(...)
# From Python:
x = prng.normal(mean, std)
# From Cython:
cdef double x = prng.capi.normal(mean, std)
But we need to make some higher-level decisions first before we can get down to this level of design. Please do jump in and remind us of this use case once we do get down to actual work on the new API design. Thanks!
-- Robert Kern