I agree with pretty much everything you wrote Robert. I didn't have quote the right frame but the generic class that takes a low-level core PRNG sounds like the right design, and this should make user-generated distributions easier to develop. I was thinking along these lines inspired by the SpiPy changes that use a LowLevelCallable, e.g.,
This might also allow users to extend the core PRNGs using something like Numba JIT classes as an alternative.
Another area that needs though is how to correctly spawn in Multiprocess application. This might be most easily addressed by providing a guide on a good way rather than the arbitrary way used now.
On Sat, Jan 27, 2018 at 5:03 PM firstname.lastname@example.org wrote:
Send NumPy-Discussion mailing list submissions to email@example.com
To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/numpy-discussion or, via email, send a message with subject or body 'help' to firstname.lastname@example.org
You can reach the person managing the list at email@example.com
When replying, please edit your Subject line so it is more specific than "Re: Contents of NumPy-Discussion digest..."
- Re: Moving NumPy's PRNG Forward (Robert Kern)
- Re: Using np.frombuffer and cffi.buffer on array of C structs (problem with struct member padding) (Joe)
Message: 1 Date: Sat, 27 Jan 2018 09:28:54 +0900 From: Robert Kern firstname.lastname@example.org To: Discussion of Numerical Python email@example.com Subject: Re: [Numpy-discussion] Moving NumPy's PRNG Forward Message-ID: <CAF6FJitFLv3U7gHkYBCFW69A5BbVe=HzBAm5oxYVcXGbBdMU= firstname.lastname@example.org> Content-Type: text/plain; charset="utf-8"
On Sat, Jan 27, 2018 at 1:14 AM, Kevin Sheppard < email@example.com> wrote:
I am a firm believer that the current situation is not sustainable.
There are a lot of improvements that can practically be incorporated. While many of these are performance related, there are also improvements in accuracy over some ranges of parameters that cannot be incorporated. I also think that perfect stream reproducibility is a bit of a myth across versions since this really would require identical OS, compiler and possibly CPU for some of the generators that produce floats.
I believe there is a case for separating the random generator from core
NumPy. Some points that favor becoming a subproject:
- It is a pure consumer of NumPy API. Other parts of the API do no
depend on random.
- A stand alone package could be installed along side many different
version of core NumPy which would reduce the pressure on freezing the stream.
Removing numpy.random (or freezing it as deprecated legacy while all PRNG development moves elsewhere) is probably a non-starter. It's too used for us not to provide something. That said, we can (and ought to) make it much easier for external packages to provide PRNG capabilities (core PRNGs and distributions) that interoperate with the core functionality that numpy provides. I'm also happy to place a high barrier on adding more distributions to numpy.random once that is in place.
Specifically, core uniform PRNGs should have a small common C API that distribution functions can use. This might just be a struct with an opaque `void*` state pointer and then 2 function pointers for drawing a uint64 (whole range) and a double in [0,1) from the state. It's important to expose our core uniform PRNGs as a C API because there has been a desire to interoperate at that level, using the same PRNG state inside C or Fortran or GPU code. If that's in place, then people can write new efficient distribution functions in C that use this small C API agnostic to the core PRNG algorithm. It also makes it easy to implement new core PRNGs that the distribution functions provided by numpy.random can use.
In terms of what is needed, I think that the underlying PRNG should be
swappable. The will provide a simple mechanism to allow certain types of advancement while easily providing backward compat. In the current design this is very hard and requires compiling many nearly identical copies of RandomState. In pseudocode something like
where prng is a basic class that retains the PRNG state and has a small
set of core random number generators that belong to the underlying PRNG -- probably something like int32, int64, double, and possibly int53. I am not advocating explicitly passing the PRNG as an argument, but having generators which can take any suitable PRNG would add a lot of flexibility in terms of taking advantage of improvements in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024). The "small" core PRNG would have responsibility over state and streams. The remainder of the module would transform the underlying PRNG into the required distributions.
(edit: after writing the following verbiage, I realize it can be summed up with more respect to your suggestion: yes, we should do this design, but we don't need to and shouldn't give up on a class with distribution methods.)
Once the core PRNG C API is in place, I don't think we necessarily need to move away from a class structure per se, though it becomes an option. We just separate the core PRNG object from the distribution-providing class. We don't need to make copies of the distribution-providing class just to use a new core PRNG. I'm coming around to Nathaniel's suggestion for the constructor API (though not the distribution-versioning, for reasons I can get into later). We have a couple of core uniform PRNG classes like `MT19937` and `PCG128`. Those have a tiny API, and probably don't have a lot of unnecessary code clones between them. Their constructors can be different depending on the different ways they can be instantiated, depending on the PRNG's features. I'm not sure that they'll have any common methods besides `__getstate__/__setstate__` and probably a `copy()`. They will expose their C API as a Python-opaque attribute. They can have whatever algorithm-dependent methods they need (e.g. to support jumpahead). I might not even expose to Python the uint64 and U(0,1) double sampling methods, but maybe so.
Then we have a single `Distributions` class that provides all of the distributions that we want to support in numpy.random (i.e. what we currently have on `RandomState` and whatever passes our higher bar in the future). It takes one of the core PRNG instances as an argument to the constructor (nominally, at least; we can design factory functions to make this more convenient).
prng = Distributions(PCG128(seed)) x = prng.normal(mean, std)
If someone wants to write a WELL512 core PRNG, they can just implement that object and pass it to `Distributions()`. The `Distributions` code doesn't need to be copied, nor do we need to much around with `__new__` tricks in Cython.
Why do this instead of distribution functions? Well, I have a damn lot of code that is expecting an object with at least the broad outlines of the `RandomState` interface. I'm not going to update that code to use functions instead. And if I'm not, no one is. There isn't a good transition path since the PRNG object needs to thread through all of the code, whether it's my code that I'm writing greenfield or library code that I don't control. That said, people writing new distributions outside of numpy.random would be writing functions, not trying to add to the `Distributions` class, but that's fine.
It also allows the possibility for the `Distributions` class to be stateful if we want to do things like caching the next Box-Muller variate and not force that onto the core PRNG state like I currently do. Though I'd rather just drop Box-Muller, and that's not a common pattern outside of Box-Muller. But it's a possibility.
-- Robert Kern