Design feedback solicitation
Hi,

I am new to this list, so I will start with an introduction. My name is Oleksandr Pavlyk. I now work at Intel Corp. on the Intel Distribution for Python, and previously worked at Wolfram Research for 12 years.

My latest project was to write a mirror of numpy.random, named numpy.random_intel. The module uses MKL to sample from different distributions for efficiency. It supports different underlying algorithms for basic pseudo-random number generation, i.e. in addition to MT19937 it also provides SFMT19937, MT2203, etc. I recently published a blog post about it: https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-gener...

I originally attempted to simply replace numpy.random in the Intel Distribution for Python with the new module, but because the fixed-seed streams are not backwards compatible, this resulted in numerous test failures in numpy, scipy, pandas and other modules.

Unlike numpy.random, the new module generates a vector of random numbers at a time, which can be done faster than repeatedly generating the same number of variates one at a time.

The source code for the new module is not upstreamed yet, and this email is meant to solicit early community feedback to allow for faster acceptance of the proposed changes.

Thank you,
Oleksandr
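The vectorization point can be illustrated with numpy's existing API alone: drawing n variates in one call amortizes the per-call overhead that a scalar loop pays n times. A rough timing sketch (the exact numbers will vary by machine; this only demonstrates the shape of the comparison, not numpy.random_intel itself):

```python
import timeit

import numpy as np

rng = np.random.RandomState(1234)
n = 100_000

# One variate per call: the Python-level dispatch cost is paid n times.
t_scalar = timeit.timeit(
    lambda: [rng.standard_normal() for _ in range(n)], number=1)

# One vectorized call: set-up cost is paid once and the inner loop
# runs in the C implementation.
t_vector = timeit.timeit(
    lambda: rng.standard_normal(n), number=1)

print(f"scalar loop: {t_scalar:.4f}s, vector call: {t_vector:.4f}s")
```

On typical hardware the vector call is one to two orders of magnitude faster, which is the efficiency argument made above.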
On Fri, Jun 17, 2016 at 4:08 PM, Pavlyk, Oleksandr <oleksandr.pavlyk@intel.com> wrote:

> Hi,
>
> I am new to this list, so I will start with an introduction. My name is Oleksandr Pavlyk. I now work at Intel Corp. on the Intel Distribution for Python, and previously worked at Wolfram Research for 12 years. My latest project was to write a mirror of numpy.random, named numpy.random_intel. The module uses MKL to sample from different distributions for efficiency. It supports different underlying algorithms for basic pseudo-random number generation, i.e. in addition to MT19937 it also provides SFMT19937, MT2203, etc.
>
> I recently published a blog post about it:
> https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-gener...
>
> I originally attempted to simply replace numpy.random in the Intel Distribution for Python with the new module, but because the fixed-seed streams are not backwards compatible, this resulted in numerous test failures in numpy, scipy, pandas and other modules.
>
> Unlike numpy.random, the new module generates a vector of random numbers at a time, which can be done faster than repeatedly generating the same number of variates one at a time.
>
> The source code for the new module is not upstreamed yet, and this email is meant to solicit early community feedback to allow for faster acceptance of the proposed changes.

Cool! You can find pertinent discussion here:

https://github.com/numpy/numpy/issues/6967

And the current effort for adding new core PRNGs here:

https://github.com/bashtage/ng-numpy-randomstate

-- Robert Kern
On Fri, Jun 17, 2016 at 9:22 AM, Robert Kern <robert.kern@gmail.com> wrote:

I wonder if the easiest thing to do at this point might be to implement a new, redesigned random module and keep the old one around for backward compatibility? Not that that would make everything easy, but at least folks could choose to use the new functions for speed and versatility if they needed them. The current random module is pretty stable, so maintenance should not be too onerous.

Chuck
Hi Robert,

Thank you for the pointers.

I think numpy.random should have a mechanism to choose between methods for generating the underlying randomness dynamically, at run time, as well as an extensible framework where developers could add more methods. The default would be MT19937 for backwards compatibility. It is important to be able to do this at run time, as it would allow one to use different algorithms in different threads (like different members of the parallel Mersenne twister family of generators; see MT2203).

The framework should allow one to define randomness as a bit stream, a stream of fixed-size integers, or a stream of uniform reals (32 or 64 bits). This is a lot like MKL's abstract method for basic pseudo-random number generation: https://software.intel.com/en-us/node/590373

Each method should provide routines to sample from uniform distributions over reals (in floats and doubles), as well as over integers. All remaining non-uniform distributions build on top of these uniform streams.

I think it is pretty important to refactor numpy.random to allow the underlying generators to produce a given number of independent variates at a time. There could be convenience wrapper functions that return a single variate for backwards compatibility, but this change in design would allow for better efficiency, as sampling a vector of random variates at once is often faster than repeated sampling of one at a time due to set-up cost, vectorization, etc.

Finally, methods to sample a particular distribution should uniformly support a method keyword argument. Because method names vary from distribution to distribution, it should ideally be programmatically discoverable which methods are supported for a given distribution. For instance, the standard normal distribution could support method='Inversion', method='Box-Muller', method='Ziggurat', and method='Box-Muller-Marsaglia' (the one used in numpy.random right now), as well as a number of unnamed methods based on the transformed rejection method (see http://statistik.wu-wien.ac.at/anuran/ ).

It would also be good if one could dynamically register a new method to sample from a non-uniform distribution. This would allow one, for instance, to automatically add methods that sample certain non-uniform distributions by calling directly into MKL (or another library), when available, instead of building them from uniforms (which may remain a fall-through method).

The linked project is a good start, but the choice of the underlying algorithm needs to be made at run time, as far as I understood, and the only provided interface to query random variates is one at a time, just as is currently the case in numpy.random.

Oleksandr
On Fri, Jul 15, 2016 at 2:53 AM, Pavlyk, Oleksandr <oleksandr.pavlyk@intel.com> wrote:

> I think numpy.random should have a mechanism to choose between methods for generating the underlying randomness dynamically, at run time, as well as an extensible framework where developers could add more methods. The default would be MT19937 for backwards compatibility. It is important to be able to do this at run time, as it would allow one to use different algorithms in different threads (like different members of the parallel Mersenne twister family of generators; see MT2203).
>
> The framework should allow one to define randomness as a bit stream, a stream of fixed-size integers, or a stream of uniform reals (32 or 64 bits). This is a lot like MKL's abstract method for basic pseudo-random number generation.
>
> Each method should provide routines to sample from uniform distributions over reals (in floats and doubles), as well as over integers. All remaining non-uniform distributions build on top of these uniform streams.

ng-numpy-randomstate does all of these.

> I think it is pretty important to refactor numpy.random to allow the underlying generators to produce a given number of independent variates at a time.

The underlying C implementation is an implementation detail, so the refactoring that you suggest has no backwards compatibility constraints.

> Finally, methods to sample a particular distribution should uniformly support a method keyword argument.

That is one of the items under discussion. I personally prefer that one simply expose named methods for each different scheme (e.g. ziggurat_normal(), etc.).

> ... calling directly into MKL (or another library), when available, instead of building them from uniforms (which may remain a fall-through method).
>
> The linked project is a good start, but the choice of the underlying algorithm needs to be made at run time,

That's what happens. You instantiate the RandomState class that you want.

-- Robert Kern
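Run-time selection of the underlying algorithm, in the style described here, is ordinary object construction. A minimal sketch assuming a registry of generator classes; only numpy's own RandomState is real API in this snippet, and the commented-out entry marks where an alternative generator class from a package such as ng-numpy-randomstate would plug in (the exact import path is not asserted here):

```python
import numpy as np

# Map algorithm names to generator classes; the choice is made at
# run time by ordinary lookup and construction. Additional entries
# are placeholders for classes another package would provide.
GENERATORS = {
    "mt19937": np.random.RandomState,
    # "sfmt19937": <class from ng-numpy-randomstate>,  # hypothetical
}

def make_rng(name, seed=None):
    """Construct the generator selected by `name` at run time."""
    try:
        cls = GENERATORS[name]
    except KeyError:
        raise ValueError(
            f"unknown generator {name!r}; known: {sorted(GENERATORS)}")
    return cls(seed)

rng = make_rng("mt19937", seed=2016)
print(rng.standard_normal(5))
```

Different threads can each construct their own generator instance this way, which is the mechanism behind the per-thread MT2203 streams mentioned earlier in the thread.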
participants (3)

- Charles R Harris
- Pavlyk, Oleksandr
- Robert Kern