Re: [Numpy-discussion] SFMT (faster mersenne twister)

| Date: Fri, 05 Sep 2014 13:19:57 -0400 | From: Neal Becker <ndbecker2@gmail.com> | | I think it's somewhat debatable whether generating a different | sequence of random numbers counts as breaking backward | compatibility. Please don't ever, ever break the sequence of numpy's random numbers! Please! We have put a lot of effort into being able to reproduce our published work exactly, and all of that would be in vain if the sequence changes. See e.g.: http://journal.frontiersin.org/Journal/10.3389/fninf.2013.00044/full We'd be very happy to see additional number generators appear *alongside* the existing ones, though, particularly if there are faster or otherwise better ones! Jim Bednar -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

"James A. Bednar" <jbednar@inf.ed.ac.uk> wrote:
Please don't ever, ever break the sequence of numpy's random numbers! Please! We have put a lot of effort into being able to reproduce our published work exactly,
Jup, it cannot be understated how important this is for reproducibility of published research. Thus from a scientific standpoint it is important that random numbers are not random. Some might think that it's just important that they are as "random as possible", but reproducibility is just as essential to stochastic simulations. This is also why parallel random number generators and parallel stochastic algorithms are so hard to program, because the operating systems' scheduler can easily break the reproducibility. I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister). A particular candidate I think we should consider is the DCMT, which is exceptionally good for parallel algorithms (the DCMT code is now BSD licensed, it used to be LGPL). Because of the way randomkit it written, it is very easy to plug-in different generators. Sturla

In addition to issues with reproducibility, think of all of the unit tests that would break! On Sun, Sep 7, 2014 at 3:51 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
"James A. Bednar" <jbednar@inf.ed.ac.uk> wrote:
Please don't ever, ever break the sequence of numpy's random numbers! Please! We have put a lot of effort into being able to reproduce our published work exactly,
Jup, it cannot be understated how important this is for reproducibility of published research. Thus from a scientific standpoint it is important that random numbers are not random. Some might think that it's just important that they are as "random as possible", but reproducibility is just as essential to stochastic simulations. This is also why parallel random number generators and parallel stochastic algorithms are so hard to program, because the operating systems' scheduler can easily break the reproducibility. I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister). A particular candidate I think we should consider is the DCMT, which is exceptionally good for parallel algorithms (the DCMT code is now BSD licensed, it used to be LGPL). Because of the way randomkit it written, it is very easy to plug-in different generators.
Sturla
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

On Sun, Sep 7, 2014 at 4:23 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
Benjamin Root <ben.root@ou.edu> wrote:
In addition to issues with reproducibility, think of all of the unit tests that would break!
That is a reproducibility problem :)
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Related aside about reproducibility of random numbers: IMO: scipy.stats.distributions.rvs does not guarantee yet the values of the random numbers except for those that are directly produced by numpy. In contrast to numpy.random, scipy's distributions don't have unit tests for the specific values of the rvs, and the rvs code for specific distribution could still be improved in some cases, I guess Josef

I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister).
Why not do something like the C++11 <random>? In <random>, a "generator" is the engine producing randomness, and a "distribution" decides what is the type of outputs that you want. Here is the example on http://www.cplusplus.com/reference/random/ . std::default_random_engine generator; std::uniform_int_distribution<int> distribution(1,6); int dice_roll = distribution(generator); // generates number in the range 1..6 For convenience, you can bind the generator with the distribution (still from the web page above). auto dice = std::bind(distribution, generator); int wisdom = dice()+dice()+dice(); Here is how I propose to adapt this scheme to numpy. First, there would be a global generator defaulting to the current implementation of Mersene Twister. Calls to numpy's "RandomState", "seed", "get_state" and "set_state" would affect this global generator. All numpy functions associated to random number generation (i.e., everything listed on http://docs.scipy.org/doc/numpy/reference/routines.random.html except for "RandomState", "seed", "get_state" and "set_state") would accept the kwarg "generator", which defaults to the global generator (by default the current Mersene Twister). Now there could be other generator objects: the new Mersene Twister, some lightweight-but-worse generator, or some cryptographically-safe random generator. Each such generator would have "RandomState", "seed", "get_state" and "set_state" methods (except perhaps the criptographically-safe ones). When calling a numpy function with generator=my_generator, that function uses this generator instead the global one. Moreover, there would be be a function, say select_default_random_engine(generator), which changes the global generator to a user-specified one.
This is also why parallel random number generators and parallel stochastic algorithms are so hard to program, because the operating systems' scheduler can easily break the reproducibility.
What I propose would also simplify this: each thread can use its own independently-seeded generator. Timing is no longer a problem: as long as which-thread-does-what is not affected by the scheduler, the execution remains deterministic. On 09/07/2014 12:51 PM, Sturla Molden wrote:
"James A. Bednar" <jbednar@inf.ed.ac.uk> wrote:
Please don't ever, ever break the sequence of numpy's random numbers! Please! We have put a lot of effort into being able to reproduce our published work exactly, Jup, it cannot be understated how important this is for reproducibility of published research. Thus from a scientific standpoint it is important that random numbers are not random. Some might think that it's just important that they are as "random as possible", but reproducibility is just as essential to stochastic simulations. This is also why parallel random number generators and parallel stochastic algorithms are so hard to program, because the operating systems' scheduler can easily break the reproducibility. I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister). A particular candidate I think we should consider is the DCMT, which is exceptionally good for parallel algorithms (the DCMT code is now BSD licensed, it used to be LGPL). Because of the way randomkit it written, it is very easy to plug-in different generators.
Sturla

On 08.09.2014 19:05, Pierre-Andre Noel wrote:
I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister).
...
Here is how I propose to adapt this scheme to numpy. First, there would be a global generator defaulting to the current implementation of Mersene Twister. Calls to numpy's "RandomState", "seed", "get_state" and "set_state" would affect this global generator.
All numpy functions associated to random number generation (i.e., everything listed on http://docs.scipy.org/doc/numpy/reference/routines.random.html except for "RandomState", "seed", "get_state" and "set_state") would accept the kwarg "generator", which defaults to the global generator (by default the current Mersene Twister).
I don't think every function would need a generator argument, for real world applications it should be sufficient to have the state object carry which generator is used and maybe a switch to change the global one. But as already mentioned by Robert, we know what we can do, what is missing is someone writting the code.

Julian Taylor <jtaylor.debian@googlemail.com> wrote:
But as already mentioned by Robert, we know what we can do, what is missing is someone writting the code.
This is actually a part of NumPy I know in detail, so I will be able to contribute. Robert Kern's last post about objects like np.random.SFMT() working similar to RandomState should be doable and not break any backwards compatibility. Sturla

On Mon, Sep 8, 2014 at 6:05 PM, Pierre-Andre Noel <noel.pierre.andre@gmail.com> wrote:
I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister).
Why not do something like the C++11 <random>? In <random>, a "generator" is the engine producing randomness, and a "distribution" decides what is the type of outputs that you want. Here is the example on http://www.cplusplus.com/reference/random/ .
std::default_random_engine generator; std::uniform_int_distribution<int> distribution(1,6); int dice_roll = distribution(generator); // generates number in the range 1..6
For convenience, you can bind the generator with the distribution (still from the web page above).
auto dice = std::bind(distribution, generator); int wisdom = dice()+dice()+dice();
Here is how I propose to adapt this scheme to numpy. First, there would be a global generator defaulting to the current implementation of Mersene Twister. Calls to numpy's "RandomState", "seed", "get_state" and "set_state" would affect this global generator.
All numpy functions associated to random number generation (i.e., everything listed on http://docs.scipy.org/doc/numpy/reference/routines.random.html except for "RandomState", "seed", "get_state" and "set_state") would accept the kwarg "generator", which defaults to the global generator (by default the current Mersene Twister).
Now there could be other generator objects: the new Mersene Twister, some lightweight-but-worse generator, or some cryptographically-safe random generator. Each such generator would have "RandomState", "seed", "get_state" and "set_state" methods (except perhaps the criptographically-safe ones). When calling a numpy function with generator=my_generator, that function uses this generator instead the global one. Moreover, there would be be a function, say select_default_random_engine(generator), which changes the global generator to a user-specified one.
I think the Python standard library's example is more instructive. We have new classes for each new core uniform generator. They will share a common superclass to share the implementation of the non-uniform distributions. numpy.random.RandomState will continue to be the current Mersenne Twister implementation, and so will the underlying global RandomState for all of the convenience functions in numpy.random. If you want the SFMT variant, you instantiate numpy.random.SFMT() and call its methods directly. -- Robert Kern

On 8 Sep 2014 14:43, "Robert Kern" <robert.kern@gmail.com> wrote:
On Mon, Sep 8, 2014 at 6:05 PM, Pierre-Andre Noel <noel.pierre.andre@gmail.com> wrote:
I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister).
Why not do something like the C++11 <random>? In <random>, a "generator" is the engine producing randomness, and a "distribution" decides what is the type of outputs that you want. Here is the example on http://www.cplusplus.com/reference/random/ .
std::default_random_engine generator; std::uniform_int_distribution<int> distribution(1,6); int dice_roll = distribution(generator); // generates number in the range 1..6
For convenience, you can bind the generator with the distribution (still from the web page above).
auto dice = std::bind(distribution, generator); int wisdom = dice()+dice()+dice();
Here is how I propose to adapt this scheme to numpy. First, there would be a global generator defaulting to the current implementation of Mersene Twister. Calls to numpy's "RandomState", "seed", "get_state" and "set_state" would affect this global generator.
All numpy functions associated to random number generation (i.e., everything listed on http://docs.scipy.org/doc/numpy/reference/routines.random.html except for "RandomState", "seed", "get_state" and "set_state") would accept the kwarg "generator", which defaults to the global generator (by default the current Mersene Twister).
Now there could be other generator objects: the new Mersene Twister, some lightweight-but-worse generator, or some cryptographically-safe random generator. Each such generator would have "RandomState", "seed", "get_state" and "set_state" methods (except perhaps the criptographically-safe ones). When calling a numpy function with generator=my_generator, that function uses this generator instead the global one. Moreover, there would be be a function, say select_default_random_engine(generator), which changes the global generator to a user-specified one.
I think the Python standard library's example is more instructive. We have new classes for each new core uniform generator. They will share a common superclass to share the implementation of the non-uniform distributions. numpy.random.RandomState will continue to be the current Mersenne Twister implementation, and so will the underlying global RandomState for all of the convenience functions in numpy.random. If you want the SFMT variant, you instantiate numpy.random.SFMT() and call its methods directly.
There's also another reason why generator decisions should be part of the RandomState object itself: we may want to change the distribution methods themselves over time (e.g., people have been complaining for a while that we use a suboptimal method for generating gaussian deviates), but changing these creates similar backcompat problems. So we need a way to say "please give me samples using the old gaussian implementation" or whatever. -n

On 09/09/14 20:08, Nathaniel Smith wrote:
There's also another reason why generator decisions should be part of the RandomState object itself: we may want to change the distribution methods themselves over time (e.g., people have been complaining for a while that we use a suboptimal method for generating gaussian deviates), but changing these creates similar backcompat problems.
Which one should we rather use? Ziggurat? Sturla

Pierre-Andre Noel <noel.pierre.andre@gmail.com> wrote:
Why not do something like the C++11 <random>? In <random>, a "generator" is the engine producing randomness, and a "distribution" decides what is the type of outputs that you want.
This is what randomkit is doing internally, which is why it is so easy to plug in a different generator. Sturla
participants (8)
-
Benjamin Root
-
James A. Bednar
-
josef.pktd@gmail.com
-
Julian Taylor
-
Nathaniel Smith
-
Pierre-Andre Noel
-
Robert Kern
-
Sturla Molden