Mailman 3 Backwards-incompatible improvements to numpy.random.RandomState - NumPy-Discussion

Backwards-incompatible improvements to numpy.random.RandomState

Antony Lee

May 24, 2015

1:22 a.m.

Hi,

As mentioned in

#1450: Patch with Ziggurat method for Normal distribution
#5158: ENH: More efficient algorithm for unweighted random choice without replacement
#5299: using `random.choice` to sample integers in a large range
#5851: Bug in np.random.dirichlet for small alpha parameters

some methods on np.random.RandomState are implemented either non-optimally (#1450, #5158, #5299) or have outright bugs (#5851), but cannot be easily changed due to backwards compatibility concerns. While some have suggested new methods deprecating the old ones (see e.g. #5872), some consensus has formed around the following ideas (see #5299 for original discussion, followed by private discussions with @njsmith):

- Backwards compatibility should only be provided to those who were explicitly instantiating a seeded RandomState object or reseeding a RandomState object to a given value, and drawing variates from it: using the global methods (or a None-seeded RandomState) was already non-reproducible anyways as e.g. other libraries could be drawing variates from the global RandomState (of which the free functions in np.random are actually methods). Thus, the global RandomState object should use the latest implementation of the methods.

- "RandomState(seed)" and "r = RandomState(...); r.seed(seed)" should offer backwards-compatibility guarantees (see e.g. https://docs.python.org/3.4/library/random.html#notes-on-reproducibility).

As such, we propose the following improvements to the API:

- RandomState gains a (keyword-only) parameter, "version", also accessible as a read-only attribute. This indicates the version of the methods on the object. The current version of RandomState is retroactively assigned version 0. The latest available version is available as np.random.LATEST_VERSION. Backwards-incompatible improvements to RandomState methods can be introduced, but each one increases LATEST_VERSION.

- The global RandomState is instantiated as RandomState(version=LATEST_VERSION).

- RandomState() and rs.seed() set the version to LATEST_VERSION.

- RandomState(seed[!=None]) and rs.seed(seed[!=None]) set the version to 0.

A proof-of-concept implementation, still missing tests, is tracked as #5911. It includes the patch proposed in #5158 as an example of how to include an improved version of random.choice.

Comments, and help for writing tests (in particular to make sure backwards compatibility is maintained) are welcome.

Antony Lee
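The version-selection rules in the proposal can be modelled in a few lines of plain Python. This is only an illustrative sketch, not the implementation tracked in #5911: the class name `VersionedState`, the helper `resolve_version`, and the value given to `LATEST_VERSION` are assumptions made for the example, and the standard-library `random.Random` stands in for the real generator.

```python
import random

LATEST_VERSION = 1  # assumed value; the proposal only names the constant


def resolve_version(seed=None, version=None):
    """Return the method version implied by the proposed rules."""
    if version is not None:
        if not 0 <= version <= LATEST_VERSION:
            raise ValueError("unknown version: %r" % (version,))
        return version
    # Unseeded states get the newest methods; seeded ones stay on version 0.
    return LATEST_VERSION if seed is None else 0


class VersionedState:
    """Toy stand-in for RandomState with a keyword-only ``version``."""

    def __init__(self, seed=None, *, version=None):
        self._rng = random.Random()
        self.seed(seed, version=version)

    @property
    def version(self):
        # Read-only attribute: the version only changes through seed().
        return self._version

    def seed(self, seed=None, *, version=None):
        self._version = resolve_version(seed, version)
        self._rng.seed(seed)


print(VersionedState().version)                 # LATEST_VERSION
print(VersionedState(1234).version)             # 0
print(VersionedState(1234, version=1).version)  # 1
```

Keeping the rule in one small function makes it easy to see that an explicit `version=` always wins, and that only the seeded-without-version case falls back to 0.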


Ralf Gommers

May 2015

1:59 a.m.

On Sun, May 24, 2015 at 10:22 AM, Antony Lee <antony.lee@berkeley.edu> wrote:

...

Hi,

As mentioned in

#1450: Patch with Ziggurat method for Normal distribution
#5158: ENH: More efficient algorithm for unweighted random choice without replacement
#5299: using `random.choice` to sample integers in a large range
#5851: Bug in np.random.dirichlet for small alpha parameters

some methods on np.random.RandomState are implemented either non-optimally (#1450, #5158, #5299) or have outright bugs (#5851), but cannot be easily changed due to backwards compatibility concerns. While some have suggested new methods deprecating the old ones (see e.g. #5872), some consensus has formed around the following ideas (see #5299 for original discussion, followed by private discussions with @njsmith):

- Backwards compatibility should only be provided to those who were explicitly instantiating a seeded RandomState object or reseeding a RandomState object to a given value, and drawing variates from it: using the global methods (or a None-seeded RandomState) was already non-reproducible anyways as e.g. other libraries could be drawing variates from the global RandomState (of which the free functions in np.random are actually methods). Thus, the global RandomState object should use the latest implementation of the methods.

The rest of the proposal looks good to me, but the reasoning on this point is shaky. np.random.seed() is *very* widely used, and works fine for a test suite where each test that needs random numbers calls seed(...) and is run with nose. Can you explain why you need to touch the behavior of the global methods in order to make RandomState(version=) work?

Ralf

...

- "RandomState(seed)" and "r = RandomState(...); r.seed(seed)" should offer backwards-compatibility guarantees (see e.g. https://docs.python.org/3.4/library/random.html#notes-on-reproducibility).

As such, we propose the following improvements to the API:

- RandomState gains a (keyword-only) parameter, "version", also accessible as a read-only attribute. This indicates the version of the methods on the object. The current version of RandomState is retroactively assigned version 0. The latest available version is available as np.random.LATEST_VERSION. Backwards-incompatible improvements to RandomState methods can be introduced but increase the LATEST_VERSION.

- The global RandomState is instantiated as RandomState(version=LATEST_VERSION).

- RandomState() and rs.seed() sets the version to LATEST_VERSION.

- RandomState(seed[!=None]) and rs.seed(seed[!=None]) sets the version to 0.

A proof-of-concept implementation, still missing tests, is tracked as #5911. It includes the patch proposed in #5158 as an example of how to include an improved version of random.choice.

Comments, and help for writing tests (in particular to make sure backwards compatibility is maintained) are welcome.

Antony Lee

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Nathaniel Smith

2:30 a.m.

On May 24, 2015 2:03 AM, "Ralf Gommers" <ralf.gommers@gmail.com> wrote:


The rest of the proposal looks good to me, but the reasoning on this point is shaky. np.random.seed() is *very* widely used, and works fine for a test suite where each test that needs random numbers calls seed(...) and is run with nose. Can you explain why you need to touch the behavior of the global methods in order to make RandomState(version=) work?

You're absolutely right about it being important to preserve the behavior of the global functions when seeded, but I think this is just a bug in the description of the proposal here, not in the proposal itself :-). If you look at the PR, there's no change to how the global functions work -- they're still just a transparently thin wrapper around a hidden, global RandomState object, and thus IIUC changes to RandomState will automatically apply to the global functions as well.

So with this proposal, an unseeded RandomState uses the latest version -> therefore the global functions, which start out unseeded, start out using the latest version. If you call .seed() on an existing RandomState object and pass in a seed but no version= argument, the version gets reset to 0 -> therefore if you call the global seed() function and pass in a seed but no version= argument, the global RandomState gets reset to version 0 (at least until the next time seed() is called), and backcompat is preserved.

-n

Ralf Gommers

2:54 a.m.

On Sun, May 24, 2015 at 11:30 AM, Nathaniel Smith <njs@pobox.com> wrote:

...

So with this proposal, an unseeded RandomState uses the latest version -> therefore the global functions, which start out unseeded, start out using the latest version. If you call .seed() on an existing RandomState object and pass in a seed but no version= argument, the version gets reset to 0 -> therefore if you call the global seed() function and pass in a seed but no version= argument, the global RandomState gets reset to version 0 (at least until the next time seed() is called), and backcompat is preserved.


Thanks for the clarification. Then +1 from me for this proposal.

Ralf

Alan G Isaac

5:41 a.m.

I echo Ralf's question. For those who need replicability, the proposed upgrade path seems quite radical.

Also, I would prefer to have the new functionality introduced beside the existing implementation of RandomState, with an announcement that RandomState will change in the next major numpy version number. This will allow everyone who wants to switch now to do so, without requiring that users attend to minor numpy version numbers if they want replicability.

I think this is what is required by semantic versioning.

Alan Isaac

On 5/24/2015 4:59 AM, Ralf Gommers wrote:

...

the reasoning on this point is shaky. np.random.seed() is *very* widely used, and works fine for a test suite where each test that needs random numbers calls seed(...) and is run with nose. Can you explain why you need to touch the behavior of the global methods in order to make RandomState(version=) work?

Ralf Gommers

5:47 a.m.

On Sun, May 24, 2015 at 2:41 PM, Alan G Isaac <alan.isaac@gmail.com> wrote:

...

I echo Ralf's question. For those who need replicability, the proposed upgrade path seems quite radical.

It's not radical, and my question was already answered. Nothing changes if you are doing:

    np.random.seed(1234)
    np.random.any_random_sample_generator_func()

Values only change if you leave out the call to seed(), which you should never do if you care about replicability.

Ralf



Alan G Isaac

6:08 a.m.

On 5/24/2015 8:47 AM, Ralf Gommers wrote:

...

Values only change if you leave out the call to seed()

OK, but this claim seems to conflict with the following language: "the global RandomState object should use the latest implementation of the methods". I take it that this is what Nathan meant by "I think this is just a bug in the description of the proposal here, not in the proposal itself".

So, is the correct phrasing "the global RandomState object should use the latest implementation of the methods, unless explicitly seeded"?

Thanks, Alan

josef.pktd＠gmail.com

8:04 a.m.

On Sun, May 24, 2015 at 9:08 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:

...

On 5/24/2015 8:47 AM, Ralf Gommers wrote:

...
Values only change if you leave out the call to seed()

OK, but this claim seems to conflict with the following language: "the global RandomState object should use the latest implementation of the methods". I take it that this is what Nathan meant by "I think this is just a bug in the description of the proposal here, not in the proposal itself".

So, is the correct phrasing "the global RandomState object should use the latest implementation of the methods, unless explicitly seeded"?

that's how I understand it.

I don't see any problems with the clarified proposal for the use cases that I know of.

Can we choose the version also for the global random state, for example to fix both version and seed in unit tests, with version > 0?

BTW: I would expect that bug fixes are still exempt from backwards compatibility.

fixing #5851 should be independent of the version, (without having looked at the issue)

(If you need to replicate bugs, then use an old version of a package.)

Josef


Anne Archibald

8:13 a.m.

Do we want a deprecation-like approach, so that eventually people who want replicability will specify versions, and everyone else gets bug fixes and improvements? This would presumably take several major versions, but it might avoid people getting unintentionally trapped on this version.

Incidentally, bug fixes are complicated: if a bug fix uses more or fewer raw random numbers, it breaks repeatability not just for the call that got fixed but for all successive random number generations.

Anne

On Sun, May 24, 2015 at 5:04 PM <josef.pktd@gmail.com> wrote:

...

On Sun, May 24, 2015 at 9:08 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:

...
On 5/24/2015 8:47 AM, Ralf Gommers wrote:

...
Values only change if you leave out the call to seed()

OK, but this claim seems to conflict with the following language: "the global RandomState object should use the latest implementation of the methods". I take it that this is what Nathan meant by "I think this is just a bug in the description of the proposal here, not in the proposal itself".

So, is the correct phrasing "the global RandomState object should use the latest implementation of the methods, unless explicitly seeded"?

that's how I understand it.

I don't see any problems with the clarified proposal for the use cases that I know of.

Can we choose the version also for the global random state, for example to fix both version and seed in unit tests, with version > 0?

BTW: I would expect that bug fixes are still exempt from backwards compatibility.

fixing #5851 should be independent of the version, (without having looked at the issue)

(If you need to replicate bugs, then use an old version of a package.)

Josef


josef.pktd＠gmail.com

8:40 a.m.

On Sun, May 24, 2015 at 11:13 AM, Anne Archibald <archibald@astron.nl> wrote:

...

Do we want a deprecation-like approach, so that eventually people who want replicability will specify versions, and everyone else gets bug fixes and improvements? This would presumably take several major versions, but it might avoid people getting unintentionally trapped on this version.

Incidentally, bug fixes are complicated: if a bug fix uses more or fewer raw random numbers, it breaks repeatability not just for the call that got fixed but for all successive random number generations.

Reminder: we are bottom or inline posting

...

Anne

On Sun, May 24, 2015 at 5:04 PM <josef.pktd@gmail.com> wrote:

...
On Sun, May 24, 2015 at 9:08 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:

...
On 5/24/2015 8:47 AM, Ralf Gommers wrote:

...
Values only change if you leave out the call to seed()

OK, but this claim seems to conflict with the following language: "the global RandomState object should use the latest implementation of the methods". I take it that this is what Nathan meant by "I think this is just a bug in the description of the proposal here, not in the proposal itself".

So, is the correct phrasing "the global RandomState object should use the latest implementation of the methods, unless explicitly seeded"?

that's how I understand it.

I don't see any problems with the clarified proposal for the use cases that I know of.

Can we choose the version also for the global random state, for example to fix both version and seed in unit tests, with version > 0?

BTW: I would expect that bug fixes are still exempt from backwards compatibility.

fixing #5851 should be independent of the version, (without having looked at the issue)

I skimmed the issue. In a strict sense it's not really a bug: the user doesn't get wrong numbers, he or she gets Not A Number. So there are no current usages that use the function in that range.

Josef

...

...
(If you need to replicate bugs, then use an old version of a package.)

Josef


Nathaniel Smith

10:49 a.m.

On May 24, 2015 8:43 AM, <josef.pktd@gmail.com> wrote:

...

Reminder: we are bottom or inline posting

Can we stop hassling people about this? Inline replies are a great tool to have in your toolkit for complicated technical discussions, but I feel like our weird insistence on them has turned into a pointless and exclusionary thing. It's not like bottom replying is even any better -- the traditional mailing list rule is you trim quotes to just the part you're replying to (like this message); quoting the whole thing and replying underneath just to give people a bit of exercise for their scrolling finger would totally have gotten you flamed too.

But email etiquette has moved on since the 90s, even regular posters to this list violate this "rule" all the time, it's time to let it go.

-n

josef.pktd＠gmail.com

11:01 a.m.

On Sun, May 24, 2015 at 1:49 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

On May 24, 2015 8:43 AM, <josef.pktd@gmail.com> wrote:

...
Reminder: we are bottom or inline posting

Can we stop hassling people about this? Inline replies are a great tool to have in your toolkit for complicated technical discussions, but I feel like our weird insistence on them has turned into a pointless and exclusionary thing. It's not like bottom replying is even any better -- the traditional mailing list rule is you trim quotes to just the part you're replying to (like this message); quoting the whole thing and replying underneath just to give people a bit of exercise for their scrolling finger would totally have gotten you flamed too.

But email etiquette has moved on since the 90s, even regular posters to this list violate this "rule" all the time, it's time to let it go.

It's not a 90's thing and I learned about it around 2009 when I started in here.

I find it very annoying trying to catch up with a longer thread and the replies are all over the place.

Anne is a few years older than I in terms of numpy and scipy participation and this was just intended to be a friendly reminder.

And as BTW: I'm glad Anne is back with scipy.

Josef


Nathaniel Smith

1:39 p.m.

On May 24, 2015 11:04 AM, <josef.pktd@gmail.com> wrote:

...

It's not a 90's thing and I learned about it around 2009 when I started in here.

...

I find it very annoying trying to catch up with a longer thread and the replies are all over the place.

Anne is a few years older than I in terms of numpy and scipy participation and this was just intended to be a friendly reminder.

And while I know you didn't mean it this way, I'm guessing that being immediately greeted by criticism for failing to follow some arbitrary and inconsistently-applied rule was indeed a strong reminder of what an unpleasant place FOSS mailing lists can sometimes be, and why someone might disappear from them for a few years. I think we can do better.

This is pretty off-topic for this thread, though, so let's let it lie here. If anyone desperately needs to comment further please email me off-list.

-n

Nathaniel Smith

11:04 a.m.

On May 24, 2015 8:15 AM, "Anne Archibald" <archibald@astron.nl> wrote:

...

Do we want a deprecation-like approach, so that eventually people who want replicability will specify versions, and everyone else gets bug fixes and improvements? This would presumably take several major versions, but it might avoid people getting unintentionally trapped on this version.

I'm not sure what you're envisioning as needing a deprecation cycle? The neat thing about random is that we already have a way for users to say that they want replicability -- the use of an explicit seed -- so we can just immediately go to the world you describe, where people who seed get to pick their version (or default to version 0 for backcompat), and everyone else gets the improvements automatically. Or is this different from what you meant somehow?

Fortunately we haven't yet run into any really serious bugs in random, like "oops we're sampling from the wrong distribution" type bugs. Mostly it's more like "oops this is really inefficient" or "oops this crashes in this edge case", so there's no real harm in letting people use old versions. If we did run into a case where we were giving flat out wrong results, then I guess we'd still want to keep the code around because reproducibility is still important, but perhaps with a requirement that you pass an extra argument like I_know_its_broken=True or something so that people couldn't end up running the broken code accidentally? I guess we'll cross that bridge when we come to it.

...

Incidentally, bug fixes are complicated: if a bug fix uses more or fewer raw random numbers, it breaks repeatability not just for the call that got fixed but for all successive random number generations.

Yep. This is why we mostly haven't been able to change behavior at *all* except in cases where there was a clear error so we know no-one was using something.

-n

Sturla Molden

11:56 a.m.

On 24/05/15 20:04, Nathaniel Smith wrote:

...

I'm not sure what you're envisioning as needing a deprecation cycle? The neat thing about random is that we already have a way for users to say that they want replicability -- the use of an explicit seed --

No, this is not sufficient for random numbers. Random sampling and ziggurat generators are examples. If we introduce a change (e.g. a bugfix) that will affect the number of calls to the entropy source, just setting the seed will in general not be enough to ensure backwards compatibility.

That is e.g. the case with using ziggurat samplers instead of the current transcendental transforms for normal, exponential and gamma distributions. While ziggurat is faster and (to my knowledge) more accurate, it will also make a different number of calls to the entropy source, and hence the whole sequence will be affected, even if you do set a random seed.

Sturla
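The effect described here, where a sampler that consumes a different number of raw values shifts everything drawn afterwards, is easy to reproduce with the standard library. In this sketch `sampler_v0`, `sampler_v1`, and `next_raw_after` are hypothetical names, and the transforms inside the samplers are placeholders rather than real normal samplers; only the number of raw draws matters.

```python
import random


def sampler_v0(rng):
    """Hypothetical old method: consumes one raw number per variate."""
    return rng.random() - 0.5  # placeholder transform


def sampler_v1(rng):
    """Hypothetical new method: consumes two raw numbers per variate."""
    return rng.random() - rng.random()  # placeholder transform


def next_raw_after(sampler, seed=1234):
    """Seed a generator, call the sampler once, return the next raw draw."""
    rng = random.Random(seed)
    sampler(rng)         # the call whose implementation changed
    return rng.random()  # an unrelated draw made afterwards


old_stream = next_raw_after(sampler_v0)
new_stream = next_raw_after(sampler_v1)
print(old_stream == new_stream)  # False: the later draw moved in the stream
```

With the same seed, the draw that follows the sampler call is the second raw value under `sampler_v0` and the third under `sampler_v1`, so seeding alone does not preserve results once a method changes how much of the stream it consumes.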

Robert Kern

12:25 p.m.

On Sun, May 24, 2015 at 7:56 PM, Sturla Molden <sturla.molden@gmail.com> wrote:

...

On 24/05/15 20:04, Nathaniel Smith wrote:

...
I'm not sure what you're envisioning as needing a deprecation cycle? The neat thing about random is that we already have a way for users to say that they want replicability -- the use of an explicit seed --

No, this is not sufficient for random numbers. Random sampling and ziggurat generators are examples. If we introduce a change (e.g. a bugfix) that will affect the number of calls to the entropy source, just setting the seed will in general not be enough to ensure backwards compatibility. That is e.g. the case with using ziggurat samplers instead of the current transcendental transforms for normal, exponential and gamma distributions. While ziggurat is faster (and to my knowledge) more accurate, it will also make a different number of calls to the entropy source, and hence the whole sequence will be affected, even if you do set a random seed.

Please reread the proposal at the top of the thread. -- Robert Kern

Antony Lee

1:15 p.m.

Thanks to Nathaniel who has indeed clarified my intent, i.e. "the global RandomState should use the latest implementation, unless explicitly seeded". More generally, the `RandomState` constructor is just a thin wrapper around `seed` with the same signature, so one can swap the version of the global functions with a call to `np.random.seed(version=...)`.
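Python's standard library offers a small precedent for a `version` argument on `seed`: `random.seed(a, version=2)` accepts `version=1` to replay sequences from older Python releases when the seed is a `str` or `bytes`. The sketch below uses only that documented API; the helper `first_draws` is a name made up for illustration, and none of this is NumPy code.

```python
import random


def first_draws(seed, version, n=3):
    """Seed a fresh generator with the given version and take n draws."""
    rng = random.Random()
    rng.seed(seed, version=version)
    return [rng.random() for _ in range(n)]


# The same seed and version always replay the same sequence.
print(first_draws("numpy", 1) == first_draws("numpy", 1))  # True
print(first_draws("numpy", 2) == first_draws("numpy", 2))  # True

# For str seeds, versions 1 and 2 derive the internal state differently,
# so switching version changes the sequence even though the seed is the same.
print(first_draws("numpy", 1) == first_draws("numpy", 2))
```

This is the same shape as the `np.random.seed(version=...)` call suggested in the thread: the seed selects the stream, and the version selects which algorithm interprets it.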

Sturla Molden

11:46 a.m.

On 24/05/15 17:13, Anne Archibald wrote:

...

Do we want a deprecation-like approach, so that eventually people who want replicability will specify versions, and everyone else gets bug fixes and improvements? This would presumably take several major versions, but it might avoid people getting unintentionally trapped on this version.

Incidentally, bug fixes are complicated: if a bug fix uses more or fewer raw random numbers, it breaks repeatability not just for the call that got fixed but for all successive random number generations.

If a function has a bug, changing it will change the output of the function. This is not special for random numbers. If not retaining the old erroneous output means we break backwards compatibility, then no bugs can ever be fixed, anywhere in NumPy. I think we need to clarify what we mean by backwards compatibility for random numbers. What guarantees should we make from one version to another?

Sturla

Robert Kern

12:22 p.m.

On Sun, May 24, 2015 at 7:46 PM, Sturla Molden <sturla.molden@gmail.com> wrote:

...

On 24/05/15 17:13, Anne Archibald wrote:

...
Do we want a deprecation-like approach, so that eventually people who want replicability will specify versions, and everyone else gets bug fixes and improvements? This would presumably take several major versions, but it might avoid people getting unintentionally trapped on this version.

Incidentally, bug fixes are complicated: if a bug fix uses more or fewer raw random numbers, it breaks repeatability not just for the call that got fixed but for all successive random number generations.

If a function has a bug, changing it will change the output of the function. This is not special for random numbers. If not retaining the old erroneous output means we break backwards compatibility, then no bugs can ever be fixed, anywhere in NumPy. I think we need to clarify what we mean by backwards compatibility for random numbers. What guarantees should we make from one version to another?

The policy thus far has been that we will fix bugs in the distributions and make changes that allow a strictly wider domain of distribution parameters (e.g. allowing b==0 where before we only allowed b>0), but we will not make other enhancements that would change existing good output.

-- Robert Kern

Sturla Molden

1:30 p.m.

On 24/05/15 10:22, Antony Lee wrote:

...

Comments, and help for writing tests (in particular to make sure backwards compatibility is maintained) are welcome.

I have one comment, and that is: what makes random numbers so special? This applies to the rest of NumPy too: fixing a bug can sometimes change the output of a function.

Personally I think we should only make guarantees about the data types, array shapes, and things like that, but not about the values. Those who need a particular version of NumPy for exact reproducibility should install the version of Python and NumPy they need. That is why virtual environments exist.

I am sure a lot will disagree with me on this. So please don't take this as flamebait.

Sturla

Antony Lee

2:09 p.m.

2015-05-24 13:30 GMT-07:00 Sturla Molden <sturla.molden@gmail.com>:

...


I personally agree with this point of view (see the original discussion in #5299, for example); if it were only up to me, I'd at least make RandomState(seed) default to the latest version rather than the original one (whether to keep the old versions around is another question). On the other hand, I see that this long-standing debate has prevented obvious improvements from being added, sometimes for years (e.g. a patch for Ziggurat normal variates has been lying around since 2010), or led to potential API duplication in order to fix some clearly undesirable behavior (dirichlet returning "nan" being described as "in a strict sense not really a bug"(!)), so I'm willing to compromise to get this moving forward.

Antony

josef.pktd＠gmail.com

2:49 p.m.

On Sun, May 24, 2015 at 5:09 PM, Antony Lee <antony.lee@berkeley.edu> wrote:

...


It's clearly a different kind of "bug" than some of the ones we fixed in the past without a backwards compatibility discussion, where the distribution was wrong, i.e. some values were shifted so that parts had more weight and parts had less weight. As I mentioned, I don't see any real problem with the proposal.

Josef


Daπid

4:14 a.m.

On 24 May 2015 at 22:30, Sturla Molden <sturla.molden@gmail.com> wrote:

...

Personally I think we should only make guarantees about the data types, array shapes, and things like that, but not about the values. Those who need a particular version of NumPy for exact reproducibility should install the version of Python and NumPy they need. That is why virtual environments exist.

But there is a lot of legacy code out there that doesn't specify the version required, and in most cases the original author cannot even be asked.

Tests are a particularly annoying case. For example, when testing an algorithm, it is usually good practice to record the number of iterations as well as the result; consider it an early warning that we have changed something we possibly didn't mean to, even if the result is correct. If we want to support several NumPy versions, and the algorithm has any randomness, the tests would have to be duplicated, or we would have to find a seed that gives the exact same results under every version. Thus, keeping different versions lets us compare the results against the old API without needing to duplicate the tests. A lot fewer people will get annoyed.

/David.
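As a rough sketch of how a version pin could keep such a seeded test passing, consider the toy class below. `VersionedSampler`, `pick`, and this `LATEST_VERSION` are hypothetical names made up for illustration (loosely modelled on the keyword-only `version` parameter from the proposal), and the standard-library `random.Random` stands in for RandomState; none of this is NumPy's actual API.

```python
import random

LATEST_VERSION = 1  # hypothetical; mirrors the proposed np.random.LATEST_VERSION

class VersionedSampler:
    """Toy stand-in for a versioned RandomState (not NumPy's API)."""

    def __init__(self, seed, *, version=0):
        if not 0 <= version <= LATEST_VERSION:
            raise ValueError(f"unknown version {version}")
        self._rng = random.Random(seed)
        self._version = version

    @property
    def version(self):
        # Read-only, as in the proposal.
        return self._version

    def pick(self, n):
        """Return an index in range(n)."""
        if self._version == 0:
            # "Old" algorithm, kept unchanged so old seeds reproduce.
            return int(self._rng.random() * n)
        # "New" algorithm: free to consume the stream differently.
        return self._rng.randrange(n)

# A legacy test pins version=0 and keeps comparing against the value
# that the old algorithm produces for this seed.
expected = int(random.Random(42).random() * 10)
legacy = VersionedSampler(42, version=0)
print(legacy.pick(10) == expected)
```

With this shape, a test written against the old stream only needs the explicit `version=0` pin; it does not have to be duplicated for each new implementation of `pick`.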

Antony Lee

2:06 p.m.

...

A proof-of-concept implementation, still missing tests, is tracked as #5911. It includes the patch proposed in #5158 as an example of how to include an improved version of random.choice.

Tests are in now. Whether we should also bundle pickles of old versions (to make sure they are still unpickled correctly) and outputs of old random streams (to make sure they are still reproduced) is a good question, though. Comments welcome.

Antony
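For illustration only, the kind of check meant by "bundled pickles and recorded streams" might look like the sketch below. It uses the standard-library `random.Random` rather than RandomState, and it creates the pickle and the recorded output in the same process; in a real test both would presumably be produced by an older release and stored as fixture files, which is an assumption on my part and not something taken from #5911.

```python
import pickle
import random

# Build a generator and advance it, as an "old" session would have done.
rng = random.Random(2015)
rng.random()

# What a bundled fixture would store: the pickled generator ...
blob = pickle.dumps(rng)
# ... and the output the old stream went on to produce.
recorded = [rng.random() for _ in range(5)]

# The check: the pickle still loads, and the restored generator
# continues the stream exactly where the original left off.
restored = pickle.loads(blob)
replayed = [restored.random() for _ in range(5)]

print(replayed == recorded)
```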

Antony Lee

June 2015

10:07 a.m.

2015-05-29 14:06 GMT-07:00 Antony Lee <antony.lee@berkeley.edu>:

...


Kindly bumping the issue. Antony
