[Numpy-discussion] NEP: Random Number Generator Policy

Ralf Gommers ralf.gommers at gmail.com
Mon Jun 11 02:53:07 EDT 2018


On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Sun, Jun 10, 2018 at 8:11 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >
> > On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern <robert.kern at gmail.com>
> wrote:
> >>
> >> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >> >
> >> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser <
> warren.weckesser at gmail.com> wrote:
> >>
> >> >> I suspect many of the tests will be easy to update, so fixing 300 or
> so tests does not seem like a monumental task.
> >> >
> >> > It's all not monumental, but it adds up quickly. In addition to
> changing tests, one will also need compatibility code when supporting
> multiple numpy versions (e.g. scipy when get a copy of RandomStable in
> scipy/_lib/_numpy_compat.py).
> >> >
> >> > A quick count of just np.random.seed occurrences with ``$ grep -roh
> --include \*.py np.random.seed . | wc -w`` for some packages:
> >> > numpy: 77
> >> > scipy: 462
> >> > matplotlib: 204
> >> > statsmodels: 461
> >> > pymc3: 36
> >> > scikit-image: 63
> >> > scikit-learn: 69
> >> > keras: 46
> >> > pytorch: 0
> >> > tensorflow: 368
> >> > astropy: 24
> >> >
> >> > And note, these are *not* incorrect/broken usages, this is code that
> works and has done so for years.
> >>
> >> Yes, some of them are incorrect and broken. Failure can be difficult to
> detect. This module from keras is particularly problematic:
> >>
> >>   https://github.com/keras-team/keras-preprocessing/blob/
> master/keras_preprocessing/image.py
> >
> > You have to appreciate that we're not all thinking at lightning speed
> and in the same direction. If there is a difficult to detect problem, it
> may be useful to give a brief code example (or even line of reasoning) of
> how this actually breaks something.
>
> Ahem. Sorry. That wasn't the code I was thinking of. It's merely
> hazardous, not broken by itself. However, if you used any of the `seed=`
> arguments that are helpfully(?) provided, you are almost certainly writing
> broken code. If you must use np.random.seed() to get reproducibility, you
> need to call it exactly once at the start of your code (or maybe once for
> each process) and let it ride.
>
> This is the impossible-to-use-correctly code that I was thinking of, which
> got partially fixed after I pointed out the problem.
>
>   https://github.com/keras-team/keras/pull/8325/files
>
> The intention of this code is to shuffle two same-length sequences in the
> same way. So now if I write my code well to call np.random.seed() once at
> the start of my program, this function comes along and obliterates that
> with a fixed seed just so it can reuse the seed again to replicate the
> shuffle.
>

Yes, that's a big no-no. There are situations conceivable where a library
has to set a seed, but I think the right pattern in that case would be
something like

old_state = np.random.get_state()
np.random.seed(some_int)
do_stuff()
np.random.set_state(**old._state)


> Puzzlingly, the root sin of unconditionally and unavoidably reseeding for
> some of these functions is still there even though I showed how and why to
> avoid it. This is one reason why I was skeptical that merely documenting
> RandomState or StableRandom to only be used for unit tests would work. :-)
>

Well, no matter what we do, I'm sure that there'll be lots of people who
will still get it wrong:)


> >> > Conclusion: the current proposal will cause work for the vast
> majority of libraries that depends on numpy. The total amount of that work
> will certainly not be counted in person-days/weeks, and more likely in
> years than months. So I'm not convinced yet that the current proposal is
> the best way forward.
> >>
> >> The mere usage of np.random.seed() doesn't imply that these packages
> actually require stream-compatibility. Some might, for sure, like where
> they are used in the unit tests, but that's not what you counted. At best,
> these numbers just mean that we can't eliminate np.random.seed() in a new
> system right away.
> >
> > Well, mere usage has been called an antipattern (also on your behalf),
> plus for scipy over half of the usages do give test failures (Warren's
> quick test). So I'd say that counting usages is a decent proxy for the work
> that has to be done.
>
> Sure. But with my new proposal, we don't have to change it (as much as I'd
> like to!). I'll draft up a PR to modify my NEP accordingly.
>

Sounds good!

Cheers,
Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20180610/dada43fb/attachment.html>


More information about the NumPy-Discussion mailing list