[Numpy-discussion] NEP: Random Number Generator Policy

Robert Kern robert.kern at gmail.com
Mon Jun 11 02:15:44 EDT 2018


On Sun, Jun 10, 2018 at 8:11 PM Ralf Gommers <ralf.gommers at gmail.com> wrote:
>
> On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern <robert.kern at gmail.com>
wrote:
>>
>> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers <ralf.gommers at gmail.com>
wrote:
>> >
>> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser <
warren.weckesser at gmail.com> wrote:
>>
>> >> I suspect many of the tests will be easy to update, so fixing 300 or
so tests does not seem like a monumental task.
>> >
>> > It's all not monumental, but it adds up quickly. In addition to
changing tests, one will also need compatibility code when supporting
multiple numpy versions (e.g. scipy when get a copy of RandomStable in
scipy/_lib/_numpy_compat.py).
>> >
>> > A quick count of just np.random.seed occurrences with ``$ grep -roh
--include \*.py np.random.seed . | wc -w`` for some packages:
>> > numpy: 77
>> > scipy: 462
>> > matplotlib: 204
>> > statsmodels: 461
>> > pymc3: 36
>> > scikit-image: 63
>> > scikit-learn: 69
>> > keras: 46
>> > pytorch: 0
>> > tensorflow: 368
>> > astropy: 24
>> >
>> > And note, these are *not* incorrect/broken usages, this is code that
works and has done so for years.
>>
>> Yes, some of them are incorrect and broken. Failure can be difficult to
detect. This module from keras is particularly problematic:
>>
>>
https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py
>
> You have to appreciate that we're not all thinking at lightning speed and
in the same direction. If there is a difficult to detect problem, it may be
useful to give a brief code example (or even line of reasoning) of how this
actually breaks something.

Ahem. Sorry. That wasn't the code I was thinking of. It's merely hazardous,
not broken by itself. However, if you used any of the `seed=` arguments
that are helpfully(?) provided, you are almost certainly writing broken
code. If you must use np.random.seed() to get reproducibility, you need to
call it exactly once at the start of your code (or maybe once for each
process) and let it ride.

This is the impossible-to-use-correctly code that I was thinking of, which
got partially fixed after I pointed out the problem.

  https://github.com/keras-team/keras/pull/8325/files

The intention of this code is to shuffle two same-length sequences in the
same way. So now if I write my code well to call np.random.seed() once at
the start of my program, this function comes along and obliterates that
with a fixed seed just so it can reuse the seed again to replicate the
shuffle.

Puzzlingly, the root sin of unconditionally and unavoidably reseeding for
some of these functions is still there even though I showed how and why to
avoid it. This is one reason why I was skeptical that merely documenting
RandomState or StableRandom to only be used for unit tests would work. :-)

>> > Conclusion: the current proposal will cause work for the vast majority
of libraries that depends on numpy. The total amount of that work will
certainly not be counted in person-days/weeks, and more likely in years
than months. So I'm not convinced yet that the current proposal is the best
way forward.
>>
>> The mere usage of np.random.seed() doesn't imply that these packages
actually require stream-compatibility. Some might, for sure, like where
they are used in the unit tests, but that's not what you counted. At best,
these numbers just mean that we can't eliminate np.random.seed() in a new
system right away.
>
> Well, mere usage has been called an antipattern (also on your behalf),
plus for scipy over half of the usages do give test failures (Warren's
quick test). So I'd say that counting usages is a decent proxy for the work
that has to be done.

Sure. But with my new proposal, we don't have to change it (as much as I'd
like to!). I'll draft up a PR to modify my NEP accordingly.

--
Robert Kern
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20180610/a04aedd7/attachment-0001.html>


More information about the NumPy-Discussion mailing list