[Numpy-discussion] numpy.random and multiprocessing

Thu Dec 11 10:57:26 EST 2008

On Fri, Dec 12, 2008 at 12:20 AM, Gael Varoquaux
<gael.varoquaux at normalesup.org> wrote:
> Hi there,
>
> I have been using the multiprocessing module a lot to do statistical tests
> such as Monte Carlo or resampling, and I have just discovered something
> that makes me wonder if I haven't been accumulating false results. Given
> two files:
>
> === test.py ===
> from test_helper import task
> from multiprocessing import Pool
>
> p = Pool(4)
>
> jobs = list()
> for i in range(4):
>    jobs.append(p.apply_async(task, (4, )))
>
> print([j.get() for j in jobs])
>
> p.close()
> p.join()
>
> === test_helper.py ===
> import numpy as np
>
> def task(x):
>    return np.random.random(x)
>
> =======
>
> If I run test.py, I get:
>
> [array([ 0.35773964,  0.63945684,  0.50855196,  0.08631373]), array([
> 0.35773964,  0.63945684,  0.50855196,  0.08631373]), array([ 0.35773964,
> 0.63945684,  0.50855196,  0.08631373]), array([ 0.65357725,  0.35649382,
> 0.02203999,  0.7591353 ])]
>
> In other words, the 4 processes give me exactly the same results.

Why do you say the results are the same? They don't look the same to
me: only the first three are the same.

> Now I understand why this is the case: the different instances of the
> random number generator were created by forking from the same process,
> so they are exactly the very same object. This is however a fairly bad
> trap. I guess other people will fall into it.

I am not sure I follow: objects in Python are not the same after you
fork a process, or else I don't understand what you mean by "same".
They may be initialized the same way, though.

Isn't the problem simply due to seeding from the same value? For such
a tiny problem (4 tasks whose processing time is negligible), the
seed will be the same, since the interval between the samplings will
be small.
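
To illustrate the point about identical initialization, here is a
minimal sketch. It builds two separate generator objects from the same
seed and compares their output. The seed value 1234 and the variable
names are arbitrary choices for this example, not something taken from
the test above.

```python
import numpy as np

# Two distinct generator objects initialized with the same (arbitrary) seed.
rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)

draws_a = rng_a.random_sample(4)
draws_b = rng_b.random_sample(4)

# Compare object identity and the generated values separately.
same_object = rng_a is rng_b
same_draws = np.array_equal(draws_a, draws_b)
print(same_object, same_draws)
```

The objects are different, yet the draws match, which is what I mean by
"initialized the same way".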

Taking a look at the mtrand code in numpy, if the seed is not given,
it is taken from /dev/random if available, or from the clock if not. I
don't know what the semantics are for concurrent access to /dev/random
(is it guaranteed that two processes will get different values from
it?).

To confirm this, you could try your toy example with 500 jobs
instead of 4: in that case, it is unlikely that they all use the same
underlying value as a starting point, even if there is no guarantee on
concurrent access to /dev/random.
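
The check suggested above could be sketched as follows. The helper
names count_distinct and run_jobs are made up for this example, and the
worker count and sample size simply mirror the quoted test. No expected
count is given, since that is exactly what the experiment is meant to
find out.

```python
import multiprocessing

import numpy as np


def task(n):
    # Same task as in the quoted test_helper.py: draw n uniform samples
    # from numpy's global generator.
    return np.random.random(n)


def count_distinct(results):
    # Arrays are not hashable, so compare them through their tuple form.
    return len(set(tuple(float(v) for v in r) for r in results))


def run_jobs(n_jobs=500, n_workers=4, n_samples=4):
    # Submit n_jobs identical tasks to a pool and collect every result.
    pool = multiprocessing.Pool(n_workers)
    try:
        jobs = [pool.apply_async(task, (n_samples,)) for _ in range(n_jobs)]
        return [j.get() for j in jobs]
    finally:
        pool.close()
        pool.join()
```

Call run_jobs() from under an `if __name__ == "__main__":` guard in a
script and pass the result to count_distinct(). A count well below the
number of jobs would mean that some jobs are repeating each other's
values.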

> I wonder if we can find a way to make this more user friendly? Would it
> be easy, in the C code, to check whether the PID has changed and, if so,
> reseed the random number generator? I can open up a ticket for this if
> people think this is desirable (I think so).

This sounds like too much magic for a very particular use: there may
be cases where you want the same seed in multiple processes (what if
your processes are not created by multiprocessing, and you want to
make sure you have the same seed?).
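
If control over the seed is the goal, one option is to leave the global
generator alone and pass a seed to each task explicitly. A minimal
sketch, assuming a hypothetical two-argument variant of task (the seed
parameter is not in the original test_helper.py):

```python
import numpy as np


def task(n, seed):
    # Hypothetical variant of the quoted task: each call builds its own
    # generator from the seed it is given, instead of relying on the
    # global numpy.random state inherited from the parent process.
    rng = np.random.RandomState(seed)
    return rng.random_sample(n)
```

With the pool from the test above, this could be submitted as
p.apply_async(task, (4, i)), so that job i gets seed i. Jobs given the
same seed still reproduce the same values, which covers the case where
identical streams are wanted on purpose.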

David