Hi there,

I have been using the multiprocessing module a lot to do statistical tests such as Monte Carlo or resampling, and I have just discovered something that makes me wonder if I haven't been accumulating false results. Given two files:

=== test.py ===
from test_helper import task
from multiprocessing import Pool

p = Pool(4)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, )))

print [j.get() for j in jobs]

p.close()
p.join()

=== test_helper.py ===
import numpy as np

def task(x):
    return np.random.random(x)
=======

If I run test.py, I get:

[array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]),
 array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]),
 array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]),
 array([ 0.65357725, 0.35649382, 0.02203999, 0.7591353 ])]

In other words, the 4 processes give me the same exact results.

Now I understand why this is the case: the different instances of the random number generator were created by forking from the same process, so they are exactly the very same object. This is however a fairly bad trap. I guess other people will fall into it.

The take home message is: **call 'numpy.random.seed()' when you are using multiprocessing**

I wonder if we can find a way to make this more user friendly? Would it be easy, in the C code, to check if the PID has changed and, if so, reseed the random number generator? I can open up a ticket for this if people think this is desirable (I think so).

On a side note, there are a score of functions in numpy.random with __module__ set to None. This makes them inconvenient to use with multiprocessing (for instance, it forced the creation of the 'test_helper' file here).

Gaël
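[Editorial note: a minimal sketch, not part of the original message, of the take-home advice above. Instead of calling numpy.random.seed() inside every task, the pool's workers can each be reseeded once, right after they are created, via Pool's initializer argument. The file name and the reseed helper are made up for the illustration.]

=== test_reseeded.py ===
from test_helper import task
from multiprocessing import Pool
import numpy as np

def reseed():
    # Runs once in every freshly started worker: throw away the generator
    # state inherited from the parent and reseed from the OS entropy source.
    np.random.seed()

p = Pool(4, initializer=reseed)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, )))
print [j.get() for j in jobs]
p.close()
p.join()
=======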
On Fri, Dec 12, 2008 at 12:20 AM, Gael Varoquaux
Hi there,
I have been using the multiprocessing module a lot to do statistical tests such as Monte Carlo or resampling, and I have just discovered something that makes me wonder if I haven't been accumulating false results. Given two files:
=== test.py ===
from test_helper import task
from multiprocessing import Pool

p = Pool(4)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, )))

print [j.get() for j in jobs]

p.close()
p.join()

=== test_helper.py ===
import numpy as np

def task(x):
    return np.random.random(x)
=======
If I run test.py, I get:
[array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]),
 array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]),
 array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]),
 array([ 0.65357725, 0.35649382, 0.02203999, 0.7591353 ])]
In other words, the 4 processes give me the same exact results.
Why do you say the results are the same ? They don't look the same to me - only the first three are the same.
Now I understand why this is the case: the different instances of the random number generator were created by forking from the same process, so they are exactly the very same object. This is however a fairly bad trap. I guess other people will fall into it.
I am not sure I am following: the objects in python are not the same if you fork a process, or I don't understand what you mean by same. They may be initialized the same way, though. Isn't the problem simply due to seeding from the same value? For such a tiny problem (4 tasks whose processing time is negligible), the seed will be the same since the intervals between the samplings will be small. Taking a look at the mtrand code in numpy, if the seed is not given, it is taken from /dev/random if available, or the time clock if not; I don't know what the semantics are for concurrent access to /dev/random (is it guaranteed that two processes will get different values from it?). To confirm this, you could try your toy example with 500 jobs instead of 4: in that case, it is unlikely they use the same underlying value as a starting point, even if there is no guarantee on concurrent access to /dev/random.
I wonder if we can find a way to make this more user friendly? Would it be easy, in the C code, to check if the PID has changed and, if so, reseed the random number generator? I can open up a ticket for this if people think this is desirable (I think so).
This sounds like too much magic for a very particular use: there may be cases where you want the same seed in multiple processes (what if your processes are not created from multiprocessing, and you want to make sure you have the same seed?). David
On Fri, Dec 12, 2008 at 12:57 AM, David Cournapeau
Taking a look at the mtrand code in numpy, if the seed is not given, it is taken from /dev/random if available, or the time clock if not; I don't know what the semantics are for concurrent access to /dev/random (is it guaranteed that two processes will get different values from it?).
Sorry, the mtrand code uses /dev/urandom, not /dev/random, if available. David
Fri, 12 Dec 2008 00:57:26 +0900, David Cournapeau wrote: [clip]
On Fri, Dec 12, 2008 at 12:20 AM, Gael Varoquaux wrote: [clip]
Now I understand why this is the case: the different instances of the random number generator were created by forking from the same process, so they are exactly the very same object. This is however a fairly bad trap. I guess other people will fall into it.
I am not sure I am following: the objects in python are not the same if you fork a process, or I don't understand what you mean by same. They may be initialized the same way, though.
The RandomState object handling numpy.random.random is created (and seeded) at import time. So, an identical generator should be shared by all processes after that. -- Pauli Virtanen
On 12/11/2008 4:57 PM, David Cournapeau wrote:
Why do you say the results are the same ? They don't look the same to me - only the first three are the same.
He used the multiprocessing.Pool object. There is a possible race condition here: one or more of the forked processes may be doing nothing. They are all competing for tasks on a queue. It could be avoided by using multiprocessing.Process instead.
I am not sure I am following: the objects in python are not the same if you fork a process, or I don't understand what you mean by same. They may be initialized the same way, though.
When are they initialized? On import numpy or the first call to numpy.random.random? If they are initialized on the import numpy statement, they are initialized prior to forking and share state. This is because his statement 'from test_helper import task' actually triggers the import of numpy, and it occurs prior to any fork. This is also system dependent, by the way: on Windows, multiprocessing does not fork() and does not produce this problem. Sturla Molden
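[Editorial note: a small sketch, not from the thread, that makes the import-time seeding visible without multiprocessing. After an explicit os.fork(), parent and child inherit the generator state created at import time and therefore produce the same first draw. This assumes a Unix system, since os.fork is not available on Windows.]

----
import os
import numpy as np   # the global RandomState is created and seeded here

pid = os.fork()
# Parent and child carry identical copies of the generator state created
# by the import above, so this prints the same number twice.
print os.getpid(), np.random.random()
if pid:
    os.waitpid(pid, 0)
----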
On Thu, Dec 11, 2008 at 05:23:12PM +0100, Sturla Molden wrote:
On 12/11/2008 4:57 PM, David Cournapeau wrote:
Why do you say the results are the same ? They don't look the same to me - only the first three are the same.
He used the multiprocessing.Pool object. There is a possible race condition here: one or more of the forked processes may be doing nothing. They are all competing for tasks on a queue. It could be avoided by using multiprocessing.Process instead.
No, Pool is what I want, because in my production code I am submitting jobs to that pool.
I am not sure I am following: the objects in python are not the same if you fork a process, or I don't understand what you mean by same. They may be initialized the same way, though.
When are they initialized? On import numpy or the first call to numpy.random.random?
mtrand.pyx seems pretty clear about that: on import.
If they are initialized on the import numpy statement, they are initialized prior to forking and share state. This is because his statement 'from test_helper import task' actually triggers the import of numpy, and it occurs prior to any fork.
This is what I thought too. However, inserting a long enough sleep statement in my spawning loop recovers entropy. I am confused. Gaël
On 12/11/2008 5:39 PM, Gael Varoquaux wrote:
Why do you say the results are the same ? They don't look the same to me - only the first three are the same.
He used the multiprocessing.Pool object. There is a possible race condition here: one or more of the forked processes may be doing nothing. They are all competing for tasks on a queue. It could be avoided by using multiprocessing.Process instead.
No, Pool is what I want, because in my production code I am submitting jobs to that pool.
Sure, a pool is fine. I was just speculating that one of the four processes in your pool was idle all the time; i.e. that one of the other three got to do the task twice. Therefore you only got three identical results and not four. It depends on how the OS schedules the processes, the number of logical CPUs, etc. You have no control over that. But if you had used N instances of multiprocessing.Process instead, all N results should have been identical (if the 'random' generator is completely deterministic) - because each process would do the task once. I.e. you only got three identical results due to a race condition in the task queue. But you don't want similar results, do you? So if you remember to seed the random number generators after forking, this race condition should be of no significance.
mtrand.pyx seems pretty clear about that: on import.
In which case they are initialized prior to forking. Sturla Molden
On Thu, Dec 11, 2008 at 05:55:58PM +0100, Sturla Molden wrote:
No, Pool is what I want, because in my production code I am submitting jobs to that pool.
Sure, a pool is fine. I was just speculating that one of the four processes in your pool was idle all the time; i.e. that one of the other three got to do the task twice. Therefore you only got three identical results and not four. It depends on how the OS schedules the processes, the number of logical CPUs, etc. You have no control over that. But if you had used N instances of multiprocessing.Process instead, all N results should have been identical (if the 'random' generator is completely deterministic) - because each process would do the task once.
I.e. you only got three identical results due to a race condition in the task queue.
Gotcha! Good explanation. Now I understand my previous investigation better. I think you are completely right. So indeed, as I initially thought, using multiprocessing without reseeding is going to get you in big trouble (and this is what I experienced in my code). Thanks for the explanation, Gaël
Thu, 11 Dec 2008 17:55:58 +0100, Sturla Molden wrote: [clip]
Sure, a pool is fine. I was just speculating that one of the four processes in your pool was idle all the time; i.e. that one of the other three got to do the task twice. Therefore you only got three identical results and not four. It depends on how the OS schedules the processes, the number of logical CPUs, etc. You have no control over that. But if you had used N instances of multiprocessing.Process instead, all N results should have been identical (if the 'random' generator is completely deterministic) - because each process would do the task once.
I.e. you only got three identical results due to a race condition in the task queue.
Exactly, change test_helper.py to

----
import numpy as np

def task(x):
    import os
    print "Hi, I'm", os.getpid()
    return np.random.random(x)
----

and note the output

----
Hi, I'm 16197
Hi, I'm 16198
Hi, I'm 16199
Hi, I'm 16199
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.59574921 0.61554857 0.06155764 0.75352295]
----

-- Pauli Virtanen
Exactly, change test_helper.py to

----
import numpy as np

def task(x):
    import os
    print "Hi, I'm", os.getpid()
    return np.random.random(x)
----

and note the output

----
Hi, I'm 16197
Hi, I'm 16198
Hi, I'm 16199
Hi, I'm 16199
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.59574921 0.61554857 0.06155764 0.75352295]
Shouldn't numpy (and/or multiprocessing) be smart enough to prevent this kind of error? A simple enough solution would be to also include the process id as part of the seed since it appears that the problem only occurs when you have different processes/threads accessing the random number generator at the same time. Regards, Mike
Michael Gilbert wrote:
Exactly, change test_helper.py to

----
import numpy as np

def task(x):
    import os
    print "Hi, I'm", os.getpid()
    return np.random.random(x)
----

and note the output

----
Hi, I'm 16197
Hi, I'm 16198
Hi, I'm 16199
Hi, I'm 16199
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.58175647 0.16293922 0.30488182 0.67367263]
[ 0.59574921 0.61554857 0.06155764 0.75352295]
Shouldn't numpy (and/or multiprocessing) be smart enough to prevent this kind of error? A simple enough solution would be to also include the process id as part of the seed since it appears that the problem only occurs when you have different processes/threads accessing the random number generator at the same time.
But the seed is set only once in the above code. So the problem has nothing to do with numpy. I don't think using the pid as a seed is a good idea either - for each task, it should be set to a true random source. David
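[Editorial note: a minimal sketch, not from the thread, of what seeding each task from a true random source could look like: read a few bytes from os.urandom inside the task and use them as the seed. The struct-based conversion is just one way to turn the bytes into an integer.]

----
import os
import struct
import numpy as np

def task(x):
    # 4 bytes from the OS entropy pool, turned into an unsigned 32-bit seed,
    # so every call starts from a state independent of the parent's.
    seed = struct.unpack('=I', os.urandom(4))[0]
    np.random.seed(seed)
    return np.random.random(x)
----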
On 12/11/2008 6:10 PM, Michael Gilbert wrote:
Shouldn't numpy (and/or multiprocessing) be smart enough to prevent this kind of error? A simple enough solution would be to also include the process id as part of the seed
It would not help, as the seeding is done prior to forking. I am mostly familiar with Windows programming. But what is needed is a fork handler (similar to a system hook in Windows jargon) that sets a new seed in the child process. Could pthread_atfork be used? Sturla Molden
Sturla Molden wrote:
On 12/11/2008 6:10 PM, Michael Gilbert wrote:
Shouldn't numpy (and/or multiprocessing) be smart enough to prevent this kind of error? A simple enough solution would be to also include the process id as part of the seed
It would not help, as the seeding is done prior to forking.
I am mostly familiar with Windows programming. But what is needed is a fork handler (similar to a system hook in Windows jargon) that sets a new seed in the child process.
Could pthread_atfork be used?
The seed could be explicitly set in each task, no?

def task(x):
    np.random.seed()
    return np.random.random(x)

But does this really make sense? Is the goal to parallelize a big sampler into N tasks of M trials, to produce the same result as a sequential set of M*N trials? Then it does not sound like a trivial task at all. I know there exist libraries explicitly designed for parallel random number generation - maybe this is where we should look, instead of using heuristics which are likely to be bogus and generate wrong results. cheers, David
On Fri, Dec 12, 2008 at 02:29:55AM +0900, David Cournapeau wrote:
The seed could be explicitly set in each task, no ?
def task(x):
    np.random.seed()
    return np.random.random(x)
Yes. The problem is trivial to solve once you are aware of it, just like the integer division problems we used to have back in the days when zeros, ones, ... returned integer arrays. The point is that people will run into this problem and lose a lot of time. So we must make it so that they do not land in this situation by mistake, but only on purpose. One solution is to check the PID of the process when the PRNG is called, and reseed if it has changed. As pointed out, the danger of this is that this is magic, so there needs to be an option to turn this off. Gaël
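[Editorial note: a pure-Python sketch, not from the thread, of the PID check being proposed; in numpy it would live in the C code of mtrand, but the logic is the same. The ForkSafeRandom name is made up for the illustration.]

----
import os
import numpy as np

class ForkSafeRandom(object):
    """Reseed the wrapped RandomState the first time it is used after a fork,
    detected by a change of PID."""

    def __init__(self, seed=None):
        self._pid = os.getpid()
        self._state = np.random.RandomState(seed)

    def random(self, size):
        if os.getpid() != self._pid:   # we are in a freshly forked child
            self._pid = os.getpid()
            self._state.seed()         # draw a new seed from the OS
        return self._state.random_sample(size)
----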
On Fri, Dec 12, 2008 at 2:49 AM, Gael Varoquaux
On Fri, Dec 12, 2008 at 02:29:55AM +0900, David Cournapeau wrote:
The seed could be explicitly set in each task, no ?
def task(x):
    np.random.seed()
    return np.random.random(x)
Yes. The problem is trivial to solve once you are aware of it, just like the integer division problems we used to have back in the days when zeros, ones, ... returned integer arrays. The point is that people will run into this problem and lose a lot of time. So we must make it so that they do not land in this situation by mistake, but only on purpose.
One solution is to check the PID of the process when the PRNG is called, and reseed if it has changed. As pointed out, the danger of this is that this is magic, so there needs to be an option to turn this off.
The biggest danger is that the whole method may not make sense at all, and lose all the properties of a good random number generator. I don't understand your comparison with integer division: this is not an API or expected behavior problem, but an algorithmic one. David
David Cournapeau wrote:
Sturla Molden wrote:
On 12/11/2008 6:10 PM, Michael Gilbert wrote:
Shouldn't numpy (and/or multiprocessing) be smart enough to prevent this kind of error? A simple enough solution would be to also include the process id as part of the seed
It would not help, as the seeding is done prior to forking.
I am mostly familiar with Windows programming. But what is needed is a fork handler (similar to a system hook in Windows jargon) that sets a new seed in the child process.
Could pthread_atfork be used?
The seed could be explicitly set in each task, no ?
def task(x):
    np.random.seed()
    return np.random.random(x)
But does this really make sense ?
Is the goal to parallelize a big sampler into N tasks of M trials, to produce the same result as a sequential set of M*N trials? Then it does not sound like a trivial task at all. I know there exist libraries explicitly designed for parallel random number generation - maybe this is where we should look, instead of using heuristics which are likely to be bogus and generate wrong results.
cheers,
David
This is not sufficient because you cannot ensure that the seed will be different every time task() is called. A major part of the problem here is treating a parallel computing problem as a serial computing problem. The streams must be independent across threads, especially avoiding cross-correlation between streams (another gotcha). It is up to the user to implement a thread-safe solution, such as using a single stream shared by all threads or forcing the different threads to start at different states. The only thing that Numpy could do is provide a parallel pseudo-random number generator. Bruce
On Fri, Dec 12, 2008 at 3:00 AM, Bruce Southey
David Cournapeau wrote:
Sturla Molden wrote:
On 12/11/2008 6:10 PM, Michael Gilbert wrote:
Shouldn't numpy (and/or multiprocessing) be smart enough to prevent this kind of error? A simple enough solution would be to also include the process id as part of the seed
It would not help, as the seeding is done prior to forking.
I am mostly familiar with Windows programming. But what is needed is a fork handler (similar to a system hook in Windows jargon) that sets a new seed in the child process.
Could pthread_atfork be used?
The seed could be explicitly set in each task, no ?
def task(x):
    np.random.seed()
    return np.random.random(x)
But does this really make sense ?
Is the goal to parallelize a big sampler into N tasks of M trials, to produce the same result as a sequential set of M*N trials? Then it does not sound like a trivial task at all. I know there exist libraries explicitly designed for parallel random number generation - maybe this is where we should look, instead of using heuristics which are likely to be bogus and generate wrong results.
cheers,
David
This is not sufficient because you cannot ensure that the seed will be different every time task() is called.
Yes, right. I was assuming that each seed call would result in a /dev/urandom read - but the problem is the same whether it is done in task or in a pthread_atfork method anyway.
The only thing that Numpy could do is provide a parallel pseudo-random number generator.
Yes, exactly - hence my question whether this makes sense at all. Even having different, "truly" random seeds does not guarantee that the whole method makes sense - at least, I don't see why it should. In particular, if the process should give the same result independently of the number of parallel tasks, the problem becomes difficult. Intrigued by the problem, I briefly looked into the literature on parallel RNGs; it certainly does not look like an easy task, and the chance of getting it right without knowing about the topic does not look high. cheers, David
On 12/11/2008 6:29 PM, David Cournapeau wrote:
def task(x):
    np.random.seed()
    return np.random.random(x)
But does this really make sense ?
Hard to say... There is a chance of this producing identical or overlapping sequences, albeit unlikely. I would not do this. I'd make one process responsible for making the random numbers and write those to a queue. It would scale if generating the deviates is the least costly part of the algorithm. Sturla Molden

=== test.py ===
from test_helper import task, generator
from multiprocessing import Pool, Process, Manager

# A manager queue, since a plain multiprocessing.Queue cannot be passed
# to pool workers through apply_async.
m = Manager()
q = m.Queue(maxsize=32) # or whatever

g = Process(target=generator, args=(4, q)) # preferably a number much larger than 4!!!
g.start()

p = Pool(4)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (q,)))

print [j.get() for j in jobs]

p.close()
p.join()
g.terminate()

=== test_helper.py ===
import numpy as np

def generator(x, q):
    while 1:
        item = np.random.random(x)
        q.put(item)

def task(q):
    return q.get()
Is the goal to parallelize a big sampler into N tasks of M trials, to produce the same result as a sequential set of M*N trials? Then it does not sound like a trivial task at all. I know there exist libraries explicitly designed for parallel random number generation - maybe this is where we should look, instead of using heuristics which are likely to be bogus and generate wrong results.
cheers,
David
Is the goal to parallelize a big sampler into N tasks of M trials, to produce the same result as a sequential set of M*N trials? Then it does not sound like a trivial task at all. I know there exist libraries explicitly designed for parallel random number generation - maybe this is where we should look, instead of using heuristics which are likely to be bogus and generate wrong results.
Another heuristic: use a pseudo-random seed for each process. Generate (large) random integers in the main process, and send them as seeds to each task. This makes the runs replicable if the initial seed is set, and should give independent "pseudo" random numbers in each stream. This works in probability theory, but I don't know about the quality of the RNGs. Josef
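[Editorial note: a minimal sketch, not from the thread, of this heuristic applied to the original example. The parent draws one large integer seed per task and passes it along with the arguments, so the whole run is reproducible from the single seed set at the top. The file names are made up for the illustration.]

=== test_seeded.py ===
from test_helper_seeded import task
from multiprocessing import Pool
import numpy as np

np.random.seed(12345)   # fixing this makes the whole run replicable
seeds = np.random.randint(0, 2**31 - 1, size=4)

p = Pool(4)
jobs = list()
for s in seeds:
    jobs.append(p.apply_async(task, (4, int(s))))
print [j.get() for j in jobs]
p.close()
p.join()

=== test_helper_seeded.py ===
import numpy as np

def task(x, seed):
    # Each task builds its own generator from the seed drawn in the parent.
    prng = np.random.RandomState(seed)
    return prng.random_sample(x)
=======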
On 12/11/2008 6:21 PM, Sturla Molden wrote:
It would not help, as the seeding is done prior to forking.
I am mostly familiar with Windows programming. But what is needed is a fork handler (similar to a system hook in Windows jargon) that sets a new seed in the child process.
Actually I am not sure this should be done, as this issue technically speaking is not an error. A warning in the documentation would be better. Perhaps we should write a proper numpy + multiprocessing tutorial? Sturla Molden
On Fri, Dec 12, 2008 at 12:57:26AM +0900, David Cournapeau wrote:
[array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([ 0.65357725, 0.35649382, 0.02203999, 0.7591353 ])]
In other words, the 4 processes give me the same exact results.
Why do you say the results are the same ? They don't look the same to me - only the first three are the same.
Correct, and I wonder why. When I try on my box now, I almost always get the same four, but not all the time. More on that below.
Now I understand why this is the case: the different instances of the random number generator were created by forking from the same process, so they are exactly the very same object. This is however a fairly bad trap. I guess other people will fall into it.
I am not sure I am following: the objects in python are not the same if you fork a process, or I don't understand what you mean by same. They may be initialized the same way, though.
Yes, they are initialized with the same seed value. I call them the same because right after the fork they are. They can evolve separately, though. However, our PRNG is completely defined by its seed, AFAIK.
Isn't the problem simply due to seeding from the same value? For such a tiny problem (4 tasks whose processing time is negligible), the seed will be the same since the intervals between the samplings will be small.
Right, but I found the problem in real code, that was not tiny at all.
Taking a look at the mtrand code in numpy, if the seed is not given, it is taken from /dev/random if available, or the time clock if not; I don't know what the semantics are for concurrent access to /dev/random (is it guaranteed that two processes will get different values from it?).
To confirm this, you could try your toy example with 500 jobs instead of 4: in that case, it is unlikely they use the same underlying value as a starting point, even if there is no guarantee on concurrent access to /dev/random.
I found the problem on way bigger code. I have only 8 CPUs, so I run 8 jobs, and each job loops over the tasks. I noticed that the variance was much smaller than expected. The jobs take 10 minutes, so you can't call them tiny or fast. The problem indeed appears in production code.

The way I interpret this is that the seed is created only at module-import time (this is how I read the code in mtrand.pyx). For all my processes, the seed was created when numpy was imported in the mother process. After the fork, the seed is the same in each process. As a result, the entropy of the whole system is clearly not the entropy of 4 independent systems. As you point out, the fourth value in my toy example differs from the others, so somehow my picture is not exact. But it remains that the entropy is way too low in my production code. I don't understand why, once in a while, there is a value that is different. That could be because numpy is reimported in the child processes. If I insert a 'time.sleep' in my for loop that spawns the processes, I get significantly higher entropy only if the sleep is around 1 second.

Looking at the seed code (rk_randomseed in randomkit.c), it seems that /dev/urandom is not used, contrary to what the random.seed docstring claims, and what is really used is _ftime under Windows and gettimeofday under Unix. It does seem, though, that the milliseconds are used. I must admit I don't fully understand why this happens. I thought that:

a) Modules were not reimported with multiprocessing, thanks to the fork. If this were true, reading mtrand.pyx, all subprocesses should have the same seed.

b) /dev/urandom was used to seed. This seems wrong. Reading the code shows no /dev/urandom in the seeding parts.

c) Milliseconds were used, so we should be rather safe from these race conditions. The code does seem to hint toward that, but if I add a sleep(0.01) to my loop, I don't get enough entropy. I did check that sleep(0.01) was sleeping at least 0.01 seconds.
I wonder if we can find a way to make this more user friendly? Would it be easy, in the C code, to check if the PID has changed and, if so, reseed the random number generator? I can open up a ticket for this if people think this is desirable (I think so).
This sounds like too much magic for a very particular use: there may be cases where you want the same seed in multiple processes (what if your processes are not created from multiprocessing, and you want to make sure you have the same seed?).
Well, yes, for code that wants to explicitly control the seed, reseeding automatically would be a problem, and we need to figure out a way to make this deterministic (e.g. for testing purposes). However, this is a small use case, and when testing, people need to be aware of seeding problems (although they might not understand fork semantics). More and more people are going to be using multiprocessing: it comes with the standard library, and standard boxes nowadays have many cores, and will soon have many more. Resampling and brute-force Monte Carlo techniques are embarrassingly parallel, so people will want to use parallel computing on them. I fear many others are going to fall into this trap. Gaël
On Thu, Dec 11, 2008 at 05:36:47PM +0100, Gael Varoquaux wrote:
b) /dev/urandom was used to seed. This seems wrong. Reading the code shows no /dev/urandom in the seeding parts.
Actually, I am wrong here: /dev/urandom is indeed used in 'rk_devfill', which is used in the seeding routine. It seems this is not enough. Gaël
Gael Varoquaux wrote: [clip]
Part of this is one of the gotchas of simulation that is not specific to multiprocessing and Python. It is just highly likely to occur in your case with multiprocessing, but it does occur in single processing too. As David indicated, many applications use a single source (often the computer time) to initialize the pseudo-random generators if an actual seed is not supplied. Depending on the resolution (most require an integer), minor changes may not be sufficient to change the seed. So the same seed will get used if the source has not sufficiently 'advanced' before the next initialization. If you really care about reproducing the streams, you should specify the seed anyhow. Bruce
On Thu, Dec 11, 2008 at 10:20:48AM -0600, Bruce Southey wrote:
Part of this is one of the gotchas of simulation that is not specific to multiprocessing and Python. It is just highly likely to occur in your case with multiprocessing, but it does occur in single processing too. As David indicated, many applications use a single source (often the computer time) to initialize the pseudo-random generators if an actual seed is not supplied. Depending on the resolution (most require an integer), minor changes may not be sufficient to change the seed. So the same seed will get used if the source has not sufficiently 'advanced' before the next initialization.
If you really care about reproducing the streams, you should specify the seed anyhow.
Well, it's not about me. I have found this out now, so I will know. It's about many other people who are going to stumble upon this. I don't think it is a good idea to count on people understanding these problems well enough not to be fooled by them. We should try to reduce that, as much as possible, without adding magic that renders the behavior incomprehensible. Gaël
I'd just like to add that yet another option would be to use a manager/proxy object from multiprocessing. In this case numpy.random.random will be called in the parent process. I have not used this and I am not sure how efficient it is. But the possibility is there. Sturla Molden

=== test.py ===
from test_helper import task, RandomManager
from multiprocessing import Pool

rm = RandomManager()
rm.start()
random = rm.Random()

p = Pool(4)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, random)))

print [j.get() for j in jobs]

p.close()
p.join()
rm.shutdown()

=== test_helper.py ===
import numpy as np
from multiprocessing.managers import BaseManager

class RandomClass(object):
    def random(self, x):
        return np.random.random(x)

class RandomManager(BaseManager):
    pass

RandomManager.register('Random', RandomClass)

def task(x, random):
    return random.random(x)

On 12/11/2008 4:20 PM, Gael Varoquaux wrote: [clip]
Here is the C program and a description of how to implement independent Mersenne Twister PRNGs, by the inventor(s) of the Mersenne Twister: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dc.html I didn't see a license statement. Josef
In the docs I found this: "We used a hypothesis that a set of PRNGs based on linear recurrences is mutually 'independent' if the characteristic polynomials are relatively prime to each other. There is no rigorous proof of this hypothesis..." S.M.
Here is the C program and a description of how to implement independent Mersenne Twister PRNGs, by the inventor(s) of the Mersenne Twister:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dc.html
I didn't see a license statement.
Josef
On Thu, Dec 11, 2008 at 07:20, Gael Varoquaux
The take home message is: **call 'numpy.random.seed()' when you are using multiprocessing**
Create RandomState objects and use those. This is a best practice whether you are using multiprocessing or not. The module-level functions really should only be used for noodling around in IPython. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
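[Editorial note: a small sketch, not from Robert's message, of that practice: pass a RandomState instance into the code that needs random numbers, instead of relying on the module-level functions. The bootstrap_mean helper is made up for the illustration.]

----
import numpy as np

def bootstrap_mean(data, n_resamples, prng):
    # The caller owns the generator, and therefore the seed.
    idx = prng.randint(0, len(data), size=(n_resamples, len(data)))
    return data[idx].mean(axis=1)

data = np.arange(10.0)
prng = np.random.RandomState(42)   # explicit, reproducible state
print bootstrap_mean(data, 5, prng)
----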
Create RandomState objects and use those. This is a best practice whether you are using multiprocessing or not. The module-level functions really should only be used for noodling around in IPython.
Are we guaranteed that two RandomStates will produce two independent sequences? If not, RandomState cannot be used for this particular purpose. Cf. what the creators of MT wrote about dynamically creating MT generators at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dc.html Sturla Molden
On Thu, Dec 11, 2008 at 13:06, Sturla Molden
Create RandomState objects and use those. This is a best practice whether you are using multiprocessing or not. The module-level functions really should only be used for noodling around in IPython.
Are we guaranteed that two RandomStates will produce two independent sequences?
No.
If not, RandomState cannot be used for this particular purpose.
For small numbers of processes and not-huge runs, I think it's reasonable. You can also implement skipping fairly straightforwardly. If you're in Python, the wasted time is probably a small part of the inefficiencies. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
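[Editorial note: one naive reading of the "skipping" Robert mentions, sketched here; this is not his implementation. Every worker starts from the same master seed but discards the first k blocks of draws, so worker k consumes a disjoint stretch of the single underlying stream. The block size must bound the number of draws any one worker will make, and the skipped draws are the wasted time referred to above.]

----
import numpy as np

BLOCK = 10**6   # upper bound on the draws any single worker will make

def worker_stream(master_seed, rank):
    prng = np.random.RandomState(master_seed)
    prng.random_sample(rank * BLOCK)   # skip the blocks used by workers 0..rank-1
    return prng

# e.g. worker number 2 out of 8:
prng = worker_stream(12345, 2)
print prng.random_sample(4)
----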
On Thu, Dec 11, 2008 at 8:20 AM, Gael Varoquaux < gael.varoquaux@normalesup.org> wrote:
Hi there,
I have been using the multiprocessing module a lot to do statistical tests such as Monte Carlo or resampling, and I have just discovered something that makes me wonder if I haven't been accumulating false results. Given two files:
You might also want to contact Bruce Carneal bcarneal@gmail.com, as he did some work on this. He is interested in clustering/multiprocessing simulations and is currently working on a clustering package. Chuck
participants (10)
- Bruce Southey
- Charles R Harris
- David Cournapeau
- David Cournapeau
- Gael Varoquaux
- josef.pktd@gmail.com
- Michael Gilbert
- Pauli Virtanen
- Robert Kern
- Sturla Molden