All the discussion recently about pyprocessing got me interested in actually benchmarking Python's multiprocessing performance to see if reality matched my expectations around what would scale up and what would not. I knew Python threads wouldn't be good for compute bound problems, but I was curious to see how well they worked for i/o bound problems. The short answer is that for i/o bound problems, python threads worked just as well as using multiple operating system processes.

I wrote two simple benchmarks, one compute bound and the other i/o bound. The compute bound one did a parallel matrix multiply and the i/o bound one read random records from a remote MySQL database. I ran each benchmark via python's thread module and via MPI (using mpi4py and openmpi and Send()/Recv() for communication). Each test was run multiple times and the numbers were consistent between test runs. I ran the tests on a dual-core Macbook Pro running OS X 10.5 and the included python 2.5.1.

1) Python threads

   a) compute bound:
      1 thread   -- 16 seconds
      2 threads  -- 21 seconds

   b) i/o bound:
      1 thread   -- 13 seconds
      4 threads  -- 10 seconds
      8 threads  --  5 seconds
      12 threads --  4 seconds

2) MPI

   a) compute bound:
      1 thread   -- 17 seconds
      2 threads  -- 11 seconds

   b) i/o bound:
      1 thread   -- 13 seconds
      4 threads  -- 10 seconds
      8 threads  --  6 seconds
      12 threads --  4 seconds
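The benchmark code itself was never posted to the thread, so the following is a hypothetical reconstruction of the thread-based cases only: a pure-Python inner loop stands in for the matrix multiply, and time.sleep() stands in for the blocking MySQL round-trips. It shows the same shape of result (compute bound doesn't scale with threads, i/o bound does):

```python
import threading
import time

def compute_bound(n=200000):
    # Pure-Python arithmetic loop: it holds the GIL the whole time,
    # so extra threads only add contention (cf. 16s -> 21s above).
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound(delay=0.05):
    # Simulated blocking read; the GIL is released while sleeping,
    # just as it is while waiting on a database socket.
    time.sleep(delay)

def bench(target, nthreads, jobs=8):
    # Split a fixed number of jobs across nthreads worker threads
    # pulling from a shared work list, then time the whole batch.
    lock = threading.Lock()
    work = list(range(jobs))

    def worker():
        while True:
            with lock:
                if not work:
                    return
                work.pop()
            target()

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

for n in (1, 4, 8):
    print("%d thread(s): compute %.2fs, i/o %.2fs"
          % (n, bench(compute_bound, n), bench(io_bound, n)))
```

The job count, loop size, and sleep interval are arbitrary stand-ins; only the scaling pattern is meant to match the numbers above.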
Tom Pinckney wrote:
[...] The short answer is that for i/o bound problems, python threads worked just as well as using multiple operating system processes.
Interesting - given that your example compute bound problem happened to be a matrix multiply, I'd be curious what the results are when using python threads with numpy to do the same thing (my understanding is that numpy will usually release the GIL while doing serious number-crunching).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org
I switched to using numpy for the matrix multiply, and while the overall time to do the matrix multiply is much faster, there is still no speed-up from using more than one python thread. If I look at top while running 2 or more threads, both cores are being used 100% and there is no idle time on the system.

I did a quick google search and didn't find anything conclusive about numpy releasing the GIL. The most conclusive and recent reference I found was http://mail.python.org/pipermail/python-list/2007-October/463148.html

I found some other references where people were expressing concern over numpy releasing the GIL, due to the fact that other C extensions could call numpy and unexpectedly have the GIL released on them (or something like that).
Tom Pinckney wrote:
If I look at top while running 2 or more threads, both cores are being used 100% and there is no idle time on the system.
If you run it with just one thread, does it use up only one core's worth of CPU? If so, this suggests that the GIL is being released. If it wasn't, two threads would still only use one core's worth.

Also, if you have only two cores, using more than two threads isn't going to gain you anything whatever happens.

--
Greg
Interestingly, I think there's something magic going on with numpy.dot() on my mac. If I just run a program without threading -- that is, just a numpy matrix multiply such as:

    import numpy

    a = numpy.empty((4000, 4000))
    b = numpy.empty((4000, 4000))
    c = numpy.dot(a, b)

then I see both cores fully maxed out on my mac. On a dual-core linux machine I see only one core maxed out by this program, and it runs VASTLY slower on the linux box.

It turns out that numpy on Macs uses Apple's Accelerate.framework BLAS and LAPACK, which in turn is multi-threaded as of OS X 10.4.8.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/thomaspinckney3%40gmail.co...
On Thu, 2008-05-15 at 21:02 -0400, Tom Pinckney wrote:
I found some other references where people were expressing concern over numpy releasing the GIL due to the fact that other C extensions could call numpy and unexpectedly have the GIL released on them (or something like that).
Could you please post links to those? I'm asking because AFAIK that concern doesn't really stand. Any (correct) code that releases the GIL is responsible for reacquiring it before calling *any* Python code, in fact before doing anything that might touch a Python object or its refcount.
Here's one example, albeit from a few years ago:

http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1625465

But I am a numpy novice, so I have no idea what it actually does in its current form.
On Fri, 2008-05-16 at 08:04 -0400, Tom Pinckney wrote:
Here's one example, albeit from a few years ago
http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1625465
Thanks for the pointer. I'm not sure I fully understand Konrad Hinsen's concerns, but maybe the problem is that Numpy's "number-crunching" needs to call back into Python frequently. The notion of "releasing the GIL for number-crunching" assumes that the code is structured like this:

1. code that works with python objects
2. acquire pointer to a C struct/array
3. release GIL
4. work with C objects ("crunch the numbers") without calling any Python code and without touching Python objects or refcounts
5. reacquire GIL

If step 4 needs to call into Python frequently, then this strategy won't really work.
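The stdlib's hashlib is a concrete, observable example of this structure: for sufficiently large buffers, CPython releases the GIL around the C hashing loop (step 4 above), so threads hashing independent data can overlap on multiple cores. A minimal sketch (the buffer size and thread counts are arbitrary):

```python
import hashlib
import threading
import time

DATA = b"x" * (32 * 1024 * 1024)  # 32 MB buffer

def crunch():
    # Step 4 happens inside the C hashing loop: CPython drops the GIL
    # for large buffers, so this can run in parallel with other threads.
    return hashlib.sha256(DATA).hexdigest()

def bench(nthreads):
    # Time nthreads concurrent hashes of the same buffer.
    threads = [threading.Thread(target=crunch) for _ in range(nthreads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

print("1 thread: %.3fs, 2 threads: %.3fs" % (bench(1), bench(2)))
```

On a multi-core machine the two-thread time should be well under double the one-thread time, which is exactly the behaviour the five-step pattern is meant to enable.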
2008/5/16 Hrvoje Nikšić <hrvoje.niksic@avl.com>:
If step 4 needs to call into Python frequently, then this strategy won't really work.
Hi,

The current version of Numpy releases the GIL as soon as possible. The usual macros for releasing and reacquiring the GIL (as advertised in the documentation) are used around every "step 4" workload wherever possible (for instance, this is the case for the universal functions that underlie a lot of numpy's functions).

Matthieu

--
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher
Do you have the code posted someplace for this? I'd like to add it into the tests I am running.
On May 15, 2008, at 6:54 PM, Eric Smith <eric+python-dev@trueblade.com> wrote:
Jesse Noller wrote:
Do you have the code posted someplace for this? I'd like to add it into the tests I am running
It would also be interesting to see how pyprocessing performs.
Eric.
I'm working on exactly that - I have some I/O work, Mandelbrot crunching, prime crunching, and other tests to run with threads, pyprocessing, parallel python (as soon as I figure out why it's hitting open file handle issues), and single-threaded.

-Jesse
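For the pyprocessing side of that comparison (pyprocessing later entered the stdlib as the multiprocessing module), the compute bound case might look like the following sketch. Since each worker is a separate OS process with its own interpreter and GIL, two workers can genuinely use both cores; the loop size is an arbitrary stand-in for real work:

```python
import multiprocessing
import time

def work(n):
    # Pure-Python compute-bound loop; run in a child process, it is
    # not limited by the parent's GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

def bench(nprocs, n=500000):
    # Time nprocs equal jobs spread over a pool of nprocs workers.
    start = time.time()
    pool = multiprocessing.Pool(nprocs)
    try:
        pool.map(work, [n] * nprocs)
    finally:
        pool.close()
        pool.join()
    return time.time() - start

if __name__ == "__main__":
    print("1 process: %.2fs" % bench(1))
    print("2 processes: %.2fs" % bench(2))
```

Note the pool pays a process start-up and result-pickling cost, so the jobs need to be big enough for the parallel speed-up to show.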
participants (7): Eric Smith, Greg Ewing, Hrvoje Nikšić, Jesse Noller, Matthieu Brucher, Nick Coghlan, Tom Pinckney