
(I originally posted this in comp.lang.python and was redirected here)

In a quest to speed up numarray computations, I tried writing a 'threaded array' class for use on SMP systems that would distribute its workload across the processors. I hit a snag when I found out that since the Python interpreter is not reentrant, this effectively disables parallel processing in Python. I've come up with two solutions to this problem, both involving numarray's C functions that perform the actual vector operations:

1) Surround the C vector operations with Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, thus allowing the vector operations (which don't access Python structures) to run in parallel with the interpreter. Python glue code would take care of threading and locking.

2) Move the parallelization into the C vector functions themselves. This would likely get poorer performance (a chain of vector operations couldn't be combined into one threaded operation).

I'd much rather do #1, but will playing around with the interpreter state like that cause any problems?

Update from original posting:

I've partially implemented method #1 for Float64s. Running on four 2.4GHz Xeons (possibly two with hyperthreading?), I get about a 30% speedup while dividing 10 million Float64s, but a small (<10%) slowdown doing addition or multiplication. The operation was repeated 100 times, with the threads created outside of the loop (i.e. the threads weren't recreated for each iteration). Is there really that much overhead in Python? I can post the code I'm using and the numarray patch if it's requested.
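[Editor's note: a minimal Python-side sketch of what the glue for approach #1 might look like, assuming the numarray C ufunc loops have been patched to release the GIL with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS, and assuming numarray's divide ufunc accepts an output array as a third argument, as in Numeric. The name threaded_divide and the chunking strategy are illustrative, not taken from the actual patch.]

    import threading
    import numarray

    def threaded_divide(a, b, out, nthreads=4):
        """Divide a by b elementwise into out, splitting the work across threads.

        This only pays off if the underlying C loop releases the GIL;
        otherwise the worker threads just take turns holding the
        interpreter lock.
        """
        n = len(a)
        chunk = (n + nthreads - 1) // nthreads
        workers = []
        for i in range(nthreads):
            lo, hi = i * chunk, min((i + 1) * chunk, n)
            # Slices of numarray arrays are views, so each worker writes
            # its results directly into the corresponding piece of `out`.
            t = threading.Thread(target=numarray.divide,
                                 args=(a[lo:hi], b[lo:hi], out[lo:hi]))
            workers.append(t)
            t.start()
        for t in workers:
            t.join()

[With persistent worker threads, as in the test described above, the per-call thread startup cost would also disappear; this sketch recreates the threads on every call for simplicity.]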

Christopher T King wrote:
(I originally posted this in comp.lang.python and was redirected here)
In a quest to speed up numarray computations, I tried writing a 'threaded array' class for use on SMP systems that would distribute its workload across the processors. I hit a snag when I found out that since the Python interpreter is not reentrant, this effectively disables parallel processing in Python. I've come up with two solutions to this problem, both involving numarray's C functions that perform the actual vector operations:
1) Surround the C vector operations with Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, thus allowing the vector operations (which don't access Python structures) to run in parallel with the interpreter. Python glue code would take care of threading and locking.
2) Move the parallelization into the C vector functions themselves. This would likely get poorer performance (a chain of vector operations couldn't be combined into one threaded operation).
I'd much rather do #1, but will playing around with the interpreter state like that cause any problems?
I don't think so, but it raises a number of questions that I ask just below.
Update from original posting:
I've partially implemented method #1 for Float64s. Running on four 2.4GHz Xeons (possibly two with hyperthreading?), I get about a 30% speedup while dividing 10 million Float64s, but a small (<10%) slowdown doing addition or multiplication. The operation was repeated 100 times, with the threads created outside of the loop (i.e. the threads weren't recreated for each iteration). Is there really that much overhead in Python? I can post the code I'm using and the numarray patch if it's requested.
Questions and comments:

1) I suppose you did this for generated ufunc code? (Ideally one would put this in the code-generator stuff, but for the purposes of testing it would be fine.) I guess we would like to see how you actually changed the code fragment (you can email me or Todd Miller directly if you wish).

2) How much improvement you would see depends on many details. But if you were doing this for 10 million element arrays, I'm surprised you saw such a small improvement (30% for 4 processors isn't worth the trouble, it would seem). So seeing the actual test code would be helpful. If the array operations you are doing for numarray aren't simple (that's a specialized use of the word; by that I mean if the arrays are not of the same type, aren't contiguous, aren't aligned, or aren't of proper byte order), then there are a number of other issues that may slow it down quite a bit (and there are ways of improving these for parallel processing).

3) I don't speak as an expert on threading or parallel processors, but I believe that so long as you don't call any Python API functions (either directly or indirectly) between the global interpreter lock release and reacquisition, you should be fine. The vector ufunc code in numarray should satisfy this fine.

Perry Greenfield
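[Editor's note: to make point 2 concrete, here is a rough sketch of the kind of check a parallel fast path might perform before committing to the simple-array case. The predicate method names (type, iscontiguous, isaligned, isbyteswapped) are recalled from the numarray API and are assumptions that should be verified.]

    def arrays_are_simple(*arrays):
        """Return True if every array is contiguous, aligned, in native
        byte order, and all share the same element type."""
        first_type = arrays[0].type()
        for a in arrays:
            if a.type() != first_type:
                return False
            if not a.iscontiguous() or not a.isaligned() or a.isbyteswapped():
                return False
        return True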

On Thu, 1 Jul 2004, Perry Greenfield wrote:
1) I suppose you did this for generated ufunc code? (Ideally one would put this in the code-generator stuff, but for the purposes of testing it would be fine.) I guess we would like to see how you actually changed the code fragment (you can email me or Todd Miller directly if you wish).
Yep, I didn't know it was automatically generated :P
2) How much improvement you would see depends on many details. But if you were doing this for 10 million element arrays, I'm surprised you saw such a small improvement (30% for 4 processors isn't worth the trouble, it would seem). So seeing the actual test code would be helpful. If the array operations you are doing for numarray aren't simple (that's a specialized use of the word; by that I mean if the arrays are not of the same type, aren't contiguous, aren't aligned, or aren't of proper byte order), then there are a number of other issues that may slow it down quite a bit (and there are ways of improving these for parallel processing).
I've been careful not to use anything that would cause discontiguities in the arrays, and to keep them all the same type (Float64 in this case). See my next post for the code I'm using.
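[Editor's note: this is not the test code referenced above, just an illustrative timing harness with the same shape as the experiment described earlier: 100 repetitions of a 10-million-element Float64 divide. It relies on the hypothetical threaded_divide sketched near the top of the thread, and the numarray constructor signatures are recalled from memory; unlike the original test, this version recreates the worker threads on each call rather than keeping them alive across iterations.]

    import time
    import numarray

    N, REPS = 10000000, 100
    a = numarray.ones(N, numarray.Float64)
    b = numarray.ones(N, numarray.Float64) * 2.0
    out = numarray.zeros(N, numarray.Float64)

    start = time.time()
    for _ in range(REPS):
        numarray.divide(a, b, out)               # single-threaded baseline
    serial = time.time() - start

    start = time.time()
    for _ in range(REPS):
        threaded_divide(a, b, out, nthreads=4)   # threaded version from above
    threaded = time.time() - start

    print("serial: %.2fs  threaded: %.2fs  speedup: %.2fx"
          % (serial, threaded, serial / threaded))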