Python ctypes and OpenMP mystery

Hello All,

I have been toying with OpenMP through f2py and ctypes, and on the whole the results of my efforts have been very encouraging. That said, some results are a bit perplexing.

I have written identical routines that I run both directly as a C executable and through ctypes as a shared library. I am running the tests on a dual-Xeon Ubuntu system with 12 cores and 24 threads. The C executable is SLIGHTLY faster than the ctypes version at lower thread counts, but the C version eventually reaches a speedup ratio of 12+, while the Python version caps out around 7.7, as shown below:

threads   C speedup   Python speedup
      1      1.00          1.00
      2      2.07          1.98
      3      3.10          2.96
      4      4.11          3.93
      5      4.97          4.75
      6      5.94          5.54
      7      6.83          6.53
      8      7.78          7.30
      9      8.68          7.68
     10      9.62          7.42
     11     10.38          7.51
     12     10.44          7.26
     13      7.19          6.04
     14      7.70          5.73
     15      8.27          6.03
     16      8.81          6.29
     17      9.37          6.55
     18      9.90          6.67
     19     10.36          6.90
     20     10.98          7.01
     21     11.45          6.97
     22     11.92          7.10
     23     12.20          7.08

These ratios are quite consistent from 100 KB double arrays up to 100 MB double arrays, so I do not think they reflect Python call overhead. There is no question the routine is memory-bandwidth constrained, and I feel lucky to squeeze out the eventual 12+ ratio, but I am very perplexed as to why the performance of the Python-invoked routine seems to cap out.

Does anyone have an explanation for the cap? Am I seeing some effect from ctypes, or the Python engine, or what?

Cheers,
Eric
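PS: For reference, the call pattern on the Python side looks roughly like the sketch below. The library name (my_lib.so), the function name (do_work), and its signature are placeholders for illustration, not my actual routine:

    import ctypes

    import numpy as np

    # Load the OpenMP-enabled shared library (hypothetical name).
    lib = ctypes.CDLL("./my_lib.so")

    # Hypothetical C signature: void do_work(double *x, double *z, int n)
    lib.do_work.argtypes = [ctypes.POINTER(ctypes.c_double),
                            ctypes.POINTER(ctypes.c_double),
                            ctypes.c_int]
    lib.do_work.restype = None

    n = 10**7
    x = np.random.rand(n)
    z = np.empty_like(x)

    # Hand the numpy buffers straight to C; ctypes releases the GIL
    # during the call, so all OpenMP threading happens in the library.
    lib.do_work(x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                z.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                ctypes.c_int(n))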

On Saturday 12 February 2011 21:19:39, Eric Carlson wrote:
> Hello All,
>
> I have been toying with OpenMP through f2py and ctypes, and on the whole the results of my efforts have been very encouraging. That said, some results are a bit perplexing.
>
> I have written identical routines that I run both directly as a C executable and through ctypes as a shared library. I am running the tests on a dual-Xeon Ubuntu system with 12 cores and 24 threads. The C executable is SLIGHTLY faster than the ctypes version at lower thread counts, but the C version eventually reaches a speedup ratio of 12+, while the Python version caps out around 7.7, as shown below:
>
> threads   C speedup   Python speedup
>       1      1.00          1.00
>       2      2.07          1.98
>       3      3.10          2.96
>       4      4.11          3.93
>       5      4.97          4.75
>       6      5.94          5.54
>       7      6.83          6.53
>       8      7.78          7.30
>       9      8.68          7.68
>      10      9.62          7.42
>      11     10.38          7.51
>      12     10.44          7.26
>      13      7.19          6.04
>      14      7.70          5.73
>      15      8.27          6.03
>      16      8.81          6.29
>      17      9.37          6.55
>      18      9.90          6.67
>      19     10.36          6.90
>      20     10.98          7.01
>      21     11.45          6.97
>      22     11.92          7.10
>      23     12.20          7.08
>
> These ratios are quite consistent from 100 KB double arrays up to 100 MB double arrays, so I do not think they reflect Python call overhead. There is no question the routine is memory-bandwidth constrained, and I feel lucky to squeeze out the eventual 12+ ratio, but I am very perplexed as to why the performance of the Python-invoked routine seems to cap out.
>
> Does anyone have an explanation for the cap? Am I seeing some effect from ctypes, or the Python engine, or what?
It is difficult to tell what could be going on just by looking at the timings. Can you attach a small, self-contained benchmark? Not that I can offer a definitive answer, but I'm curious about this.

--
Francesc Alted

Hello Francesc,

The problem appears to be related to my lack of optimization in the compilation. If I use

    gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math

the C executable and the ctypes/Python versions behave almost identically. Getting decent behavior takes some thought, though; it is far from the incredible almost-automatic behavior of numexpr.

Now I've got to figure out how to scale up a bunch of vector adds/multiplies. Neither numexpr nor OpenMP gets you very far with a bunch of "z = a*x + b*y"-type calcs.

Cheers,
Eric
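PS: In case it helps anyone reproduce the comparison, a minimal timing harness might look like the sketch below (my_lib.so and do_work are the same placeholder names as before, and OMP_NUM_THREADS has to be set before the library is loaded so the OpenMP runtime picks it up):

    import ctypes
    import os
    import time

    import numpy as np

    # The thread count must be in the environment before the OpenMP
    # runtime inside the shared library initializes.
    os.environ["OMP_NUM_THREADS"] = "12"

    lib = ctypes.CDLL("./my_lib.so")  # hypothetical library name
    lib.do_work.argtypes = [ctypes.POINTER(ctypes.c_double),
                            ctypes.POINTER(ctypes.c_double),
                            ctypes.c_int]
    lib.do_work.restype = None

    n = 10**7
    x = np.random.rand(n)
    z = np.empty_like(x)
    xp = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
    zp = z.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

    # Best-of-five wall-clock time; the speedup ratio is this time at
    # one thread divided by this time at N threads.
    runs = []
    for _ in range(5):
        t0 = time.time()
        lib.do_work(xp, zp, ctypes.c_int(n))
        runs.append(time.time() - t0)
    print("best of 5: %.4f s" % min(runs))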

On Thursday 17 February 2011 02:24:33, Eric Carlson wrote:
> Hello Francesc,
>
> The problem appears to be related to my lack of optimization in the compilation. If I use
>
>     gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math
>
> the C executable and the ctypes/Python versions behave almost identically.
Ahh, good to know.
> Getting decent behavior takes some thought, though; it is far from the incredible almost-automatic behavior of numexpr.
numexpr uses a very simple method for distributing the load among threads, so I suppose this is why it is fast. The drawback is that numexpr can only be used for operations where all operands share the same index (i.e. things like a + b**3, but not things like a[i+1] + b[i]**3). For other operations, OpenMP is probably the best option (I should say the *easiest* option) right now.
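To make that concrete, a minimal sketch:

    import numexpr as ne
    import numpy as np

    a = np.random.rand(10**7)
    b = np.random.rand(10**7)

    # Same-index, element-wise expression: numexpr evaluates this in
    # parallel over cache-sized blocks.
    z = ne.evaluate("a + b**3")

    # A shifted-index expression like a[i+1] + b[i]**3 cannot be written
    # directly; you would have to pre-slice the operands into aligned
    # views first, which defeats much of the convenience.
    a1 = a[1:]
    b0 = b[:-1]
    z2 = ne.evaluate("a1 + b0**3")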
> Now I've got to figure out how to scale up a bunch of vector adds/multiplies. Neither numexpr nor OpenMP gets you very far with a bunch of "z = a*x + b*y"-type calcs.
For this sort of computation you are most probably hitting the memory bandwidth wall, so you are out of luck (at least until processors are fast enough for compression to actually reduce the time spent in computation).

Cheers,

--
Francesc Alted
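PS: To see the bandwidth wall for yourself, a quick sketch (the absolute timings will of course depend on the machine):

    import time

    import numexpr as ne
    import numpy as np

    n = 10**7
    a, b = 2.0, 3.0
    x = np.random.rand(n)
    y = np.random.rand(n)

    # z = a*x + b*y streams two input arrays and writes one output
    # while doing only three flops per element, so the memory bus,
    # not the cores, sets the speed; extra threads stop helping once
    # the bus saturates.
    t0 = time.time()
    z_np = a * x + b * y  # numpy: single-threaded, with temporaries
    t_np = time.time() - t0

    t0 = time.time()
    z_ne = ne.evaluate("a*x + b*y")  # numexpr: blocked, multi-threaded
    t_ne = time.time() - t0

    print("numpy:   %.4f s" % t_np)
    print("numexpr: %.4f s" % t_ne)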