On Saturday, 12 February 2011 21:19:39, Eric Carlson wrote:
Hello All, I have been toying with OpenMP through f2py and ctypes. On the whole, the results of my efforts have been very encouraging. That said, some results are a bit perplexing.
I have written an identical routine that I run both directly as a compiled C executable and, through ctypes, as a shared library. I am running the tests on a dual-Xeon Ubuntu system with 12 cores and 24 threads. The C executable is SLIGHTLY faster than the ctypes version at lower thread counts, but the C version eventually reaches a speedup ratio of 12+, while the Python-invoked version caps off at about 7.7, as shown below:
threads   C speedup   Python speedup
   1         1.00          1.00
   2         2.07          1.98
   3         3.10          2.96
   4         4.11          3.93
   5         4.97          4.75
   6         5.94          5.54
   7         6.83          6.53
   8         7.78          7.30
   9         8.68          7.68
  10         9.62          7.42
  11        10.38          7.51
  12        10.44          7.26
  13         7.19          6.04
  14         7.70          5.73
  15         8.27          6.03
  16         8.81          6.29
  17         9.37          6.55
  18         9.90          6.67
  19        10.36          6.90
  20        10.98          7.01
  21        11.45          6.97
  22        11.92          7.10
  23        12.20          7.08
These ratios are quite consistent from 100 KB double arrays up to 100 MB double arrays, so I do not think this reflects Python call overhead. There is no question the routine is memory-bandwidth constrained, and I feel lucky to squeeze out the eventual 12+ ratio, but I am very perplexed as to why the performance of the Python-invoked routine caps off.
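To make the divergence concrete, here is the parallel efficiency (speedup divided by thread count) implied by a few rows of the table above; the speedup numbers are copied from the reported results, and the row selection is arbitrary:

```python
# Parallel efficiency (speedup / threads) for selected rows of the
# reported table: the C build stays reasonably efficient up to 12
# threads, while the ctypes build falls off much earlier.
c_speedup = {1: 1.00, 6: 5.94, 12: 10.44, 23: 12.20}
py_speedup = {1: 1.00, 6: 5.54, 12: 7.26, 23: 7.08}

for n in sorted(c_speedup):
    c_eff = c_speedup[n] / n
    py_eff = py_speedup[n] / n
    print(f"{n:2d} threads: C efficiency {c_eff:.2f}, ctypes efficiency {py_eff:.2f}")
```

At 12 threads the C build is still near 87% efficiency while the ctypes build is around 60%, which is what makes the cap look like more than ordinary bandwidth saturation.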
Does anyone have an explanation for the caps? Am I seeing some effect from ctypes, from the Python interpreter, or something else?
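For reference, the ctypes side of such a test is presumably set up along these lines. The actual routine and library names are not given in the post, so libm stands in for the user-compiled OpenMP library to keep the sketch self-contained; a real build would look something like `gcc -O2 -fopenmp -fPIC -shared routine.c -o libroutine.so`, loaded with `ctypes.CDLL("./libroutine.so")`:

```python
import ctypes
import ctypes.util

# libm stands in here for a user-compiled OpenMP shared library
# (e.g. built with: gcc -O2 -fopenmp -fPIC -shared routine.c -o libroutine.so,
# then loaded with ctypes.CDLL("./libroutine.so")).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declaring argument and return types explicitly avoids silent
# int/double mismatches, which would skew a benchmark like this one.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))
```

If the thread count is varied from within Python rather than from the shell, OMP_NUM_THREADS generally has to be set in `os.environ` before the OpenMP runtime initializes (i.e. before the first parallel region runs) for it to take effect.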
It is difficult to tell what could be going on from the timings alone. Can you attach a small, self-contained benchmark? Not that I can offer a definitive answer, but I'm curious about this.

-- 
Francesc Alted