[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Sebastian Haase
seb.haase at gmail.com
Thu Feb 17 04:16:37 EST 2011
Eric,
thanks for insisting on this. I had noticed it when I first saw it,
but then forgot about it again ...
The new timings on my machine are:
$: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
$: python2.5 the_python_prog.py
c_threads 1 time 0.000897128582001
c_threads 2 time 0.000540800094604
c_threads 3 time 0.00035933971405
c_threads 4 time 0.000529370307922
c_threads 5 time 0.00049122095108
c_threads 6 time 0.000540502071381
c_threads 7 time 0.000580079555511
c_threads 8 time 0.000643739700317
c_threads 9 time 0.000622930526733
c_threads 10 time 0.000680360794067
c_threads 11 time 0.000613269805908
c_threads 12 time 0.000633401870728
That is, your OpenMP version is again fastest using 3 threads on my 4 core CPU.
It is now 2.34x faster than my non-OpenMP code (which is comparable
to scipy...cdist).
And it is (only !?) 7% slower than the non-OpenMP code when running
on 1 thread.
(The speedup of 3 threads vs. 1 thread is 2.5x.)
So, that is pretty good !! What I don't understand is why you
started your first post with
"I don't have the slightest idea what I'm doing"
;-)
Do you think one could get even better?
And where does the 7% slowdown (for a single thread) come from?
Is it possible to have the OpenMP option in the code without _any_
penalty on 1-core machines?
Thanks,
- Sebastian
On Thu, Feb 17, 2011 at 2:12 AM, Eric Carlson <ecarlson at eng.ua.edu> wrote:
> Sebastian,
> Optimization appears to be important here. I used no optimization in my
> previous post, so you could try the -O3 compile option:
>
> gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math
>
> for na=329 and nb=340 I get (about a 7.5x speedup):
> c_threads 1 time 0.00103106021881
> c_threads 2 time 0.000528309345245
> c_threads 3 time 0.000362541675568
> c_threads 4 time 0.00028993844986
> c_threads 5 time 0.000287840366364
> c_threads 6 time 0.000264899730682
> c_threads 7 time 0.000244019031525
> c_threads 8 time 0.000242137908936
> c_threads 9 time 0.000232398509979
> c_threads 10 time 0.000227460861206
> c_threads 11 time 0.00021938085556
> c_threads 12 time 0.000216970443726
> c_threads 13 time 0.000215198993683
> c_threads 14 time 0.00021940946579
> c_threads 15 time 0.000204219818115
> c_threads 16 time 0.000216958522797
> c_threads 17 time 0.000219728946686
> c_threads 18 time 0.000199990272522
> c_threads 19 time 0.000157492160797
> c_threads 20 time 0.000171000957489
> c_threads 21 time 0.000147500038147
> c_threads 22 time 0.000141770839691
> c_threads 23 time 0.000137741565704
>
> for na=3290 and nb=3400 (about an 11.5x speedup):
> c_threads 1 time 0.100258581638
> c_threads 2 time 0.0501346611977
> c_threads 3 time 0.0335096096992
> c_threads 4 time 0.0253720903397
> c_threads 5 time 0.0208190107346
> c_threads 6 time 0.0173784399033
> c_threads 7 time 0.0148811817169
> c_threads 8 time 0.0130474209785
> c_threads 9 time 0.011598110199
> c_threads 10 time 0.0104278612137
> c_threads 11 time 0.00950778007507
> c_threads 12 time 0.00870131969452
> c_threads 13 time 0.015882730484
> c_threads 14 time 0.0148504400253
> c_threads 15 time 0.0139465212822
> c_threads 16 time 0.0130301308632
> c_threads 17 time 0.012240819931
> c_threads 18 time 0.011567029953
> c_threads 19 time 0.0109891605377
> c_threads 20 time 0.0104281497002
> c_threads 21 time 0.00992572069168
> c_threads 22 time 0.00957406997681
> c_threads 23 time 0.00936627149582
>
>
> for na=329 and nb=340, cdist comes in at 0.00111914873123, which is
> 1.085x slower than the C version on my system.
>
> for na=3290 and nb=3400 cdist gives 0.143441538811
>
> Cheers,
> Eric
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>