[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Sebastian Haase
seb.haase at gmail.com
Thu Feb 17 04:16:37 EST 2011
Eric,
thanks for insisting on this. I had noticed it when I first saw it,
but then forgot about it again ...
The new timings on my machine are:
$: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
$: python2.5 the_python_prog.py
c_threads 1 time 0.000897128582001
c_threads 2 time 0.000540800094604
c_threads 3 time 0.00035933971405
c_threads 4 time 0.000529370307922
c_threads 5 time 0.00049122095108
c_threads 6 time 0.000540502071381
c_threads 7 time 0.000580079555511
c_threads 8 time 0.000643739700317
c_threads 9 time 0.000622930526733
c_threads 10 time 0.000680360794067
c_threads 11 time 0.000613269805908
c_threads 12 time 0.000633401870728
That is, your OpenMP version is again fastest using 3 threads on my 4 core CPU.
It is now 2.34x faster than my non-OpenMP code (which is comparable
to scipy...cdist).
And it is (only !?) 7% slower than the non-OpenMP code when running
on 1 thread.
(The speedup of 3 threads vs. 1 thread is 2.5x.)
So, that is pretty good !! What I don't understand is why you
started your first post with
"I don't have the slightest idea what I'm doing"
;-)
Do you think one could get even better?
And where does the 7% slowdown (for a single thread) come from?
Is it possible to have the OpenMP option in the code without _any_
penalty on 1-core machines?
Thanks,
- Sebastian
On Thu, Feb 17, 2011 at 2:12 AM, Eric Carlson <ecarlson at eng.ua.edu> wrote:
> Sebastian,
> Optimization appears to be important here. I used no optimization in my
> previous post, so you could try the -O3 compile option:
>
> gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math
>
> for na=329 and nb=340 I get (about a 7.5x speedup):
> c_threads 1 time 0.00103106021881
> c_threads 2 time 0.000528309345245
> c_threads 3 time 0.000362541675568
> c_threads 4 time 0.00028993844986
> c_threads 5 time 0.000287840366364
> c_threads 6 time 0.000264899730682
> c_threads 7 time 0.000244019031525
> c_threads 8 time 0.000242137908936
> c_threads 9 time 0.000232398509979
> c_threads 10 time 0.000227460861206
> c_threads 11 time 0.00021938085556
> c_threads 12 time 0.000216970443726
> c_threads 13 time 0.000215198993683
> c_threads 14 time 0.00021940946579
> c_threads 15 time 0.000204219818115
> c_threads 16 time 0.000216958522797
> c_threads 17 time 0.000219728946686
> c_threads 18 time 0.000199990272522
> c_threads 19 time 0.000157492160797
> c_threads 20 time 0.000171000957489
> c_threads 21 time 0.000147500038147
> c_threads 22 time 0.000141770839691
> c_threads 23 time 0.000137741565704
>
> for na=3290 and nb=3400 (about an 11.5x speedup):
> c_threads 1 time 0.100258581638
> c_threads 2 time 0.0501346611977
> c_threads 3 time 0.0335096096992
> c_threads 4 time 0.0253720903397
> c_threads 5 time 0.0208190107346
> c_threads 6 time 0.0173784399033
> c_threads 7 time 0.0148811817169
> c_threads 8 time 0.0130474209785
> c_threads 9 time 0.011598110199
> c_threads 10 time 0.0104278612137
> c_threads 11 time 0.00950778007507
> c_threads 12 time 0.00870131969452
> c_threads 13 time 0.015882730484
> c_threads 14 time 0.0148504400253
> c_threads 15 time 0.0139465212822
> c_threads 16 time 0.0130301308632
> c_threads 17 time 0.012240819931
> c_threads 18 time 0.011567029953
> c_threads 19 time 0.0109891605377
> c_threads 20 time 0.0104281497002
> c_threads 21 time 0.00992572069168
> c_threads 22 time 0.00957406997681
> c_threads 23 time 0.00936627149582
>
>
> for na=329 and nb=340, cdist comes in at 0.00111914873123, which is
> 1.085x slower than the C version on my system.
>
> for na=3290 and nb=3400 cdist gives 0.143441538811
>
> Cheers,
> Eric
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>