[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Eric Carlson ecarlson at eng.ua.edu
Wed Feb 16 20:12:27 EST 2011


Sebastian,
Compiler optimization appears to be important here. I compiled without any
optimization flags in my previous post, so you could try the -O3 compile option:

  gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math

For na=329 and nb=340 I get (about a 7.5x speedup at 23 threads relative to 1 thread):
c_threads 1  time  0.00103106021881
c_threads 2  time  0.000528309345245
c_threads 3  time  0.000362541675568
c_threads 4  time  0.00028993844986
c_threads 5  time  0.000287840366364
c_threads 6  time  0.000264899730682
c_threads 7  time  0.000244019031525
c_threads 8  time  0.000242137908936
c_threads 9  time  0.000232398509979
c_threads 10  time  0.000227460861206
c_threads 11  time  0.00021938085556
c_threads 12  time  0.000216970443726
c_threads 13  time  0.000215198993683
c_threads 14  time  0.00021940946579
c_threads 15  time  0.000204219818115
c_threads 16  time  0.000216958522797
c_threads 17  time  0.000219728946686
c_threads 18  time  0.000199990272522
c_threads 19  time  0.000157492160797
c_threads 20  time  0.000171000957489
c_threads 21  time  0.000147500038147
c_threads 22  time  0.000141770839691
c_threads 23  time  0.000137741565704

For na=3290 and nb=3400 (about an 11.5x speedup at 12 threads, the fastest run, relative to 1 thread):
c_threads 1  time  0.100258581638
c_threads 2  time  0.0501346611977
c_threads 3  time  0.0335096096992
c_threads 4  time  0.0253720903397
c_threads 5  time  0.0208190107346
c_threads 6  time  0.0173784399033
c_threads 7  time  0.0148811817169
c_threads 8  time  0.0130474209785
c_threads 9  time  0.011598110199
c_threads 10  time  0.0104278612137
c_threads 11  time  0.00950778007507
c_threads 12  time  0.00870131969452
c_threads 13  time  0.015882730484
c_threads 14  time  0.0148504400253
c_threads 15  time  0.0139465212822
c_threads 16  time  0.0130301308632
c_threads 17  time  0.012240819931
c_threads 18  time  0.011567029953
c_threads 19  time  0.0109891605377
c_threads 20  time  0.0104281497002
c_threads 21  time  0.00992572069168
c_threads 22  time  0.00957406997681
c_threads 23  time  0.00936627149582
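
The speedup figures quoted for the two tables above can be checked with a few lines of Python. This is only a sketch: the timings are copied from the 1-, 12- and 23-thread rows, and the helper names `speedup` and `efficiency` are made up for illustration; they are not part of my_lib.c or of any library.

```python
# Timings copied from the tables above (1, 12 and 23 threads only).
small = {1: 0.00103106021881, 12: 0.000216970443726, 23: 0.000137741565704}
large = {1: 0.100258581638, 12: 0.00870131969452, 23: 0.00936627149582}


def speedup(times, n):
    """Speedup of the n-thread run relative to the 1-thread run."""
    return times[1] / times[n]


def efficiency(times, n):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup(times, n) / n


if __name__ == "__main__":
    for label, times in (("na=329, nb=340", small), ("na=3290, nb=3400", large)):
        for n in sorted(times):
            print("%s  threads=%2d  speedup=%.2f  efficiency=%.2f"
                  % (label, n, speedup(times, n), efficiency(times, n)))
```

For the larger problem the best time is at 12 threads, so that is the row the 11.5 figure refers to; for the smaller problem the best time is at 23 threads.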


For na=329 and nb=340, cdist comes in at 0.00111914873123, which is about
1.085x slower than the single-threaded C version on my system.

For na=3290 and nb=3400, cdist gives 0.143441538811, which is about 1.43x slower than the single-threaded C version.
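
For anyone who wants to reproduce the cdist numbers, here is a rough sketch of how such a timing could be taken. The point dimension (`dim=3`), the random input data, the default Euclidean metric, and the helper name `time_cdist` are all assumptions for illustration; this is not necessarily how the figures above were measured, and results will vary between machines.

```python
import timeit

import numpy as np
from scipy.spatial.distance import cdist


def time_cdist(na, nb, dim=3, repeats=5, seed=0):
    """Return (best wall time of one cdist call, shape of the result)."""
    rng = np.random.default_rng(seed)
    a = rng.random((na, dim))  # na points, dim coordinates each (dim is assumed)
    b = rng.random((nb, dim))  # nb points
    # Warm-up call; also records the (na, nb) shape of the distance matrix.
    shape = cdist(a, b).shape
    # Take the best of several single calls to reduce timing noise.
    best = min(timeit.repeat(lambda: cdist(a, b), number=1, repeat=repeats))
    return best, shape


if __name__ == "__main__":
    for na, nb in ((329, 340), (3290, 3400)):
        best, shape = time_cdist(na, nb)
        print("na=%d nb=%d  cdist time %r  result shape %r" % (na, nb, best, shape))
```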

Cheers,
Eric




