[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Thu Feb 17 10:23:45 EST 2011

Hi,
More surprises:
shaase@iris:~/code/SwiggedDistOMP: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
shaase@iris:~/code/SwiggedDistOMP: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
shaase@iris:~/code/SwiggedDistOMP: priithon the_python_prog.py
c_threads 0  time  0.000437839031219    # this is now without "#pragma omp parallel for ..."
c_threads 1  time  0.000865449905396
c_threads 2  time  0.000520548820496
c_threads 3  time  0.00033704996109
c_threads 4  time  0.000620169639587
c_threads 5  time  0.000465350151062
c_threads 6  time  0.000696349143982

This corrects the timing comparison: the best OpenMP result (3 threads)
vs. no OpenMP is a speedup of (only!) 1.3x (0.000438 s / 0.000337 s),
not the 2.33x I got when comparing OpenMP to the cdist function.
The C code is now:

the_lib.c
------------------------------------------------------------------------------------------
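#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <math.h>

/* a_ps, b_ps: interleaved x,y coordinates of na and nb 2D points.
 * dist:       output, na*nb doubles, row-major.
 * num_threads: 0 = plain serial loop without OpenMP,
 *              >0 = OpenMP loop with that many threads. */
void dists2d(      double *a_ps, int na,
                   double *b_ps, int nb,
                   double *dist, int num_threads)
{
   int i, j;
   double ax, ay, dif_x, dif_y;
   int nx1 = 2;   /* doubles per point in a_ps */
   int nx2 = 2;   /* doubles per point in b_ps */

   if (num_threads > 0)
   {
      int dynamic = 0;
      omp_set_dynamic(dynamic);
      omp_set_num_threads(num_threads);

#pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
      for (i = 0; i < na; i++)
      {
         ax = a_ps[i*nx1];
         ay = a_ps[i*nx1+1];
         for (j = 0; j < nb; j++)
         {
            dif_x = ax - b_ps[j*nx2];
            dif_y = ay - b_ps[j*nx2+1];
            /* row i, column j of the na x nb result */
            dist[nb*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
         }
      }
   }
   else
   {
      /* same loop, compiled without any OpenMP pragma */
      for (i = 0; i < na; i++)
      {
         ax = a_ps[i*nx1];
         ay = a_ps[i*nx1+1];
         for (j = 0; j < nb; j++)
         {
            dif_x = ax - b_ps[j*nx2];
            dif_y = ay - b_ps[j*nx2+1];
            dist[nb*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
         }
      }
   }
}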
------------------------------------------------------------------
$ gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$ gcc -shared -o the_lib.so the_lib.o -lgomp -lm

So, I guess I found a way of getting rid of the OpenMP overhead when
running with 1 thread,
and found that - if measured correctly, using the same compiler settings
and so on - the speedup is so small that, once again, there is no point
in using OpenMP.
(For my case, having (only) 4 cores)
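
On "measured correctly": the timings above are well under a millisecond,
so a single run can be noisy. A minimal sketch of a best-of-N timer
follows. It is not the timing code behind the numbers above (those came
from the Python program); `wall_seconds`, `best_of` and `count_call` are
hypothetical names, and it assumes a C11 library that provides
`timespec_get`. In an OpenMP build, `omp_get_wtime()` could be used for
the clock instead.

```c
#include <time.h>

/* Wall-clock time in seconds, via C11 timespec_get (assumed available). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical helper: call fn(arg) `repeats` times and return the
 * shortest elapsed wall-clock time in seconds, or -1.0 if repeats < 1. */
double best_of(void (*fn)(void *), void *arg, int repeats)
{
    double best = -1.0;
    int r;
    for (r = 0; r < repeats; r++) {
        double t0, dt;
        t0 = wall_seconds();
        fn(arg);
        dt = wall_seconds() - t0;
        if (r == 0 || dt < best)
            best = dt;
    }
    return best;
}

/* Example workload for illustration: counts how often it was called. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

To time dists2d this way, the call would be wrapped in a small function
taking a `void *` to a struct holding its arguments, and each thread
count would be timed with the same number of repeats.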

Cheers,
Sebastian.

On Thu, Feb 17, 2011 at 10:57 AM, Matthieu Brucher
<matthieu.brucher at gmail.com> wrote:
>
>> Then, where does the overhead come from ? --
>> The call to    omp_set_dynamic(dynamic);
>> Or the
>> #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
>
> It may be this. You initialize a thread pool, even if it has only one
> thread, and there is the dynamic part, so OpenMP may create several chunks
> instead of one big chunk.
>
> Matthieu
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher