[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Matthieu Brucher matthieu.brucher at gmail.com
Tue Feb 15 10:54:55 EST 2011


Hi,

My first move would be to add a restrict keyword to dist (i.e. dist is the
only pointer to the specific memory location), and then declare dist_ inside
the first loop also with a restrict.
Then, I would run valgrind or a PAPI profil on your code to see what causes
the issue (false sharing, ...)

Matthieu

2011/2/15 Sebastian Haase <seb.haase at gmail.com>

> Hi,
> I assume that someone here could maybe help me, and I'm hoping it's
> not too much off topic.
> I have 2 arrays of 2d point coordinates and would like to calculate
> all pairwise distances as fast as possible.
> Going from Python/Numpy to a (Swigged) C extension already gave me a
> 55x speedup.
> (.9ms vs. 50ms for arrays of length 329 and 340).
> I'm using gcc on Linux.
> Now I'm wondering if I could go even faster !?
> My hope that the compiler might automagically do some SSE2
> optimization got disappointed.
> Since I have a 4 core CPU I thought OpenMP might be an option;
> I never used that, and after some playing around I managed to get
> (only) 50% slowdown(!) :-(
>
> My code in short is this:
> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
> -------<Ccode> ----------
> void dists2d(
>                   double *a_ps, int nx1, int na,
>                   double *b_ps, int nx2, int nb,
>                   double *dist, int nx3, int ny3)  throw (char*)
> {
>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";
>
>  double *dist_;
>  int i, j;
>
> #pragma omp parallel private(dist_, j, i)
>  {
> #pragma omp for nowait
>        for(i=0;i<na;i++)
>          {
>                //num_threads=omp_get_num_threads();  --> 4
>                dist_ = dist+i*nb;                 // dists_  is  only
> introduced for OpenMP
>                for(j=0;j<nb;j++)
>                  {
>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
>                                                          sq(a_ps[i*nx1+1] -
> b_ps[j*nx2+1]) );
>                  }
>          }
>  }
> }
> -------</Ccode> ----------
> There is probably a simple mistake in this code - as I said I never
> used OpenMP before.
> It should be not too difficult to use OpenMP correctly here
>  or -  maybe better -
> is there a simple SSE(2,3,4) version that might be even better than
> OpenMP... !?
>
> I supposed, that I did not get the #pragma omp lines right - any idea ?
> Or is it in general not possible to speed this kind of code up using OpenMP
> !?
>
> Thanks,
> Sebastian Haase
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>



-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110215/4ee709b8/attachment.html>


More information about the NumPy-Discussion mailing list