[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Tue Feb 15 15:11:12 EST 2011

Hi,

On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase <seb.haase at gmail.com>wrote:

> Hi,
> I assume that someone here could maybe help me, and I'm hoping it's
> not too much off topic.
> I have 2 arrays of 2d point coordinates and would like to calculate
> all pairwise distances as fast as possible.
> Going from Python/Numpy to a (Swigged) C extension already gave me a
> 55x speedup.
> (.9ms vs. 50ms for arrays of length 329 and 340).
>
With my very modest machine (Intel Dual CPU E2200, 2.2 Ghz) utilizing
scipy.spatial.distance.pdist will take some 1.5 ms for such arrays. Even the
simple linear algebra based full matrix calculations can be done less than 5
ms.

My two cents,
eat

> I'm using gcc on Linux.
> Now I'm wondering if I could go even faster !?
> My hope that the compiler might automagically do some SSE2
> optimization got disappointed.
> Since I have a 4 core CPU I thought OpenMP might be an option;
> I never used that, and after some playing around I managed to get
> (only) 50% slowdown(!) :-(
>
> My code in short is this:
> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
> -------<Ccode> ----------
> void dists2d(
>                   double *a_ps, int nx1, int na,
>                   double *b_ps, int nx2, int nb,
>                   double *dist, int nx3, int ny3)  throw (char*)
> {
>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";
>
>  double *dist_;
>  int i, j;
>
> #pragma omp parallel private(dist_, j, i)
>  {
> #pragma omp for nowait
>        for(i=0;i<na;i++)
>          {
>                //num_threads=omp_get_num_threads();  --> 4
>                dist_ = dist+i*nb;                 // dists_  is  only
> introduced for OpenMP
>                for(j=0;j<nb;j++)
>                  {
>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
>                                                          sq(a_ps[i*nx1+1] -
> b_ps[j*nx2+1]) );
>                  }
>          }
>  }
> }
> -------</Ccode> ----------
> There is probably a simple mistake in this code - as I said I never
> used OpenMP before.
> It should be not too difficult to use OpenMP correctly here
>  or -  maybe better -
> is there a simple SSE(2,3,4) version that might be even better than
> OpenMP... !?
>
> I supposed, that I did not get the #pragma omp lines right - any idea ?
> Or is it in general not possible to speed this kind of code up using OpenMP
> !?
>
> Thanks,
> Sebastian Haase
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110215/520aa7f4/attachment.html>