[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Wes McKinney wesmckinn at gmail.com
Tue Feb 15 11:26:26 EST 2011


On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
<matthieu.brucher at gmail.com> wrote:
> Use directly restrict in C99 mode (__restrict does not have exactly the same
> semantics).
> For a valgrind profil, you can check my blog
> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
> Basically, if you have a python script, you can valgrind --optionsinmyblog
> python myscript.py
> For PAPI, you have to install several packages (perf module for kernel for
> instance) and a GUI to analyze the results (in Eclispe, it should be
> possible).
> Matthieu
> 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>>
>> Thanks Matthieu,
>> using __restrict__ with g++ did not change anything. How do I use
>> valgrind with C extensions?
>> I don't know what "PAPI profil" is ...?
>> -Sebastian
>>
>>
>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>> <matthieu.brucher at gmail.com> wrote:
>> > Hi,
>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>> > the
>> > only pointer to the specific memory location), and then declare dist_
>> > inside
>> > the first loop also with a restrict.
>> > Then, I would run valgrind or a PAPI profil on your code to see what
>> > causes
>> > the issue (false sharing, ...)
>> > Matthieu
>> >
>> > 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>> >>
>> >> Hi,
>> >> I assume that someone here could maybe help me, and I'm hoping it's
>> >> not too much off topic.
>> >> I have 2 arrays of 2d point coordinates and would like to calculate
>> >> all pairwise distances as fast as possible.
>> >> Going from Python/Numpy to a (Swigged) C extension already gave me a
>> >> 55x speedup.
>> >> (.9ms vs. 50ms for arrays of length 329 and 340).
>> >> I'm using gcc on Linux.
>> >> Now I'm wondering if I could go even faster !?
>> >> My hope that the compiler might automagically do some SSE2
>> >> optimization got disappointed.
>> >> Since I have a 4 core CPU I thought OpenMP might be an option;
>> >> I never used that, and after some playing around I managed to get
>> >> (only) 50% slowdown(!) :-(
>> >>
>> >> My code in short is this:
>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>> >> -------<Ccode> ----------
>> >> void dists2d(
>> >>                   double *a_ps, int nx1, int na,
>> >>                   double *b_ps, int nx2, int nb,
>> >>                   double *dist, int nx3, int ny3)  throw (char*)
>> >> {
>> >>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>> >>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>> >>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape
>> >> (na,nb)";
>> >>
>> >>  double *dist_;
>> >>  int i, j;
>> >>
>> >> #pragma omp parallel private(dist_, j, i)
>> >>  {
>> >> #pragma omp for nowait
>> >>        for(i=0;i<na;i++)
>> >>          {
>> >>                //num_threads=omp_get_num_threads();  --> 4
>> >>                dist_ = dist+i*nb;                 // dists_  is  only
>> >> introduced for OpenMP
>> >>                for(j=0;j<nb;j++)
>> >>                  {
>> >>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   -
>> >> b_ps[j*nx2]) +
>> >>
>> >>  sq(a_ps[i*nx1+1]
>> >> - b_ps[j*nx2+1]) );
>> >>                  }
>> >>          }
>> >>  }
>> >> }
>> >> -------</Ccode> ----------
>> >> There is probably a simple mistake in this code - as I said I never
>> >> used OpenMP before.
>> >> It should be not too difficult to use OpenMP correctly here
>> >>  or -  maybe better -
>> >> is there a simple SSE(2,3,4) version that might be even better than
>> >> OpenMP... !?
>> >>
>> >> I supposed, that I did not get the #pragma omp lines right - any idea ?
>> >> Or is it in general not possible to speed this kind of code up using
>> >> OpenMP !?
>> >>
>> >> Thanks,
>> >> Sebastian Haase
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion at scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>> >
>> > --
>> > Information System Engineer, Ph.D.
>> > Blog: http://matt.eifelle.com
>> > LinkedIn: http://www.linkedin.com/in/matthieubrucher
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>

As an aside, do you have a GPU-equipped machine? This function looks
like it would be an easy CUDA target. Depends on who the users of the
function are (e.g. do they also have the CUDA runtime) if whether you
wanted to go that route.



More information about the NumPy-Discussion mailing list