[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Sebastian Haase seb.haase at gmail.com
Tue Feb 15 11:33:13 EST 2011


Wes,
I think I have a couple of GPUs available. I would be up for anything
... if you think I could do some easy(!) CUDA programming here,
maybe you could point me in the right direction...
Thanks,
Sebastian.


On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
> <matthieu.brucher at gmail.com> wrote:
>> Use restrict directly in C99 mode (__restrict does not have exactly the
>> same semantics).
>> For a valgrind profile, you can check my blog
>> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
>> Basically, if you have a Python script, you can run valgrind
>> --optionsinmyblog python myscript.py
>> For PAPI, you have to install several packages (the perf module for the
>> kernel, for instance) and a GUI to analyze the results (in Eclipse, it
>> should be possible).
>> Matthieu
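
A minimal sketch of what this restrict suggestion could look like on the
dists2d signature quoted further down, assuming the extension were built as
C99 (in C++ mode, g++ only offers the non-standard __restrict__ spelling,
with slightly different guarantees):

/* restrict promises the compiler that a_ps, b_ps and dist never alias
   each other, which can enable better optimization; sketch only. */
void dists2d(const double *restrict a_ps, int nx1, int na,
             const double *restrict b_ps, int nx2, int nb,
             double *restrict dist,       int nx3, int ny3);
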
>> 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>>>
>>> Thanks Matthieu,
>>> using __restrict__ with g++ did not change anything. How do I use
>>> valgrind with C extensions?
>>> I don't know what a "PAPI profile" is ...?
>>> -Sebastian
>>>
>>>
>>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>>> <matthieu.brucher at gmail.com> wrote:
>>> > Hi,
>>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>>> > the only pointer to that specific memory location), and then declare
>>> > dist_ inside the first loop with restrict as well.
>>> > Then, I would run valgrind or a PAPI profile on your code to see what
>>> > causes the issue (false sharing, ...).
>>> > Matthieu
>>> >
>>> > 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>>> >>
>>> >> Hi,
>>> >> I hope someone here can help me, and that this is not too far off topic.
>>> >> I have 2 arrays of 2D point coordinates and would like to calculate
>>> >> all pairwise distances as fast as possible.
>>> >> Going from Python/NumPy to a (SWIGged) C extension already gave me a
>>> >> 55x speedup (0.9 ms vs. 50 ms for arrays of length 329 and 340).
>>> >> I'm using gcc on Linux.
>>> >> Now I'm wondering if I could go even faster !?
>>> >> My hope that the compiler might automagically do some SSE2
>>> >> optimization was disappointed.
>>> >> Since I have a 4-core CPU I thought OpenMP might be an option;
>>> >> I had never used it, and after some playing around I only managed to
>>> >> get a 50% slowdown(!) :-(
>>> >>
>>> >> My code in short is this:
>>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>>> >> -------<Ccode> ----------
>>> >> void dists2d(
>>> >>                   double *a_ps, int nx1, int na,
>>> >>                   double *b_ps, int nx2, int nb,
>>> >>                   double *dist, int nx3, int ny3)  throw (char*)
>>> >> {
>>> >>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>>> >>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>>> >>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";
>>> >>
>>> >>  double *dist_;
>>> >>  int i, j;
>>> >>
>>> >> #pragma omp parallel private(dist_, j, i)
>>> >>  {
>>> >> #pragma omp for nowait
>>> >>        for(i=0;i<na;i++)
>>> >>          {
>>> >>                //num_threads=omp_get_num_threads();  --> 4
>>> >>                dist_ = dist+i*nb;                 // dist_ is only introduced for OpenMP
>>> >>                for(j=0;j<nb;j++)
>>> >>                  {
>>> >>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
>>> >>                                          sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
>>> >>                  }
>>> >>          }
>>> >>  }
>>> >> }
>>> >> -------</Ccode> ----------
>>> >> There is probably a simple mistake in this code - as I said, I have
>>> >> never used OpenMP before.
>>> >> It should not be too difficult to use OpenMP correctly here,
>>> >> or - maybe better -
>>> >> is there a simple SSE(2,3,4) version that might be even faster than
>>> >> OpenMP... !?
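
On the SSE question, a rough, untested SSE2 sketch of what the inner loop
could look like, computing two double-precision distances per iteration.
The argument names and the (n,2) layout (nx1 == nx2 == 2) are taken from
dists2d above; dists2d_sse2 is just a hypothetical helper name, and the
file would need -msse2 (implied on x86-64):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <math.h>

/* Hypothetical SSE2 variant: two distances per iteration, scalar tail
   for odd nb.  Assumes the same (n,2) layout as dists2d above. */
static void dists2d_sse2(const double *a_ps, int na,
                         const double *b_ps, int nb,
                         double *dist)
{
    for (int i = 0; i < na; i++) {
        const __m128d ax = _mm_set1_pd(a_ps[2*i]);      /* broadcast a[i].x */
        const __m128d ay = _mm_set1_pd(a_ps[2*i + 1]);  /* broadcast a[i].y */
        double *row = dist + i*nb;
        int j = 0;
        for (; j + 1 < nb; j += 2) {
            /* gather x and y of b[j] and b[j+1] into one register each */
            __m128d bx = _mm_set_pd(b_ps[2*(j+1)],     b_ps[2*j]);
            __m128d by = _mm_set_pd(b_ps[2*(j+1) + 1], b_ps[2*j + 1]);
            __m128d dx = _mm_sub_pd(ax, bx);
            __m128d dy = _mm_sub_pd(ay, by);
            __m128d d  = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
                                                _mm_mul_pd(dy, dy)));
            _mm_storeu_pd(&row[j], d);
        }
        for (; j < nb; j++) {            /* scalar tail */
            double dx = a_ps[2*i]     - b_ps[2*j];
            double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
            row[j] = sqrt(dx*dx + dy*dy);
        }
    }
}

Whether this actually beats gcc's auto-vectorized output is something only
a measurement can tell; with roughly 330 x 340 points the whole computation
is quite small.
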
>>> >>
>>> >> I suppose I did not get the #pragma omp lines right - any ideas?
>>> >> Or is it in general not possible to speed up this kind of code using
>>> >> OpenMP !?
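
On the pragma question, here is a minimal, untested variant of the loop nest
inside dists2d that keeps all per-thread state local to the loop body, so no
private(...) clause is needed at all (assuming C99 or C++, sq(x) being x*x
as in the original, and compilation with -fopenmp):

/* Sketch: parallelize over rows of dist; everything declared inside the
   loop is automatically private to each thread. */
#pragma omp parallel for schedule(static)
for (int i = 0; i < na; i++) {
    const double ax = a_ps[i*nx1];
    const double ay = a_ps[i*nx1 + 1];
    double *row = dist + i*nb;          /* each thread writes its own rows */
    for (int j = 0; j < nb; j++) {
        const double dx = ax - b_ps[j*nx2];
        const double dy = ay - b_ps[j*nx2 + 1];
        row[j] = sqrt(dx*dx + dy*dy);
    }
}

Even with a correct pragma, arrays of only a few hundred points give each
thread very little work, so thread start-up and scheduling overhead can
easily eat the gain; a slowdown is not necessarily a sign of a wrong pragma.
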
>>> >>
>>> >> Thanks,
>>> >> Sebastian Haase
>>> >
>>> >
>>> >
>>> > --
>>> > Information System Engineer, Ph.D.
>>> > Blog: http://matt.eifelle.com
>>> > LinkedIn: http://www.linkedin.com/in/matthieubrucher
>>> >
>>> >
>>> >
>>
>>
>>
>> --
>> Information System Engineer, Ph.D.
>> Blog: http://matt.eifelle.com
>> LinkedIn: http://www.linkedin.com/in/matthieubrucher
>>
>>
>>
>
> As an aside, do you have a GPU-equipped machine? This function looks
> like it would be an easy CUDA target. Whether you'd want to go that route
> depends on who the users of the function are (e.g. do they also have the
> CUDA runtime).
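
To make "easy CUDA target" concrete, a hypothetical CUDA C++ kernel sketch
with one thread per (i, j) pair; the names follow the dists2d arguments, and
d_a, d_b, d_dist are assumed to be device copies of the arrays:

// Untested sketch: each thread computes one entry of the (na, nb) result.
__global__ void dists2d_kernel(const double *a_ps, int na,
                               const double *b_ps, int nb,
                               double *dist)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row index into a */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column index into b */
    if (i < na && j < nb) {
        double dx = a_ps[2*i]     - b_ps[2*j];
        double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
        dist[i*nb + j] = sqrt(dx*dx + dy*dy);
    }
}

// Host side (sketch): copy a_ps and b_ps to the device, then launch e.g.
//   dim3 block(16, 16);
//   dim3 grid((nb + 15) / 16, (na + 15) / 16);
//   dists2d_kernel<<<grid, block>>>(d_a, na, d_b, nb, d_dist);
// and copy dist back.  Double precision needs compute capability >= 1.3.
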
>


