[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Tue Feb 15 11:48:05 EST 2011

On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase <seb.haase at gmail.com> wrote:
> Wes,
> I think I should have a couple of GPUs. I would be ready for anything
> ... if you think that I could do some easy(!) CUDA programming here,
> maybe you could guide me into the right direction...
> Thanks,
> Sebastian.
>
>
> On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
>> <matthieu.brucher at gmail.com> wrote:
>>> Use directly restrict in C99 mode (__restrict does not have exactly the same
>>> semantics).
>>> For a valgrind profil, you can check my blog
>>> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
>>> Basically, if you have a python script, you can valgrind --optionsinmyblog
>>> python myscript.py
>>> For PAPI, you have to install several packages (perf module for kernel for
>>> instance) and a GUI to analyze the results (in Eclispe, it should be
>>> possible).
>>> Matthieu
>>> 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>>>>
>>>> Thanks Matthieu,
>>>> using __restrict__ with g++ did not change anything. How do I use
>>>> valgrind with C extensions?
>>>> I don't know what "PAPI profil" is ...?
>>>> -Sebastian
>>>>
>>>>
>>>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>>>> <matthieu.brucher at gmail.com> wrote:
>>>> > Hi,
>>>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>>>> > the
>>>> > only pointer to the specific memory location), and then declare dist_
>>>> > inside
>>>> > the first loop also with a restrict.
>>>> > Then, I would run valgrind or a PAPI profil on your code to see what
>>>> > causes
>>>> > the issue (false sharing, ...)
>>>> > Matthieu
>>>> >
>>>> > 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>>>> >>
>>>> >> Hi,
>>>> >> I assume that someone here could maybe help me, and I'm hoping it's
>>>> >> not too much off topic.
>>>> >> I have 2 arrays of 2d point coordinates and would like to calculate
>>>> >> all pairwise distances as fast as possible.
>>>> >> Going from Python/Numpy to a (Swigged) C extension already gave me a
>>>> >> 55x speedup.
>>>> >> (.9ms vs. 50ms for arrays of length 329 and 340).
>>>> >> I'm using gcc on Linux.
>>>> >> Now I'm wondering if I could go even faster !?
>>>> >> My hope that the compiler might automagically do some SSE2
>>>> >> optimization got disappointed.
>>>> >> Since I have a 4 core CPU I thought OpenMP might be an option;
>>>> >> I never used that, and after some playing around I managed to get
>>>> >> (only) 50% slowdown(!) :-(
>>>> >>
>>>> >> My code in short is this:
>>>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>>>> >> -------<Ccode> ----------
>>>> >> void dists2d(
>>>> >>                   double *a_ps, int nx1, int na,
>>>> >>                   double *b_ps, int nx2, int nb,
>>>> >>                   double *dist, int nx3, int ny3)  throw (char*)
>>>> >> {
>>>> >>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>>>> >>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>>>> >>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape
>>>> >> (na,nb)";
>>>> >>
>>>> >>  double *dist_;
>>>> >>  int i, j;
>>>> >>
>>>> >> #pragma omp parallel private(dist_, j, i)
>>>> >>  {
>>>> >> #pragma omp for nowait
>>>> >>        for(i=0;i<na;i++)
>>>> >>          {
>>>> >>                //num_threads=omp_get_num_threads();  --> 4
>>>> >>                dist_ = dist+i*nb;                 // dists_  is  only
>>>> >> introduced for OpenMP
>>>> >>                for(j=0;j<nb;j++)
>>>> >>                  {
>>>> >>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   -
>>>> >> b_ps[j*nx2]) +
>>>> >>
>>>> >>  sq(a_ps[i*nx1+1]
>>>> >> - b_ps[j*nx2+1]) );
>>>> >>                  }
>>>> >>          }
>>>> >>  }
>>>> >> }
>>>> >> -------</Ccode> ----------
>>>> >> There is probably a simple mistake in this code - as I said I never
>>>> >> used OpenMP before.
>>>> >> It should be not too difficult to use OpenMP correctly here
>>>> >>  or -  maybe better -
>>>> >> is there a simple SSE(2,3,4) version that might be even better than
>>>> >> OpenMP... !?
>>>> >>
>>>> >> I supposed, that I did not get the #pragma omp lines right - any idea ?
>>>> >> Or is it in general not possible to speed this kind of code up using
>>>> >> OpenMP !?
>>>> >>
>>>> >> Thanks,
>>>> >> Sebastian Haase
>>>> >> _______________________________________________
>>>> >> NumPy-Discussion mailing list
>>>> >> NumPy-Discussion at scipy.org
>>>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Information System Engineer, Ph.D.
>>>> > Blog: http://matt.eifelle.com
>>>> > LinkedIn: http://www.linkedin.com/in/matthieubrucher
>>>> >
>>>> > _______________________________________________
>>>> > NumPy-Discussion mailing list
>>>> > NumPy-Discussion at scipy.org
>>>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>> >
>>>> >
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>>
>>> --
>>> Information System Engineer, Ph.D.
>>> Blog: http://matt.eifelle.com
>>> LinkedIn: http://www.linkedin.com/in/matthieubrucher
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>
>> As an aside, do you have a GPU-equipped machine? This function looks
>> like it would be an easy CUDA target. Depends on who the users of the
>> function are (e.g. do they also have the CUDA runtime) if whether you
>> wanted to go that route.
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

These days it's pretty low-hanging fruit-- but you need to learn a bit
about the CUDA API and programming model through the user guide:

http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

I would say it's time well-invested.

There is PyCUDA which makes the "boilerplate" parts of doing GPU
computing a bit more pleasant (just write the kernel)-- I've been
using Cython recently to interact with CUDA-based C code which is not
a terrible option either (more work than PyCUDA).

But if speed is your concern than the GPU will be *way* more bang for
your buck than OpenMP/SSE2. I'd be surprised if you can't get at least
20x speedup over single-threaded C code even without tweaking (and
maybe even a lot faster than that). But unfortunately with CUDA
programming you spend a lot of your time tweaking memory transactions
to get the best performance.