[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Wed Feb 16 07:29:47 EST 2011

Matthieu,
I got it to run in valgrind (using the options from your blog).
For dist2d() it says under "Types":
----------------------------------------------------------------------
Event Type   Incl.    Self    Short      Formula
------------------------------------------------------------------------------------
Instruction Fetch  379090    32    lr
Data Read Access 116136   7 Dr
Data Write Access   29921 11 Dw
L1 Instn Fetch Miss 195 4 |1mr
L1 Data Read Miss 270 2 Dlmr
L1 Data Write Miss 3636 0 Dlmw
L2 Instn Fetch Miss 125 4 |2mr
L2 Data Read Miss 144 1 D2mr
L2 Data Write Miss 83 0 D2mw
L1 Miss Sum 4101 6 L1m = |1mr + D1mr + D1mw
L2 Miss Sum 352 5 L2m = |2mr + D2mr + D2mw
Cycle Estimation 455300592 CEst = Ir + 10 L1m + 100 L2m
---------------------------------------------------------------------------------------
(hope this is readable - I used some OCR software to past it here)

(My source code still uses __restrict__)

Do I see this right, that the problem is shown by the lines
L1 Data Read Miss
and
L1 Miss Sum

I have no idea what to look for ....

Thanks for the help,
-- Sebastian

On Tue, Feb 15, 2011 at 5:25 PM, Matthieu Brucher
<matthieu.brucher at gmail.com> wrote:
> Use directly restrict in C99 mode (__restrict does not have exactly the same
> semantics).
> For a valgrind profil, you can check my blog
> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
> Basically, if you have a python script, you can valgrind --optionsinmyblog
> python myscript.py
> For PAPI, you have to install several packages (perf module for kernel for
> instance) and a GUI to analyze the results (in Eclispe, it should be
> possible).
> Matthieu
> 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>>
>> Thanks Matthieu,
>> using __restrict__ with g++ did not change anything. How do I use
>> valgrind with C extensions?
>> I don't know what "PAPI profil" is ...?
>> -Sebastian
>>
>>
>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>> <matthieu.brucher at gmail.com> wrote:
>> > Hi,
>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>> > the
>> > only pointer to the specific memory location), and then declare dist_
>> > inside
>> > the first loop also with a restrict.
>> > Then, I would run valgrind or a PAPI profil on your code to see what
>> > causes
>> > the issue (false sharing, ...)
>> > Matthieu
>> >
>> > 2011/2/15 Sebastian Haase <seb.haase at gmail.com>
>> >>
>> >> Hi,
>> >> I assume that someone here could maybe help me, and I'm hoping it's
>> >> not too much off topic.
>> >> I have 2 arrays of 2d point coordinates and would like to calculate
>> >> all pairwise distances as fast as possible.
>> >> Going from Python/Numpy to a (Swigged) C extension already gave me a
>> >> 55x speedup.
>> >> (.9ms vs. 50ms for arrays of length 329 and 340).
>> >> I'm using gcc on Linux.
>> >> Now I'm wondering if I could go even faster !?
>> >> My hope that the compiler might automagically do some SSE2
>> >> optimization got disappointed.
>> >> Since I have a 4 core CPU I thought OpenMP might be an option;
>> >> I never used that, and after some playing around I managed to get
>> >> (only) 50% slowdown(!) :-(
>> >>
>> >> My code in short is this:
>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>> >> -------<Ccode> ----------
>> >> void dists2d(
>> >>                   double *a_ps, int nx1, int na,
>> >>                   double *b_ps, int nx2, int nb,
>> >>                   double *dist, int nx3, int ny3)  throw (char*)
>> >> {
>> >>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>> >>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>> >>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape
>> >> (na,nb)";
>> >>
>> >>  double *dist_;
>> >>  int i, j;
>> >>
>> >> #pragma omp parallel private(dist_, j, i)
>> >>  {
>> >> #pragma omp for nowait
>> >>        for(i=0;i<na;i++)
>> >>          {
>> >>                //num_threads=omp_get_num_threads();  --> 4
>> >>                dist_ = dist+i*nb;                 // dists_  is  only
>> >> introduced for OpenMP
>> >>                for(j=0;j<nb;j++)
>> >>                  {
>> >>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   -
>> >> b_ps[j*nx2]) +
>> >>
>> >>  sq(a_ps[i*nx1+1]
>> >> - b_ps[j*nx2+1]) );
>> >>                  }
>> >>          }
>> >>  }
>> >> }
>> >> -------</Ccode> ----------
>> >> There is probably a simple mistake in this code - as I said I never
>> >> used OpenMP before.
>> >> It should be not too difficult to use OpenMP correctly here
>> >>  or -  maybe better -
>> >> is there a simple SSE(2,3,4) version that might be even better than
>> >> OpenMP... !?
>> >>
>> >> I supposed, that I did not get the #pragma omp lines right - any idea ?
>> >> Or is it in general not possible to speed this kind of code up using
>> >> OpenMP !?
>> >>
>> >> Thanks,
>> >> Sebastian Haase
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion at scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>> >
>> > --
>> > Information System Engineer, Ph.D.
>> > Blog: http://matt.eifelle.com
>> > LinkedIn: http://www.linkedin.com/in/matthieubrucher
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>