Use restrict directly in C99 mode (__restrict does not have exactly the same semantics).<div><br></div><div>For a valgrind profile, you can check my blog (<a href="http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/">http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/</a>).</div>

<div>Basically, if you have a Python script, you can run valgrind --optionsinmyblog python myscript.py</div><div><br></div><div>For PAPI, you have to install several packages (the perf kernel module, for instance) and a GUI to analyze the results (it should be possible in Eclipse).</div>

<div><br></div><div>Matthieu</div><div><br><div class="gmail_quote">2011/2/15 Sebastian Haase <span dir="ltr"><<a href="mailto:seb.haase@gmail.com">seb.haase@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Thanks Matthieu,<br>

using __restrict__ with g++ did not change anything. How do I use<br>

valgrind with C extensions?<br>

I don't know what "PAPI profil" is ...?<br>

<font color="#888888">-Sebastian<br>

</font><div><div></div><div class="h5"><br>

<br>

On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher<br>

<<a href="mailto:matthieu.brucher@gmail.com">matthieu.brucher@gmail.com</a>> wrote:<br>

> Hi,<br>

> My first move would be to add a restrict keyword to dist (i.e. dist is the<br>

> only pointer to the specific memory location), and then declare dist_ inside<br>

> the first loop also with a restrict.<br>

> Then, I would run valgrind or a PAPI profile on your code to see what causes<br>

> the issue (false sharing, ...)<br>

> Matthieu<br>

><br>

> 2011/2/15 Sebastian Haase <<a href="mailto:seb.haase@gmail.com">seb.haase@gmail.com</a>><br>

>><br>

>> Hi,<br>

>> I assume that someone here could maybe help me, and I'm hoping it's<br>

>> not too much off topic.<br>

>> I have 2 arrays of 2d point coordinates and would like to calculate<br>

>> all pairwise distances as fast as possible.<br>

>> Going from Python/Numpy to a (Swigged) C extension already gave me a<br>

>> 55x speedup.<br>

>> (.9ms vs. 50ms for arrays of length 329 and 340).<br>

>> I'm using gcc on Linux.<br>

>> Now I'm wondering if I could go even faster !?<br>

>> My hope that the compiler might automagically do some SSE2<br>

>> optimization got disappointed.<br>

>> Since I have a 4 core CPU I thought OpenMP might be an option;<br>

>> I never used that, and after some playing around I managed to get<br>

>> (only) 50% slowdown(!) :-(<br>

>><br>

>> My code in short is this:<br>

>> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)<br>

>> -------<Ccode> ----------<br>

>> void dists2d(<br>

>>                   double *a_ps, int nx1, int na,<br>

>>                   double *b_ps, int nx2, int nb,<br>

>>                   double *dist, int nx3, int ny3)  throw (char*)<br>

>> {<br>

>>  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";<br>

>>  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";<br>

>>  if(nx3 != nb || ny3 != na)    throw (char*) "c must be of shape (na,nb)";<br>

>><br>

>>  double *dist_;<br>

>>  int i, j;<br>

>><br>

>> #pragma omp parallel private(dist_, j, i)<br>

>>  {<br>

>> #pragma omp for nowait<br>

>>        for(i=0;i<na;i++)<br>

>>          {<br>

>>                //num_threads=omp_get_num_threads();  --> 4<br>

>>                dist_ = dist+i*nb;                 // dist_ is only introduced for OpenMP<br>

>>                for(j=0;j<nb;j++)<br>

>>                  {<br>

>>                        *dist_++  = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +<br>

>>                                  sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );<br>

>>                  }<br>

>>          }<br>

>>  }<br>

>> }<br>

>> -------</Ccode> ----------<br>

>> There is probably a simple mistake in this code - as I said I never<br>

>> used OpenMP before.<br>

>> It should be not too difficult to use OpenMP correctly here<br>

>>  or -  maybe better -<br>

>> is there a simple SSE(2,3,4) version that might be even better than<br>

>> OpenMP... !?<br>

>><br>

>> I suppose that I did not get the #pragma omp lines right - any idea ?<br>

>> Or is it in general not possible to speed this kind of code up using<br>

>> OpenMP !?<br>

>><br>

>> Thanks,<br>

>> Sebastian Haase<br>

>> _______________________________________________<br>

>> NumPy-Discussion mailing list<br>

>> <a href="mailto:NumPy-Discussion@scipy.org">NumPy-Discussion@scipy.org</a><br>

>> <a href="http://mail.scipy.org/mailman/listinfo/numpy-discussion" target="_blank">http://mail.scipy.org/mailman/listinfo/numpy-discussion</a><br>

><br>

><br>

><br>

> --<br>

> Information System Engineer, Ph.D.<br>

> Blog: <a href="http://matt.eifelle.com" target="_blank">http://matt.eifelle.com</a><br>

> LinkedIn: <a href="http://www.linkedin.com/in/matthieubrucher" target="_blank">http://www.linkedin.com/in/matthieubrucher</a><br>

><br>


><br>

><br>


</div></div></blockquote></div><br><br clear="all"><br>-- <br>Information System Engineer, Ph.D.<br>Blog: <a href="http://matt.eifelle.com" target="_blank">http://matt.eifelle.com</a><br>LinkedIn: <a href="http://www.linkedin.com/in/matthieubrucher" target="_blank">http://www.linkedin.com/in/matthieubrucher</a><br>


</div>