[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Matthieu Brucher matthieu.brucher at gmail.com
Sat Feb 19 15:49:52 EST 2011


Write misses are an indication that the data had to be brought into L1 before
it could be written.
I don't know whether valgrind can report false sharing, unfortunately. That is
why I suggested using a chunk size that is a multiple of the cache line size,
so that false sharing does not occur.
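
For concreteness, here is a minimal sketch of what that could look like on
Sebastian's loop. It is only an illustration: it assumes 64-byte cache lines,
8-byte doubles, and a row-major na-by-nb layout for dist (the quoted code
writes dist[2*i+j], so that indexing is an assumption on my part), and the
chunk of 64 rows is just one convenient multiple.

  /* Sketch: each chunk of the outer loop writes 64*nb consecutive doubles
   * of dist, i.e. a whole number of 64-byte cache lines (assuming dist
   * itself starts on a cache-line boundary), so chunk boundaries fall on
   * cache-line boundaries. */
  #include <math.h>

  void pairwise_dist(const double *a_ps, const double *b_ps, double *dist,
                     int na, int nb, int nx1, int nx2)
  {
      int i, j;
      #pragma omp parallel for schedule(static, 64) private(j)
      for (i = 0; i < na; i++) {
          double ax = a_ps[i*nx1];
          double ay = a_ps[i*nx1 + 1];
          for (j = 0; j < nb; j++) {
              double dif_x = ax - b_ps[j*nx2];
              double dif_y = ay - b_ps[j*nx2 + 1];
              dist[i*nb + j] = sqrt(dif_x*dif_x + dif_y*dif_y);
          }
      }
  }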

Matthieu

2011/2/19 Sebastian Haase <seb.haase at gmail.com>

> Thanks a lot. Very informative. I guess what you say about "cache line
> is dirtied" is related to the info I got with valgrind (see my email
> in this thread: L1 Data Write Miss 3636).
> Can one assume that the cache line is always a few megabytes?
>
> Thanks,
> Sebastian
>
> On Sat, Feb 19, 2011 at 12:40 AM, Sturla Molden <sturla at molden.no> wrote:
> > On 17.02.2011 16:31, Matthieu Brucher wrote:
> >
> > It may also be the sizes of the chunk OMP uses. You can/should specify
> > them in the OMP pragma so that it is a multiple of the cache line size
> > or something close.
> >
> > Matthieu
> >
> > Also beware of "false sharing" among the threads. When one processor
> > updates the array "dist" in Sebastian's code, the cache line is dirtied
> > for the other processors:
> >
> >   #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
> >   for(i=0;i<na;i++) {
> >      ax=a_ps[i*nx1];
> >      ay=a_ps[i*nx1+1];
> >      for(j=0;j<nb;j++) {
> >          dif_x = ax - b_ps[j*nx2];
> >          dif_y = ay - b_ps[j*nx2+1];
> >
> >          /* update shared memory */
> >
> >          dist[2*i+j]  = sqrt(dif_x*dif_x+dif_y*dif_y);
> >
> >          /* ... and poof the cache is dirty */
> >
> >      }
> >   }
> >
> > Whenever this happens, the processors must stop whatever they are doing
> > to resynchronize their cache lines. "False sharing" can therefore act as
> > an "invisible GIL" inside OpenMP code. The processors can appear to run
> > in syrup, and there is excessive traffic on the memory bus.
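
(As an aside, a hypothetical, self-contained illustration of false sharing,
not taken from Sebastian's code and assuming 64-byte cache lines: each thread
only touches its own counter, yet packing the counters next to each other
would put them in one cache line and serialize the cores; padding each
counter to a full line removes the contention.)

  #include <omp.h>

  #define CACHE_LINE 64        /* assumption: typical line size on x86 */
  #define MAX_THREADS 64       /* sketch assumes at most this many threads */

  struct padded_counter {
      long value;
      char pad[CACHE_LINE - sizeof(long)];  /* one counter per cache line */
  };

  void count_demo(long iters)
  {
      struct padded_counter c[MAX_THREADS] = {{0}};
      #pragma omp parallel
      {
          int t = omp_get_thread_num();
          long k;
          for (k = 0; k < iters; k++)
              c[t].value++;    /* each core owns its line: no false sharing */
      }
  }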
> >
> > This is also why MPI programs often scale better than OpenMP programs,
> > despite the IPC overhead.
> >
> > One piece of advice when working with OpenMP is to let each thread write
> > to private data arrays, and only share read-only arrays.
> >
> > One can, e.g., use OpenMP's "reduction" clause to achieve this:
> > initialize the array dist with zeros, and use reduction(+:dist) in the
> > OpenMP pragma line.
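
A minimal sketch of that suggestion, with two assumptions worth flagging: an
array-section reduction such as reduction(+:dist[0:na*nb]) requires OpenMP 4.5
or newer, and dist is indexed here as a row-major na-by-nb array (the quoted
code uses dist[2*i+j]).

  #include <math.h>
  #include <string.h>

  void pairwise_dist_reduce(const double *a_ps, const double *b_ps,
                            double *dist, int na, int nb, int nx1, int nx2)
  {
      int i, j;
      memset(dist, 0, (size_t)na * nb * sizeof(double));  /* start from zeros */

      /* Every thread gets its own zero-initialized copy of dist, so no
       * cache line is ever shared for writing; the per-thread copies are
       * summed into dist when the parallel region ends. */
      #pragma omp parallel for private(j) reduction(+:dist[0:na*nb])
      for (i = 0; i < na; i++) {
          double ax = a_ps[i*nx1];
          double ay = a_ps[i*nx1 + 1];
          for (j = 0; j < nb; j++) {
              double dif_x = ax - b_ps[j*nx2];
              double dif_y = ay - b_ps[j*nx2 + 1];
              dist[i*nb + j] += sqrt(dif_x*dif_x + dif_y*dif_y);
          }
      }
  }

The trade-off is memory and a combine step: each thread owns a full copy of
dist, so this pays off mainly when na*nb is modest or when false sharing is
the dominant cost.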
> >
> > Sturla
> >



-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher