[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Sturla Molden sturla at molden.no
Fri Feb 18 18:40:07 EST 2011

On 17.02.2011 16:31, Matthieu Brucher wrote:
> It may also be the sizes of the chunk OMP uses. You can/should specify
> them in the OMP pragma so that it is a multiple of the cache line size or
> something close.
>
> Matthieu

Also beware of "false sharing" among the threads. When one processor 
updates the array "dist" in Sebastian's code, the cache line is dirtied 
for the other processors:

   #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
      for (j = 0; j < nb; j++) {
          dif_x = ax - b_ps[j*nx2];
          dif_y = ay - b_ps[j*nx2+1];

          /* update shared memory */
          dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);

          /* ... and poof the cache is dirty */
      }

Whenever this happens, the processors must stop whatever they are doing 
to resynchronize their cache lines. "False sharing" can therefore work 
as an "invisible GIL" inside OpenMP code. The processors can appear to 
run in syrup, and there is excessive traffic on the memory bus.
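
As a rough illustration of how the chunk size interacts with false sharing, 
here is a minimal sketch. It is not Sebastian's code: the function name 
squared_dists and its arguments are hypothetical, a 64-byte cache line 
(8 doubles) is assumed, and it stores the squared distance to keep the 
example short. With schedule(static, 1) neighbouring elements of out[] would 
go to different threads, so nearly every store would hit a line another 
thread is also writing. Handing out whole cache lines per chunk should avoid 
most of that:

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines and 8-byte doubles, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* Hypothetical helper: out[j] = (ax - bx[j])^2 + (ay - by[j])^2. */
void squared_dists(double ax, double ay,
                   const double *bx, const double *by,
                   double *out, int nb)
{
    int j;

    /* Each chunk covers 4 whole cache lines' worth of out[]. Two threads can
       then share a line only at a chunk boundary, and only if out[] itself
       does not start on a cache line boundary. */
    #pragma omp parallel for schedule(static, 4 * DOUBLES_PER_LINE)
    for (j = 0; j < nb; j++) {
        double dx = ax - bx[j];   /* declared in the loop body: private */
        double dy = ay - by[j];
        out[j] = dx * dx + dy * dy;
    }
}
```

The factor 4 is arbitrary. The point is only that the chunk is a multiple of 
the (assumed) line length, as Matthieu suggested.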

This is also why MPI programs often scale better than OpenMP programs, 
despite the IPC overhead.

One piece of advice when working with OpenMP is to let each thread write to 
private data arrays, and only share read-only arrays.
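
A minimal sketch of that pattern, using a histogram rather than the distance 
code from this thread (the function histogram and the constant NBINS are 
made up for illustration). Each thread fills an array on its own stack 
inside the hot loop, and the shared result is touched only once per thread 
at the end:

```c
#define NBINS 16   /* hypothetical number of bins */

/* Every data[i] must lie in [0, NBINS); hist must hold NBINS counts. */
void histogram(const int *data, long n, long *hist)
{
    int b;

    for (b = 0; b < NBINS; b++)
        hist[b] = 0;

    #pragma omp parallel
    {
        long local[NBINS] = {0};   /* on this thread's own stack */
        long i;
        int k;

        /* hot loop: reads the shared data[], writes only the private local[] */
        #pragma omp for
        for (i = 0; i < n; i++)
            local[data[i]]++;

        /* merge: the shared hist[] is written here and nowhere else,
           one thread at a time */
        #pragma omp critical
        for (k = 0; k < NBINS; k++)
            hist[k] += local[k];
    }
}
```

Compiled without OpenMP the pragmas are ignored and the function still gives 
the same result serially.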

One can e.g. use OpenMP's "reduction" clause to achieve this: initialize 
the array dist with zeros, and use reduction(+:dist) in the 
OpenMP pragma line.
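
A hedged sketch of what a reduction over an array can look like in C. It 
assumes a compiler that implements OpenMP 4.5 or later, where C gained 
reductions over array sections such as acc[:ncols]. Older C compilers accept 
only scalar reduction variables and will reject this. The function 
column_sums is a hypothetical example, not the distance code above, and 
ncols is assumed to be positive:

```c
/* Sum the rows of a row-major nrows x ncols matrix into acc[0..ncols-1]. */
void column_sums(const double *a, int nrows, int ncols, double *acc)
{
    int i, j;

    for (j = 0; j < ncols; j++)
        acc[j] = 0.0;   /* start from zeros */

    /* acc[:ncols] is an array section: each thread accumulates into its own
       zero-initialised private copy, and the copies are added into the
       shared acc when the loop finishes. */
    #pragma omp parallel for private(j) reduction(+:acc[:ncols])
    for (i = 0; i < nrows; i++)
        for (j = 0; j < ncols; j++)
            acc[j] += a[i * ncols + j];
}
```

The price is one extra copy of the array per thread, so this is probably 
only attractive when the reduced array is small compared to the work done in 
the loop.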
