[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
sturla at molden.no
Fri Feb 18 18:40:07 EST 2011
On 17.02.2011 16:31, Matthieu Brucher wrote:
> It may also be the sizes of the chunk OMP uses. You can/should specify
> them in the OMP pragma so that it is a multiple of the cache line size or
> something close.
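As a rough sketch of the quoted suggestion, the chunk size can be given in the schedule clause of the pragma. The kernel scale_by_two, the 64-byte cache line, and the factor of 4 lines per chunk are all assumptions made for illustration; none of them come from Sebastian's code. The chunks only line up with cache lines if the output array itself starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes; 64 is common on x86 but not universal. */
#define CACHE_LINE_BYTES 64

/* Chunk size in elements: 4 assumed cache lines' worth of doubles. */
enum { CHUNK = 4 * (CACHE_LINE_BYTES / sizeof(double)) };

/* Hypothetical kernel: out[i] = 2 * in[i].
   With schedule(static, CHUNK) each thread is handed blocks of CHUNK
   consecutive iterations, so neighbouring threads write to different
   cache lines (provided out is cache-line aligned). */
void scale_by_two(const double *in, double *out, int n)
{
    int i;

    assert(in != NULL && out != NULL && n >= 0);

    #pragma omp parallel for schedule(static, CHUNK)
    for (i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}
```

Without OpenMP enabled the pragma is ignored and the loop simply runs serially, so the result is the same either way.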
Also beware of "false sharing" among the threads. When one processor
updates the array "dist" in Sebastian's code, the cache line is dirtied
for the other processors:
    #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
    /* ... enclosing loops over i and j not shown ... */
    dif_x = ax - b_ps[j*nx2];
    dif_y = ay - b_ps[j*nx2+1];
    /* update shared memory */
    dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
    /* ... and poof the cache is dirty */
Whenever this happens, the processors must stop whatever they are doing
to resynchronize their cache lines. "False sharing" can therefore work
as an "invisible GIL" inside OpenMP code. The processors can appear to
run in syrup, and there is excessive traffic on the memory bus.
This is also why MPI programs often scale better than OpenMP programs,
despite the IPC overhead.
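To illustrate the false-sharing problem, here is a minimal sketch of one common workaround when threads do have to write into a shared array: pad each thread's slot so that no two slots land in the same cache line. Padding is not mentioned in the thread; it is a swapped-in technique shown only to make the effect concrete. The names padded_double, padded_sum and MAX_THREADS, and the 64-byte line size, are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE_BYTES 64   /* assumed cache-line size */
#define MAX_THREADS 64        /* hypothetical upper bound on team size */

/* One accumulator per thread, padded to a full (assumed) cache line so
   that two threads never update the same line. */
typedef struct {
    double value;
    char pad[CACHE_LINE_BYTES - sizeof(double)];
} padded_double;

/* Sum x[0..n-1] using one padded partial sum per thread. */
double padded_sum(const double *x, int n)
{
    padded_double partial[MAX_THREADS];
    int nthreads = 1;
    int t;
    double total = 0.0;

    assert(x != NULL && n >= 0);

#ifdef _OPENMP
    nthreads = omp_get_max_threads();
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
#endif
    for (t = 0; t < nthreads; t++)
        partial[t].value = 0.0;

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0, nt = 1, i;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nt = omp_get_num_threads();
#endif
        /* Each thread reads a strided slice of x (read-only, shared)
           and writes only to its own padded slot. */
        for (i = tid; i < n; i += nt)
            partial[tid].value += x[i];
    }

    /* Serial merge of the per-thread results. */
    for (t = 0; t < nthreads; t++)
        total += partial[t].value;
    return total;
}
```

If the pad member is removed, the slots become adjacent doubles and several threads end up hammering the same cache line, which is the situation described above. In practice a plain local variable per thread, merged at the end, is usually simpler still.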
One piece of advice when working with OpenMP is to let each thread write
to private data arrays, and only share read-only arrays.
One can, e.g., use OpenMP's "reduction" clause to achieve this: initialize
the array dist with zeros, and use reduction(+:dist) in the OpenMP pragma
line.
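A minimal sketch of the reduction idea, under several assumptions: in C, reducing over an array needs the array-section syntax reduction(+:dist[:n]), which requires a compiler supporting OpenMP 4.5 or later. The function sq_dists2d, the packed x,y layout, and the row-major index dist[i*nb + j] are hypothetical and not taken from Sebastian's code. It computes squared distances to keep the example free of libm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: squared distance from each of the na points in a
   to each of the nb points in b. Points are stored as packed x,y pairs.
   dist must hold na*nb doubles, laid out as dist[i*nb + j] (assumption). */
void sq_dists2d(const double *a, int na,
                const double *b, int nb, double *dist)
{
    int i, j;
    int n = na * nb;

    assert(a != NULL && b != NULL && dist != NULL);
    assert(na >= 0 && nb >= 0);
    if (n == 0)
        return;              /* zero-length array sections are not allowed */

    /* The reduction adds each thread's private copy into dist,
       so dist has to start out as zeros. */
    for (i = 0; i < n; i++)
        dist[i] = 0.0;

    /* Each thread gets its own zero-initialized copy of dist[0..n-1];
       the copies are summed into the shared array when the loop ends.
       a and b are only read, so sharing them is harmless. */
    #pragma omp parallel for private(j) reduction(+:dist[:n])
    for (i = 0; i < na; i++) {
        double ax = a[2*i];
        double ay = a[2*i + 1];
        for (j = 0; j < nb; j++) {
            double dif_x = ax - b[2*j];
            double dif_y = ay - b[2*j + 1];
            dist[i*nb + j] += dif_x*dif_x + dif_y*dif_y;
        }
    }
}
```

The price is memory: every thread holds a full private copy of dist, and some compilers place that copy on the thread's stack, so this is only reasonable for modest n. Since each element here is written by exactly one iteration, a per-thread scratch row copied out in one go would be a lighter alternative.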