[Numpy-discussion] numpy arrays, data allocation and SIMD alignment

Sat Aug 4 01:30:46 EDT 2007

On 8/3/07, David Cournapeau <david at ar.media.kyoto-u.ac.jp> wrote:
>
> Andrew Straw wrote:
> > Dear David,
> >
> > Both ideas, particularly the 2nd, would be excellent additions to numpy.
> > I often use the Intel IPP (Integrated Performance Primitives) Library
> > together with numpy, but I have to do all my memory allocation with the
> > IPP to ensure fastest operation. I then create numpy views of the data.
> > All this works brilliantly, but it would be really nice if I could
> > allocate the memory directly in numpy.
> >
> > IPP allocates, and says it wants, 32 byte aligned memory (see, e.g.
> > http://www.intel.com/support/performancetools/sb/CS-021418.htm ). Given
> > that fftw3 apparently wants 16 byte aligned memory, my feeling is that,
> > if the effort is made, the alignment width should be specified at
> > run-time, rather than hard-coded.
> I think that doing it at runtime would be overkill, no? I was thinking
> about making it a compile option. Generally, at the ASM level, you need
> 16-byte alignment (for instructions like movaps, which takes 16 bytes
> from memory and puts them in an SSE register); this is not just fftw.
> Maybe the 32-byte alignment is useful for cache reasons, I don't know.
>
> I don't think it would be difficult to implement and validate; what I
> don't know at all is the implication of this at the binary level, if any.

Here are some hacks that Google turned up:

(1) Use static variables instead of dynamic (stack) variables
(2) Use in-line assembly code that explicitly aligns data
(3) In C code, use "malloc" to explicitly allocate variables
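
For (1), a minimal sketch of what a statically allocated, aligned buffer could look like, assuming a C11 compiler (alignas from <stdalign.h>). The names NPTS, buf, is_aligned and static_buffer are made up for illustration and come from neither Intel nor numpy:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define NPTS 1024  /* hypothetical buffer length */

/* Static storage with an explicit 16-byte alignment request (C11). */
static alignas(16) double buf[NPTS];

/* Returns 1 if ptr is a multiple of alignment (a power of two). */
static int is_aligned(const void *ptr, uintptr_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical accessor: checks the alignment, then hands out the buffer. */
static double *static_buffer(void)
{
    assert(is_aligned(buf, 16));
    return buf;
}
```

Here the alignment is fixed at compile time, which is closer to the compile-option idea than to a run-time setting.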

Here is Intel's example of (2):

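; procedure prologue
push ebp            ; save the caller's frame pointer
mov ebp, esp        ; set up this procedure's frame pointer
and ebp, -8         ; round ebp down to an 8-byte boundary
sub esp, 12         ; reserve stack space for the locals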

; procedure epilogue
add esp, 12
pop ebp
ret

Intel's example of (3), slightly modified:

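/* needs <stdlib.h> and <stdint.h>; assumes malloc() returns memory that is at least 4-byte aligned */
double *p, *newp;
p = (double *)malloc((sizeof(double) * NPTS) + 4);
newp = (double *)(((uintptr_t)p + 4) & ~(uintptr_t)7);
/* release with free(p), never free(newp) */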

This ensures that newp is 8-byte aligned even if p is not. However,
malloc() may already follow Intel's recommendation that data structures
of 32 bytes or more be aligned on a 32-byte boundary. In that case,
increasing the requested memory by 4 bytes and computing newp are
superfluous.
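
The same over-allocate-and-round-up trick could be generalized to an alignment chosen at run time, as Andrew suggested. Below is a rough sketch under stated assumptions, not a proposal for how numpy should do it: aligned_malloc and aligned_free are hypothetical names, the alignment must be a power of two, and size overflow is not checked. On POSIX systems posix_memalign() already provides this service.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate size bytes aligned to alignment,
 * which must be a power of two. Returns NULL on failure. */
static void *aligned_malloc(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Room for the payload, worst-case padding, and the saved raw pointer. */
    void *raw = malloc(size + alignment - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Skip past the slot for the raw pointer, then round up. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);

    /* Stash the raw pointer just below the aligned block for aligned_free. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Release a block obtained from aligned_malloc. */
static void aligned_free(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

The cost is a few extra bytes per allocation, and the returned pointer must go back through aligned_free() rather than free().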

Chuck