On 8/3/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Andrew Straw wrote:
> Dear David,
>
> Both ideas, particularly the 2nd, would be excellent additions to numpy.
> I often use the Intel IPP (Integrated Performance Primitives) Library
> together with numpy, but I have to do all my memory allocation with the
> IPP to ensure fastest operation. I then create numpy views of the data.
> All this works brilliantly, but it would be really nice if I could
> allocate the memory directly in numpy.
>
> IPP allocates, and says it wants, 32 byte aligned memory (see, e.g.
> http://www.intel.com/support/performancetools/sb/CS-021418.htm ). Given
> that fftw3 apparently wants 16 byte aligned memory, my feeling is that,
> if the effort is made, the alignment width should be specified at
> run-time, rather than hard-coded.
I think that doing it at runtime would be overkill, no? I was thinking
about making it a compile option. Generally, at the ASM level, you need
16-byte alignment (for instructions like movaps, which loads 16 bytes
from memory into an SSE register); this is not just fftw. Maybe the
32-byte alignment is useful for cache reasons, I don't know.

I don't think it would be difficult to implement and validate; what I
don't know at all is the implication of this at the binary level, if any.


Here are some hacks that Google turned up:

(1) Use static variables instead of dynamic (stack) variables
(2) Use in-line assembly code that explicitly aligns data
(3) In C code, use "malloc" to explicitly allocate variables
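
For (1), a rough sketch of what a statically allocated, explicitly
aligned buffer could look like. It assumes a C11 compiler (the _Alignas
specifier postdates this thread; older compilers have their own
extensions), and the names buffer, get_buffer and is_aligned, as well as
the NPTS value, are made up for illustration:

```c
#include <stdint.h>  /* uintptr_t */

#define NPTS 1024  /* illustrative size, not taken from Intel's example */

/* Static storage with an explicit 16-byte alignment request (C11). */
static _Alignas(16) double buffer[NPTS];

/* Return a pointer to the static buffer. */
double *get_buffer(void)
{
    return buffer;
}

/* Return 1 if p is a multiple of alignment (assumed a power of two). */
int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

The obvious limitation is that the size has to be known at compile time,
which is why this does not help much for numpy arrays.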

Here is Intel's example of (2):

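; procedure prologue
push ebp
mov ebp, esp    ; frame pointer = stack pointer (destination first)
and ebp, -8     ; round the frame pointer down to an 8-byte boundary
sub esp, 12     ; reserve 12 bytes for locals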

; procedure epilogue
add esp, 12
pop ebp
ret

Intel's example of (3), slightly modified:

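/* needs <stdint.h> for uintptr_t and <stdlib.h> for malloc */
double *p, *newp;
p = (double*)malloc((sizeof(double)*NPTS)+4);
/* round up as an integer; assumes p is at least 4-byte aligned */
newp = (double*)(((uintptr_t)p + 4) & ~(uintptr_t)7);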

This ensures that newp is 8-byte aligned even if p is not, as long as p
is at least 4-byte aligned. Note that p, not newp, is the pointer to
pass to free(). However, malloc() may already follow Intel's
recommendation that data structures of 32 bytes or more be aligned on a
32-byte boundary. In that case, increasing the requested memory by
4 bytes and computing newp are superfluous.
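
The same trick generalizes to an alignment chosen at run time. What
follows is only a minimal sketch, not code from Intel or numpy:
over-allocate, round the address up to the requested boundary, and stash
the pointer that malloc() returned just below the aligned block so it
can be freed later. The names aligned_malloc and aligned_free are
hypothetical, and the alignment is assumed to be a power of two.

```c
#include <stdint.h>  /* uintptr_t */
#include <stdlib.h>  /* malloc, free */

/* Hypothetical helper: return size bytes aligned to alignment, which is
 * assumed to be a power of two (e.g. 16 for SSE, 32 for IPP).
 * Returns NULL on failure or if alignment is not a power of two. */
void *aligned_malloc(size_t size, size_t alignment)
{
    void *raw;
    uintptr_t addr;
    void **slot;

    if (alignment == 0 || (alignment & (alignment - 1)) != 0)
        return NULL;
    /* Keep the slot that holds the raw pointer itself aligned. */
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);

    /* Room for the payload, worst-case padding, and the saved pointer. */
    raw = malloc(size + alignment - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Skip past the slot, then round up to the next boundary. */
    addr = ((uintptr_t)raw + sizeof(void *) + alignment - 1)
           & ~(uintptr_t)(alignment - 1);

    /* Store the raw pointer immediately below the aligned block. */
    slot = (void **)addr - 1;
    *slot = raw;
    return (void *)addr;
}

/* Release a block obtained from aligned_malloc (NULL is allowed). */
void aligned_free(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

Something like aligned_malloc(NPTS * sizeof(double), 16) would then give
a buffer suitable for movaps, and the block must be released with
aligned_free rather than free. The cost is up to alignment - 1 bytes of
padding plus one pointer per allocation.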


Chuck