Here's a hack that Google turned up:
(1) Use static variables instead of dynamic (stack) variables
(2) Use in-line assembly code that explicitly aligns data
(3) In C code, use malloc to explicitly allocate variables
Here is Intel's example of (2):
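; procedure prologue
push ebp          ; save the caller's frame pointer
mov  ebp, esp     ; Intel syntax (dest, src): copy esp into ebp
and  ebp, -8      ; round ebp down to a multiple of 8; locals are addressed from it
sub  esp, 12      ; reserve 12 bytes for locals
; procedure epilogue
add  esp, 12      ; release the locals
pop  ebp          ; restore the caller's frame pointer
ret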
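Intel's hack predates C11. As a swapped-in alternative that is not part of Intel's note, a C11 compiler can be asked for the alignment directly with alignas, which covers option (1) and stack variables without inline assembly; for an over-aligned local the compiler typically emits a similar frame-realigning "and" instruction itself. A minimal sketch, assuming a C11 compiler; is_aligned and stack_array_is_aligned are hypothetical helpers:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Non-zero if ptr is a multiple of `alignment` (must be a power of two). */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Option (1): a static array; the compiler and linker place it on a
   32-byte boundary, with no run-time work. */
static alignas(32) double static_buf[4];

/* Stack array: the compiler realigns the frame itself, much like the
   hand-written prologue does with "and ebp, -8". */
static int stack_array_is_aligned(void)
{
    alignas(32) double local[4] = {0.0, 1.0, 2.0, 3.0};
    return is_aligned(local, 32) && local[3] == 3.0;
}
```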
Intel's example of (3), slightly modified:
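#include <stdint.h>  /* uintptr_t */
#include <stdlib.h>  /* malloc */
double *p, *newp;
p = (double*)malloc((sizeof(double)*NPTS)+4);
/* round the address (not a double* offset) up to a multiple of 8; assumes p is at least 4-byte aligned. free(p), not newp */
newp = (double*)(((uintptr_t)p + 4) & ~(uintptr_t)7);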
This ensures that newp is 8-byte aligned even if p is not. However, malloc() may already follow Intel's recommendation that data structures of 32 bytes or more be aligned on a 32-byte boundary. In that case, increasing the requested memory by 4 bytes and computing newp are superfluous.
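Intel's +4 relies on malloc returning memory that is at least 4-byte aligned, and it only reaches 8 bytes. A sketch generalizing the same trick to any power-of-two alignment (for example the 32 bytes discussed here) reserves alignment - 1 spare bytes, which covers the worst case whatever malloc returns. alloc_aligned_doubles is a hypothetical helper; the round trip through uintptr_t is implementation-defined in C but behaves as expected on mainstream platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns space for `n` doubles starting on an
   `alignment`-byte boundary. *raw receives the pointer malloc actually
   returned; pass that (not the aligned pointer) to free().
   alignment must be a power of two and at least sizeof(double). */
static double *alloc_aligned_doubles(size_t n, size_t alignment, void **raw)
{
    assert(alignment >= sizeof(double) && (alignment & (alignment - 1)) == 0);
    /* alignment - 1 spare bytes: enough for any starting address */
    void *p = malloc(n * sizeof(double) + alignment - 1);
    *raw = p;
    if (p == NULL)
        return NULL;
    /* round the address up to the next multiple of alignment */
    uintptr_t addr = ((uintptr_t)p + alignment - 1) & ~(uintptr_t)(alignment - 1);
    return (double *)addr;
}
```

Keeping the raw pointer separate is the price of this approach: the aligned pointer cannot be handed to free().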
I think that for numpy arrays it should be possible to define the offset so that the data are 32-byte aligned. However, this might break some people's code if they haven't paid attention to the offset.
Why? I really don't see how it could break anything at the source-code level. You don't have to care about anything you didn't care about before. The best proof of that is that numpy runs on different platforms where malloc has different alignment guarantees (Mac OS X already aligns to 16 bytes, precisely to make SIMD optimization easier, whereas glibc malloc only aligns to 8 bytes, at least on 32-bit Linux).
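Rather than depending on what malloc happens to guarantee on a given platform, the alignment can be requested explicitly with posix_memalign, the POSIX function mentioned at the end of this mail. A minimal sketch, assuming a POSIX system; pointer_alignment and alloc_aligned are hypothetical helpers, the first being a C analogue of probing a.ctypes.data % 32:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* make <stdlib.h> declare posix_memalign */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest power of two (up to `cap`) dividing the address; a pointer
   with address % 32 == 16 reports 16. */
static size_t pointer_alignment(const void *ptr, size_t cap)
{
    uintptr_t addr = (uintptr_t)ptr;
    size_t a = 1;
    while (a < cap && (addr & (uintptr_t)(2 * a - 1)) == 0)
        a *= 2;
    return a;
}

/* Request `size` bytes on an `alignment` boundary. posix_memalign needs a
   power of two that is also a multiple of sizeof(void *). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    assert(alignment % sizeof(void *) == 0 && (alignment & (alignment - 1)) == 0);
    void *p = NULL;
    /* returns 0 on success, an error number otherwise; release with free() */
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
}
```

Unlike the over-allocation trick, the result of posix_memalign can be passed straight to free().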
Another possibility is to allocate an oversized array, check the pointer, and take a range out of it. For instance:
In [32]: a = zeros(10)
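In [33]: a.ctypes.data % 32
Out[33]: 16

The data pointer sits 16 bytes past a 32-byte boundary, so the array is 16-byte aligned but not 32-byte aligned. Each float64 element is 8 bytes, so skipping two elements moves the start forward by 16 bytes:

In [34]: a[2:].ctypes.data % 32
Out[34]: 0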
Voila, 32-byte alignment. I think a short Python routine could do this, which ought to serve well for 1D FFTs. Multidimensional arrays will be trickier if you want the rows to be aligned. Aligning the columns just isn't going to work.

I am not suggesting realigning existing arrays. What I would like numpy to support are the following cases:
- Check whether a given numpy array is SIMD aligned:

    /* Simple case: if aligned, use the optimized func, otherwise the non-optimized one */
    int simd_func(double* in, size_t n);
    int nosimd_func(double* in, size_t n);

    if (PyArray_ISALIGNED_SIMD(a)) {
        simd_func((double *)a->data, PyArray_SIZE(a));
    } else {
        nosimd_func((double *)a->data, PyArray_SIZE(a));
    }

- Explicitly request an aligned array from any PyArray_* function which creates an ndarray, e.g.:

    ar = PyArray_FROM_OF(a, NPY_SIMD_ALIGNED);

Allocating a buffer aligned to a given alignment is not the problem: there is a POSIX function to do it (posix_memalign), and we can easily implement one for the OSes which do not support it. This would be done in C, not in Python.

cheers,
David
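Both cases can be sketched in plain C under some assumptions. PyArray_ISALIGNED_SIMD and NPY_SIMD_ALIGNED are proposals, not existing numpy API, so the check below works on a raw data pointer. The sketch also picks up the multidimensional case raised above: padding each row to a whole number of 32-byte blocks, in a buffer from C11 aligned_alloc, makes every row start aligned, so each row passes the check. Columns cannot be handled the same way, since neighboring elements in a row are only 8 bytes apart and at most every fourth column can start on a 32-byte boundary. All names (SIMD_ALIGNMENT, is_simd_aligned, sum_blocked, sum_plain, sum_dispatch, padded_row_length, alloc_row_aligned_matrix) are hypothetical, and the two kernels are plain C loops standing in for simd_func and nosimd_func.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SIMD_ALIGNMENT 32  /* assumption: the 32 bytes discussed in this thread */

/* Stand-in for the proposed PyArray_ISALIGNED_SIMD, applied to a data pointer. */
static int is_simd_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % SIMD_ALIGNMENT) == 0;
}

/* Stand-in for simd_func: consumes 4 doubles (32 bytes) per step, the shape
   an aligned vector loop would have; written in plain C here. */
static double sum_blocked(const double *in, size_t n)
{
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            acc[k] += in[i + k];
    double total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)  /* leftover elements */
        total += in[i];
    return total;
}

/* Stand-in for nosimd_func. */
static double sum_plain(const double *in, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += in[i];
    return total;
}

/* The dispatch from the mail: fast path only for aligned input. */
static double sum_dispatch(const double *in, size_t n)
{
    return is_simd_aligned(in) ? sum_blocked(in, n) : sum_plain(in, n);
}

/* Row length in doubles, rounded up to a whole number of SIMD_ALIGNMENT blocks. */
static size_t padded_row_length(size_t ncols)
{
    const size_t per_block = SIMD_ALIGNMENT / sizeof(double);
    return (ncols + per_block - 1) / per_block * per_block;
}

/* nrows x ncols matrix whose rows all start on a SIMD_ALIGNMENT boundary.
   *ld receives the padded row length; element (i, j) is m[i * *ld + j].
   Release with free(). */
static double *alloc_row_aligned_matrix(size_t nrows, size_t ncols, size_t *ld)
{
    *ld = padded_row_length(ncols);
    size_t bytes = nrows * *ld * sizeof(double);
    /* a whole number of blocks, as C11 aligned_alloc requires */
    assert(bytes % SIMD_ALIGNMENT == 0);
    return aligned_alloc(SIMD_ALIGNMENT, bytes);
}
```

In numpy terms, the padded row length corresponds to a row stride larger than the row itself, as you get when slicing columns off a wider array.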