[Numpy-discussion] numpy arrays, data allocation and SIMD alignement

Sat Aug 4 02:25:38 EDT 2007

>
>
>     Here's a hack that Google turned up:
>
>     (1) Use static variables instead of dynamic (stack) variables
>     (2) Use in-line assembly code that explicitly aligns data
>     (3) In C code, use "malloc" to explicitly allocate variables
>
>     Here is Intel's example of (2):
>
>     ; procedure prologue
>     push ebp
>     mov esp, ebp
>     and ebp, -8
>     sub esp, 12
>
>     ; procedure epilogue
>     add esp, 12
>     pop ebp
>     ret
>
>     Intel's example of (3), slightly modified:
>
>     double *p, *newp;
>     p = (double*)malloc((sizeof(double)*NPTS)+4);
>     newp = (double *)(((uintptr_t)p + 4) & ~(uintptr_t)7);
>
>     This ensures that newp is 8-byte aligned even if p is not. However,
>     malloc() may already follow Intel's recommendation that a 32-byte
>     or greater data structure be aligned on a 32-byte boundary. In
>     that case, increasing the requested memory by 4 bytes and
>     computing newp are superfluous.
>
>
> I think that for numpy arrays it should be possible to define the 
> offset so that the result is 32 byte aligned. However, this might 
> break some people's code if they haven't paid attention to the offset.
Why? I really don't see how it can break anything at the source code
level. You don't have to care about things you didn't care about before:
the best proof of that is that numpy runs on different platforms where
malloc has different alignment guarantees (Mac OS X already aligns to
16 bytes, for the very reason of making SIMD optimization easier,
whereas glibc malloc only aligns to 8 bytes, at least on Linux).
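
To see what a given platform's malloc actually hands back, something like the following could be used. It is only a minimal sketch: pointer_alignment and malloc_alignment are hypothetical helper names, not part of numpy or libc, and a single allocation only shows what happened this time, not what the allocator guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest power of two, capped at `cap`, that divides the address of p.
 * Hypothetical helper for illustration only. */
size_t pointer_alignment(const void *p, size_t cap)
{
    uintptr_t addr = (uintptr_t)p;
    size_t align = 1;

    assert(p != NULL);
    while (align < cap && (addr % (align * 2)) == 0) {
        align *= 2;
    }
    return align;
}

/* Alignment observed for one malloc'ed block of nbytes (0 on failure). */
size_t malloc_alignment(size_t nbytes)
{
    void *p = malloc(nbytes);
    size_t align;

    if (p == NULL) {
        return 0;
    }
    align = pointer_alignment(p, 4096);
    free(p);
    return align;
}
```

Calling malloc_alignment with a few different sizes on each platform gives a rough idea of the alignment one gets in practice there.
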
> Another possibility is to allocate an oversized array, check the 
> pointer, and take a range out of it. For instance:
>
> In [32]: a = zeros(10)
>
> In [33]: a.ctypes.data % 32
> Out[33]: 16
>
> The array alignment is 16 bytes, consequently
>
> In [34]: a[2:].ctypes.data % 32
> Out[34]: 0
>
> Voila, 32-byte alignment. I think a short Python routine could do
> this, which ought to serve well for 1D FFTs. Multidimensional arrays
> will be trickier if you want the rows to be aligned. Aligning the
> columns just isn't going to work.
I am not suggesting realigning existing arrays. What I would like numpy 
to support are the following cases:

 - Check whether a given numpy array is SIMD aligned:

/* Simple case: if aligned, use the optimized function; otherwise use the
   non-optimized one. PyArray_ISALIGNED_SIMD is the proposed macro. */
int simd_func(double* in, size_t n);
int nosimd_func(double* in, size_t n);

if (PyArray_ISALIGNED_SIMD(a)) {
    simd_func((double *)PyArray_DATA(a), PyArray_SIZE(a));
} else {
    nosimd_func((double *)PyArray_DATA(a), PyArray_SIZE(a));
}

 - Explicitly request an aligned array from any PyArray_* function
that creates an ndarray, e.g.: ar = PyArray_FROM_OF(a, NPY_SIMD_ALIGNED);
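
To make the first case concrete without the Python headers, here is a rough sketch of what such a check could boil down to on a raw data pointer. Everything in it is an assumption for illustration: SIMD_ALIGNMENT (16 bytes, as SSE wants), is_aligned, and the two sum kernels are made-up names, and the "SIMD" kernel is plain C standing in for real vector code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIMD_ALIGNMENT 16 /* assumed value (SSE); not a numpy constant */

/* Nonzero if p is a multiple of `alignment`, which must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Stand-in for an optimized kernel that would require aligned input. */
double sum_simd(const double *in, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) { /* two accumulators, like a 2-wide vector */
        s0 += in[i];
        s1 += in[i + 1];
    }
    for (; i < n; ++i) {        /* leftover element, if n is odd */
        s0 += in[i];
    }
    return s0 + s1;
}

/* Generic kernel that works for any alignment. */
double sum_generic(const double *in, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += in[i];
    }
    return s;
}

/* Pick the kernel from the alignment of the data pointer. */
double sum_dispatch(const double *in, size_t n)
{
    return is_aligned(in, SIMD_ALIGNMENT) ? sum_simd(in, n)
                                          : sum_generic(in, n);
}
```

The caller never has to know which path was taken, which is the point of doing the check inside the library rather than in user code.
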

Allocating a buffer with a given alignment is not the problem:
there is a POSIX function to do it, and we can easily implement a
function for the OSes that do not support it. This would be done in C, not
in Python.
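
On POSIX systems the function in question is presumably posix_memalign. For systems without it, a fallback can be built on plain malloc by over-allocating and rounding the pointer up. The following is a minimal sketch of that idea, not numpy code: aligned_malloc_fallback and aligned_free_fallback are hypothetical names, and memory obtained this way must be released with the matching free function, never with free() directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Return `size` bytes aligned to `alignment` (a power of two), or NULL.
 * Hypothetical helper; release with aligned_free_fallback(). */
void *aligned_malloc_fallback(size_t size, size_t alignment)
{
    size_t extra;
    unsigned char *raw;
    uintptr_t start, aligned;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* The saved pointer below needs at least pointer alignment itself. */
    if (alignment < sizeof(void *)) {
        alignment = sizeof(void *);
    }

    /* Worst-case padding plus one slot to remember malloc's pointer. */
    extra = alignment - 1 + sizeof(void *);
    if (size > SIZE_MAX - extra) {
        return NULL;
    }
    raw = malloc(size + extra);
    if (raw == NULL) {
        return NULL;
    }

    /* Skip the slot, then round up to the next multiple of alignment. */
    start = (uintptr_t)raw + sizeof(void *);
    aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);

    /* Store the original pointer just below the aligned block. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Free a block obtained from aligned_malloc_fallback(). */
void aligned_free_fallback(void *p)
{
    if (p != NULL) {
        free(((void **)p)[-1]);
    }
}
```

The cost is at most alignment - 1 + sizeof(void *) extra bytes per allocation, which is negligible for arrays large enough to be worth a SIMD code path.
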

cheers,

David