[Numpy-discussion] Byte aligned arrays

Wed Dec 19 18:18:31 EST 2012

On Wed, Dec 19, 2012 at 6:03 PM, Francesc Alted <francesc at continuum.io> wrote:
> On 12/19/12 5:47 PM, Henry Gomersall wrote:
>> On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:
>>> Not sure which interface is more useful to users. On the one hand,
>>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>>> it forces your array size to be a multiple of the SIMD word size,
>>> which might be inconvenient if your code is smart enough to handle
>>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>>> SIMD for most of the array, and then a slow path to handle any partial
>>> word at the end). OTOH, if your code *is* that smart, you should
>>> probably just make it smart enough to handle a partial word at the
>>> beginning as well and then you won't need any special alignment in the
>>> first place, and representing each SIMD word as a single numpy scalar
>>> is an intuitively appealing model of how SIMD works. OTOOH, just
>>> adding a single argument to np.array() is much simpler to explain than
>>> some elaborate scheme involving the creation of special custom dtypes.
>> If it helps, my use-case is in wrapping the FFTW library. This _is_
>> smart enough to deal with unaligned arrays, but it just results in a
>> performance penalty. In the case of an FFT, there are clearly going to
>> be issues with the powers of two indices in the array not lying on a
>> suitable n-byte boundary (which would be the case with a misaligned
>> array), but I imagine it's not unique.
>>
>> The other point is that it's easy to create a suitable power of two
>> array that should always bypass any special case unaligned code (e.g.
>> with floats, any multiple of 4 array length will fill every 16-byte
>> word).
>>
>> Finally, I think there is significant value in auto-aligning the array
>> based on an appropriate inspection of the cpu capabilities (or
>> alternatively, a function that reports back the appropriate SIMD
>> alignment). Again, this makes it easier to wrap libraries that may
>> function with any alignment, but benefit from optimum alignment.
>
> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
> systems (Linux and Mac OSX):

Only by accident, at least on Linux. The pointers returned by GNU libc's
malloc are aligned to at least 8 bytes, but they may not be 16-byte
aligned once you're above the threshold where malloc switches to mmap.
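
A quick way to see what your own platform does is to look at the data
pointer modulo 16. This is only a sketch: the helper name
alignment_offset and the two array sizes are made up for illustration,
and the result depends on the allocator and the NumPy build, so no
particular output should be expected.

```python
import numpy as np

def alignment_offset(arr, boundary=16):
    """Bytes by which arr's data pointer sits past a `boundary`-byte boundary.

    Hypothetical helper for illustration; 0 means the data is aligned.
    """
    return arr.ctypes.data % boundary

# One small allocation and one large one (8 MiB). The large one is assumed
# to be big enough for the allocator to use mmap; that is not guaranteed.
small = np.empty(10, dtype=np.float64)
large = np.empty(1 << 20, dtype=np.float64)

for name, a in (("small", small), ("large", large)):
    print(name, hex(a.ctypes.data), "offset from 16-byte boundary:",
          alignment_offset(a))
```

An offset of 0 on one run says nothing about what the allocator
guarantees, only about what it happened to return.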

The difference between aligned and unaligned transfers between RAM and
SSE registers (e.g. movaps vs. movups) used to be significant. I don't
know whether that's still the case for recent CPUs.
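
If the aligned path does matter for a wrapped library, one common
workaround that needs no changes to NumPy is to over-allocate a byte
buffer and take a view starting at the first aligned address. The
function below is a minimal sketch under that assumption; aligned_empty
and its boundary parameter are hypothetical names, not an existing NumPy
API.

```python
import numpy as np

def aligned_empty(shape, dtype=np.float64, boundary=16):
    """Return an uninitialised array whose data starts on a `boundary`-byte address.

    Hypothetical helper: over-allocates by `boundary` bytes and slices.
    """
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Spare bytes so that an aligned start address always fits in the buffer.
    raw = np.empty(nbytes + boundary, dtype=np.uint8)
    # Distance from the raw pointer up to the next multiple of `boundary`.
    offset = (-raw.ctypes.data) % boundary
    # The returned view keeps `raw` alive through its .base chain.
    return raw[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((3, 5))
print(a.ctypes.data % 16)
```

Note that the alignment only holds for the array itself; slices such as
a[0, 1:] start at a different address and may be misaligned again.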

David