[Numpy-discussion] Byte aligned arrays

Francesc Alted francesc at continuum.io
Wed Dec 19 13:03:56 EST 2012

On 12/19/12 5:47 PM, Henry Gomersall wrote:
> On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:
>> Not sure which interface is more useful to users. On the one hand,
>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>> it forces your array size to be a multiple of the SIMD word size,
>> which might be inconvenient if your code is smart enough to handle
>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>> SIMD for most of the array, and then a slow path to handle any partial
>> word at the end). OTOH, if your code *is* that smart, you should
>> probably just make it smart enough to handle a partial word at the
>> beginning as well and then you won't need any special alignment in the
>> first place, and representing each SIMD word as a single numpy scalar
>> is an intuitively appealing model of how SIMD works. OTOOH, just
>> adding a single argument to np.array() is much simpler to explain than
>> some elaborate scheme involving the creation of special custom dtypes.
> If it helps, my use-case is in wrapping the FFTW library. This _is_
> smart enough to deal with unaligned arrays, but it just results in a
> performance penalty. In the case of an FFT, there are clearly going to
> be issues with the powers of two indices in the array not lying on a
> suitable n-byte boundary (which would be the case with a misaligned
> array), but I imagine it's not unique.
> The other point is that it's easy to create a suitable power of two
> array that should always bypass any special case unaligned code (e.g.
> with floats, any multiple of 4 array length will fill every 16-byte
> word).
> Finally, I think there is significant value in auto-aligning the array
> based on an appropriate inspection of the cpu capabilities (or
> alternatively, a function that reports back the appropriate SIMD
> alignment). Again, this makes it easier to wrap libraries that may
> function with any alignment, but benefit from optimum alignment.

Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
Linux and Mac OSX systems:

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102b97b60, size 8, offset 0 at 0x102e7c130>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102ba64e0, size 8, offset 0 at 0x102e7c430>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102b86700, size 8, offset 0 at 0x102e7c5b0>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102b981d0, size 8, offset 0 at 0x102e7c5f0>

[Note that the last hex digit of every address above is 0, i.e. each
address is a multiple of 16]
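
The same check can be done programmatically instead of by eye. The
following is only a minimal sketch: `alignment_offset` is a hypothetical
helper name, and it reads the data pointer from `arr.ctypes.data` rather
than from the buffer repr. An offset of 0 means the data starts on the
requested boundary; what it actually prints depends on the platform and
the allocator.

```python
import numpy as np

def alignment_offset(arr, boundary=16):
    """Return the data address of `arr` modulo `boundary` (0 means aligned)."""
    # arr.ctypes.data is the integer address of the first data byte
    return arr.ctypes.data % boundary

# Inspect a few freshly allocated arrays, as in the session above
samples = [np.empty(1) for _ in range(4)]
offsets = [alignment_offset(a) for a in samples]
print(offsets)
```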

The only scenario I can see where this would produce unaligned arrays is
on machines with AVX, whose 256-bit registers call for 32-byte
alignment.  But given that the Intel architecture is making great
strides in fetching unaligned data, I'd be surprised if the difference
in performance were even noticeable.
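
For anyone who does need a 32-byte (AVX) boundary today, a common
workaround is to over-allocate a byte buffer and take a view that starts
at the first suitably aligned address. This is a rough sketch, not a
NumPy API: `aligned_empty` is a hypothetical helper, and the default of
32 bytes is an assumption about the target SIMD width.

```python
import numpy as np

def aligned_empty(n, dtype=np.float64, boundary=32):
    """Hypothetical helper: uninitialized 1-D array of `n` items whose
    data starts on a `boundary`-byte address."""
    dtype = np.dtype(dtype)
    nbytes = n * dtype.itemsize
    # Over-allocate so an aligned start address is guaranteed to exist
    buf = np.empty(nbytes + boundary, dtype=np.uint8)
    # Bytes to skip to reach the next multiple of `boundary`
    offset = (-buf.ctypes.data) % boundary
    # The view keeps `buf` alive through its .base attribute
    return buf[offset:offset + nbytes].view(dtype)

a = aligned_empty(1000)
print(a.ctypes.data % 32)  # 0 by construction
```

The price is up to `boundary` wasted bytes per array, and the result is
a view rather than an array that owns its data.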

Can you tell us what difference in performance you are seeing between an
AVX-aligned array and one that is not AVX-aligned?  Just curious.
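
A rough way to take such a measurement from Python might look like the
sketch below. The helper `view_at_offset`, the array size and the use of
`np.add` as the workload are all assumptions for illustration. Whether
NumPy uses AVX at all for this operation depends on the build, so the
numbers say little about what a library such as FFTW would see.

```python
import timeit
import numpy as np

N = 1000000      # number of float64 elements (arbitrary choice)
BOUNDARY = 32    # AVX register width in bytes

def view_at_offset(n, shift):
    """Return a float64 view whose data address lies `shift` bytes past
    a BOUNDARY-aligned address (shift=0 gives an aligned view)."""
    buf = np.empty(n * 8 + BOUNDARY + shift, dtype=np.uint8)
    start = (-buf.ctypes.data) % BOUNDARY + shift
    return buf[start:start + n * 8].view(np.float64)

aligned = view_at_offset(N, 0)
shifted = view_at_offset(N, 16)   # aligned to 16 bytes but not to 32
aligned[:] = 1.0                  # initialize so we do not time garbage values
shifted[:] = 1.0

for name, x in [("32-byte aligned", aligned), ("16-byte aligned", shifted)]:
    # Best of 5 runs of 100 in-place additions each
    t = min(timeit.repeat(lambda: np.add(x, 1.0, out=x), number=100, repeat=5))
    print("%s: %.1f us per call" % (name, t / 100 * 1e6))
```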

Francesc Alted
