On Wed, Dec 19, 2012 at 6:03 PM, Francesc Alted <francesc@continuum.io> wrote:
On 12/19/12 5:47 PM, Henry Gomersall wrote:
On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:
Not sure which interface is more useful to users. On the one hand, using funny dtypes makes regular non-SIMD access more cumbersome, and it forces your array size to be a multiple of the SIMD word size, which might be inconvenient if your code is smart enough to handle arbitrary-sized arrays with partial SIMD acceleration (i.e., using SIMD for most of the array, and then a slow path to handle any partial word at the end). OTOH, if your code *is* that smart, you should probably just make it smart enough to handle a partial word at the beginning as well and then you won't need any special alignment in the first place, and representing each SIMD word as a single numpy scalar is an intuitively appealing model of how SIMD works. OTOOH, just adding a single argument to np.array() is much simpler to explain than some elaborate scheme involving the creation of special custom dtypes.
If it helps, my use-case is in wrapping the FFTW library. This _is_ smart enough to deal with unaligned arrays, but it just results in a performance penalty. In the case of an FFT, there are clearly going to be issues with the power-of-two indices in the array not lying on a suitable n-byte boundary (which would be the case with a misaligned array), but I imagine it's not unique.
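To make the "partial word at the beginning and at the end" idea concrete, here is a minimal sketch of the bookkeeping only, in pure Python. The function name split_for_simd and its parameters are hypothetical and not part of NumPy or FFTW; the defaults assume 4-byte floats and 16-byte SIMD words.

```python
def split_for_simd(address, n_items, itemsize=4, alignment=16):
    """Split n_items starting at byte `address` into (head, body, tail) counts.

    head: items before the first aligned boundary (slow scalar path)
    body: items that fill whole aligned SIMD words (fast path)
    tail: leftover items after the last whole word (slow scalar path)
    """
    if alignment % itemsize != 0:
        raise ValueError("alignment must be a multiple of itemsize")
    if address % itemsize != 0:
        # Items straddle every word boundary, so no aligned body exists.
        return n_items, 0, 0
    per_word = alignment // itemsize
    # Bytes from `address` up to the next aligned boundary (0 if aligned).
    misalignment = (-address) % alignment
    head = min(n_items, misalignment // itemsize)
    body = ((n_items - head) // per_word) * per_word
    tail = n_items - head - body
    return head, body, tail


# An aligned buffer whose length is a multiple of 4 floats has no slow path.
print(split_for_simd(0, 1024))   # (0, 1024, 0)
# The same buffer starting 8 bytes off needs a partial word at both ends.
print(split_for_simd(8, 1024))   # (2, 1020, 2)
```

With an aligned start and a length that is a multiple of the word size, head and tail are both zero, so the special-case code is never exercised.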
The other point is that it's easy to create a suitable power-of-two array that should always bypass any special-case unaligned code (e.g. with floats, any array length that is a multiple of 4 will fill every 16-byte word).
Finally, I think there is significant value in auto-aligning the array based on an appropriate inspection of the CPU capabilities (or alternatively, a function that reports back the appropriate SIMD alignment). Again, this makes it easier to wrap libraries that may function with any alignment, but benefit from optimum alignment.
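For reference, one common way to get an aligned array without any change to NumPy is to over-allocate a byte buffer and take a view starting at the first aligned address. The sketch below assumes that approach; aligned_empty is a hypothetical helper, not an existing NumPy function, and the alignment is passed in by the caller rather than detected from the CPU.

```python
import numpy as np


def aligned_empty(shape, dtype=np.float64, alignment=16):
    """Return an uninitialised C-contiguous array whose data pointer is a
    multiple of `alignment` bytes (hypothetical helper, not a NumPy API)."""
    dtype = np.dtype(dtype)
    n_bytes = int(np.prod(shape, dtype=np.intp)) * dtype.itemsize
    # Over-allocate so that an aligned start always exists inside the buffer.
    raw = np.empty(n_bytes + alignment, dtype=np.uint8)
    # Distance in bytes from the raw pointer to the next aligned address.
    offset = (-raw.ctypes.data) % alignment
    # The view keeps `raw` alive through its .base chain.
    return raw[offset:offset + n_bytes].view(dtype).reshape(shape)


a = aligned_empty((3, 5), dtype=np.float32, alignment=32)
print(a.ctypes.data % 32)   # 0
```

The cost is at most `alignment` wasted bytes per array. Choosing the alignment automatically would need a separate CPU query, which this sketch does not attempt.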
Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on systems (Linux and Mac OSX):
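A check of this kind can be sketched as follows. This is an illustrative snippet rather than the output from the original message; is_aligned is a hypothetical helper, and the result for a freshly allocated array depends on the platform's allocator.

```python
import numpy as np


def is_aligned(arr, alignment=16):
    """True if the array's data pointer is a multiple of `alignment` bytes."""
    return arr.ctypes.data % alignment == 0


a = np.empty(1000, dtype=np.float64)
print(is_aligned(a))        # depends on the allocator
# A view starting one float64 in begins 8 bytes later, so its 16-byte
# alignment is the opposite of the parent's.
print(is_aligned(a[1:]))
```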
Only by accident, at least on Linux. The pointers returned by the GNU libc malloc are at least 8-byte aligned, but they may not be 16-byte aligned when you're above the threshold where mmap is used for malloc. The difference between aligned and unaligned RAM <-> SSE register moves (e.g. movaps, movups) used to be significant. Don't know if that's still the case for recent CPUs.
David