[Numpy-discussion] Byte aligned arrays

Fri Dec 21 05:34:39 EST 2012

On 12/20/12 7:35 PM, Henry Gomersall wrote:
> On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote:
>> On 12/20/12 9:53 AM, Henry Gomersall wrote:
>>> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>>>> The only scenario that I see that this would create unaligned arrays
>>>> is for machines having AVX.  But provided that the Intel architecture
>>>> is making great strides in fetching unaligned data, I'd be surprised
>>>> that the difference in performance would be even noticeable.
>>>>
>>>> Can you tell us what difference in performance you are seeing between
>>>> an AVX-aligned array and one that is not AVX-aligned?  Just curious.
>>> Further to this point, from an Intel article...
>>>
>>> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>>>
>>> "Aligning data to vector length is always recommended. When using Intel
>>> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
>>> bytes. Similarly, to achieve best results use Intel AVX instructions on
>>> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
>>> instructions on unaligned 32-byte vectors means that every second load
>>> will be across a cache-line split, since the cache line is 64 bytes.
>>> This doubles the cache line split rate compared to Intel SSE code that
>>> uses 16-byte vectors. A high cache-line split rate in memory-intensive
>>> code is extremely likely to cause performance degradation. For that
>>> reason, it is highly recommended to align the data to 32 bytes for use
>>> with Intel AVX."
>>>
>>> Though it would be nice to put together a little example of this!
>> Indeed, an example is what I was looking for.  So provided that I have
>> access to an AVX capable machine (having 6 physical cores), and that MKL
>> 10.3 has support for AVX, I have made some comparisons using the
>> Anaconda Python distribution (it ships with most packages linked against
>> MKL 10.3).
> <snip>
>
>> All in all, it is not clear that AVX alignment would have an advantage,
>> even for memory-bound problems.  But of course, if the Intel people are
>> saying that AVX alignment is important, it is because they have use
>> cases that support this.  It is just that I'm having a difficult time
>> finding these cases.
> Thanks for those examples, they were very interesting. I managed to
> temporarily get my hands on a machine with AVX and I have shown some
> speed-up with aligned arrays.
>
> FFT (using my wrappers) gives about a 15% speedup.
>
> Also this convolution code:
> https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c
>
> Shows a small but repeatable speed-up (a few %) when using some aligned
> loads (as many as I can work out to use!).
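
To make the cache-line arithmetic in the Intel quote above concrete, here
is a small sketch that counts how many consecutive vector loads straddle a
64-byte cache line.  It is plain Python, not a benchmark; the function
name and the 8-byte misalignment are just assumptions for illustration.

```python
CACHE_LINE = 64  # bytes, the cache-line size given in the Intel article


def split_fraction(vector_bytes, misalignment, n_loads=1024):
    """Fraction of consecutive vector loads that cross a cache-line boundary.

    `misalignment` is the assumed byte offset of the first load from a
    cache-line boundary (hypothetical parameter, for illustration only).
    """
    splits = 0
    for i in range(n_loads):
        start = misalignment + i * vector_bytes
        end = start + vector_bytes - 1  # last byte touched by this load
        if start // CACHE_LINE != end // CACHE_LINE:
            splits += 1
    return splits / n_loads


sse_aligned = split_fraction(16, 0)    # 16-byte loads, aligned
sse_unaligned = split_fraction(16, 8)  # 16-byte loads, off by 8 bytes
avx_aligned = split_fraction(32, 0)    # 32-byte loads, aligned
avx_unaligned = split_fraction(32, 8)  # 32-byte loads, off by 8 bytes
```

With these assumptions, one in four unaligned 16-byte loads is split and
one in two unaligned 32-byte loads is split, which is the doubling the
article describes.  It says nothing about how much each split costs on a
given CPU.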

Okay, so a 15% speedup is significant, yes.  I'm still wondering why I did
not get any speedup at all using MKL, but the reason is probably that it
handles the unaligned edges of the datasets first and then uses aligned
access for the rest of the data (just guessing here).
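
For anyone wanting to experiment, below is a minimal sketch of how one
might get a 32-byte aligned NumPy array and count how many leading
elements a library would have to process before reaching an aligned
address.  The helper names `aligned_empty` and `peel_count` are made up
for this example, and whether MKL actually peels elements this way is
only my guess, as said above.

```python
import numpy as np


def aligned_empty(n, dtype=np.float64, alignment=32):
    """Return an uninitialised 1-D array whose data address is a
    multiple of `alignment` bytes (over-allocate, then slice)."""
    dtype = np.dtype(dtype)
    nbytes = n * dtype.itemsize
    # Over-allocate so an aligned start always exists inside the buffer.
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype)


def peel_count(arr, alignment=32):
    """Number of leading elements before the first aligned address.

    Assumes the data address is already a multiple of the item size.
    """
    misalign = (-arr.ctypes.data) % alignment
    return min(arr.size, misalign // arr.itemsize)


a = aligned_empty(1000)  # starts on a 32-byte boundary
b = a[1:]                # same memory, shifted by one 8-byte double
```

Here `a` needs no peeling, while `b` starts 8 bytes past a boundary, so
three doubles would have to be handled before the first aligned 32-byte
load.  Timing the same operation on `a` and `b` is one simple way to look
for an alignment effect.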

-- 
Francesc Alted