On 12/20/12 7:35 PM, Henry Gomersall wrote:
On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote:
On 12/20/12 9:53 AM, Henry Gomersall wrote:
On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
The only scenario I can see where this would create unaligned arrays is on machines with AVX. But given that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised if the difference in performance were even noticeable.
Can you tell us what difference in performance you are seeing between an AVX-aligned array and one that is not AVX-aligned? Just curious. Further to this point, from an Intel article...
"Aligning data to vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve best results use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cache-line split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX."
http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on...
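As a quick aside to the quote above, one can inspect the alignment of a NumPy array's data pointer directly via the standard `ctypes.data` attribute. This snippet is not from the thread, just a minimal sketch of the check; note that plain `np.empty` makes no promise of 32-byte alignment, so the AVX check may print either result:

```python
import numpy as np

a = np.empty(1024, dtype=np.float64)

# The data pointer is the raw address of the first element.
addr = a.ctypes.data
print("16-byte aligned (SSE):", addr % 16 == 0)
print("32-byte aligned (AVX):", addr % 32 == 0)
```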
Though it would be nice to put together a little example of this! Indeed, an example is what I was looking for. So, as I have access to an AVX-capable machine (with 6 physical cores), and MKL 10.3 has support for AVX, I have made some comparisons using the Anaconda Python distribution (it ships with most packages linked against MKL 10.3). <snip>
All in all, it is not clear that AVX alignment offers an advantage, even for memory-bound problems. But of course, if the Intel people say that AVX alignment is important, it is because they have use cases that back this up. It is just that I am having a difficult time finding those cases. Thanks for those examples, they were very interesting. I managed to temporarily get my hands on a machine with AVX and I have shown some speed-up with aligned arrays.
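For reference, here is one common way to get a 32-byte-aligned array in pure NumPy, by over-allocating a raw byte buffer and slicing at the right offset. This is a sketch, not code from the thread, and the helper name `aligned_empty` is my own:

```python
import numpy as np

def aligned_empty(shape, dtype=np.float64, alignment=32):
    """Return an uninitialized array whose data pointer is aligned
    to `alignment` bytes, built by over-allocating a byte buffer
    and slicing at the correct offset."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Over-allocate so an aligned offset is guaranteed to exist.
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((4096,), np.float64, alignment=32)
print(a.ctypes.data % 32)  # 0
```

The slice keeps a reference to the underlying buffer via `a.base`, so the memory stays alive as long as the view does.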
FFT (using my wrappers) gives about a 15% speedup.
Also this convolution code: https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c
Shows a small but repeatable speed-up (a few %) when using some aligned loads (as many as I can work out how to use!).
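A way to reproduce this kind of measurement without the C code is to time the same operation on an aligned view and a deliberately misaligned view of equivalent buffers. This is a sketch with made-up sizes (the helper `view_with_offset` is my own name); actual speed-ups depend heavily on the CPU and the BLAS/compiler in use:

```python
import timeit
import numpy as np

def view_with_offset(n, offset_bytes, alignment=32):
    """float64 view of length n whose data pointer sits
    `offset_bytes` past a 32-byte boundary."""
    buf = np.empty(n * 8 + alignment + offset_bytes, dtype=np.uint8)
    start = (-buf.ctypes.data) % alignment + offset_bytes
    return buf[start:start + n * 8].view(np.float64)

n = 1_000_000
aligned = view_with_offset(n, 0)
misaligned = view_with_offset(n, 8)  # 8 bytes past a 32-byte boundary
aligned[:] = 1.0
misaligned[:] = 1.0

for name, x in [("aligned", aligned), ("misaligned", misaligned)]:
    t = timeit.timeit(lambda x=x: x * 2.0, number=100)
    print(name, t)
```

The misaligned view is still 8-byte aligned (required for sane float64 access); only the 32-byte AVX alignment differs between the two cases.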
Okay, a 15% speedup is significant, yes. I'm still wondering why I did not get any speedup at all using MKL, but the reason is probably that it handles the unaligned edges of the datasets first and then uses aligned access for the rest of the data (just guessing here). -- Francesc Alted