[Numpy-discussion] Byte aligned arrays

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Fri Dec 21 07:35:36 EST 2012


On 12/20/2012 03:23 PM, Francesc Alted wrote:
> On 12/20/12 9:53 AM, Henry Gomersall wrote:
>> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>>> The only scenario I see where this would create unaligned arrays is
>>> for machines having AVX.  But given that the Intel architecture is
>>> making great strides in fetching unaligned data, I'd be surprised if
>>> the difference in performance were even noticeable.
>>>
>>> Can you tell us what difference in performance you are seeing between an
>>> AVX-aligned array and one that is not AVX-aligned?  Just curious.
>> Further to this point, from an Intel article...
>>
>> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>>
>> "Aligning data to vector length is always recommended. When using Intel
>> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
>> bytes. Similarly, to achieve best results use Intel AVX instructions on
>> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
>> instructions on unaligned 32-byte vectors means that every second load
>> will be across a cache-line split, since the cache line is 64 bytes.
>> This doubles the cache line split rate compared to Intel SSE code that
>> uses 16-byte vectors. A high cache-line split rate in memory-intensive
>> code is extremely likely to cause performance degradation. For that
>> reason, it is highly recommended to align the data to 32 bytes for use
>> with Intel AVX."
>>
>> Though it would be nice to put together a little example of this!
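
As a small illustration of the check involved (assuming only NumPy; the
array's data address is exposed as a.ctypes.data):

import numpy as np

a = np.empty(10**7, dtype="f8")    # a default NumPy allocation
addr = a.ctypes.data               # integer address of the data buffer
print("16-byte aligned:", addr % 16 == 0)   # what SSE wants
print("32-byte aligned:", addr % 32 == 0)   # what full-width AVX wants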
>
> Indeed, an example is what I was looking for.  So, given that I have
> access to an AVX-capable machine (with 6 physical cores) and that MKL
> 10.3 has support for AVX, I have made some comparisons using the
> Anaconda Python distribution (it ships with most packages linked against
> MKL 10.3).
>
> Here is a first example using a DGEMM operation.  First, using a NumPy
> that is not turbo-loaded with MKL:
>
> In [34]: a = np.linspace(0,1,1e7)
>
> In [35]: b = a.reshape(1000, 10000)
>
> In [36]: c = a.reshape(10000, 1000)
>
> In [37]: time d = np.dot(b,c)
> CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
> Wall time: 7.63 s
>
> In [38]: time d = np.dot(c,b)
> CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
> Wall time: 78.89 s
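
For reference, the GFlop/s figures quoted below follow from the usual
2*M*K*N flop count for a matrix product; a quick back-of-the-envelope
check with the shapes and wall times above:

# dgemm does roughly 2*M*K*N floating-point operations
flops_bc = 2 * 1000 * 10000 * 1000     # b(1000x10000) . c(10000x1000) -> 2e10
flops_cb = 2 * 10000 * 1000 * 10000    # c(10000x1000) . b(1000x10000) -> 2e11
print(flops_bc / 7.63e9, flops_cb / 78.89e9)   # ~2.6 and ~2.5 GFlop/s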
>
> This is getting around 2.6 GFlop/s.  Now, with an MKL 10.3 NumPy and
> AVX-unaligned data:
>
> In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
> Out[7]: '0x7fcdef3b4010'  # 16-byte alignment
>
> In [8]: a = np.ndarray(1e7, "f8", p)
>
> In [9]: a[:] = np.linspace(0,1,1e7)
>
> In [10]: b = a.reshape(1000, 10000)
>
> In [11]: c = a.reshape(10000, 1000)
>
> In [37]: %timeit d = np.dot(b,c)
> 10 loops, best of 3: 164 ms per loop
>
> In [38]: %timeit d = np.dot(c,b)
> 1 loops, best of 3: 1.65 s per loop
>
> That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).
>
> Now, using MKL 10.3 and AVX-aligned data:
>
> In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p2))
> Out[21]: '0x7f8cb9598010'  # 16-byte alignment
>
> In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes (now 32-byte aligned)
>
> In [23]: a2[:] = np.linspace(0,1,1e7)
>
> In [24]: b2 = a2.reshape(1000, 10000)
>
> In [25]: c2 = a2.reshape(10000, 1000)
>
> In [35]: %timeit d2 = np.dot(b2,c2)
> 10 loops, best of 3: 163 ms per loop
>
> In [36]: %timeit d2 = np.dot(c2,b2)
> 1 loops, best of 3: 1.67 s per loop
>
> So, again, around 120 GFlop/s, and the difference with respect to
> unaligned AVX data is negligible.
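
For anyone who wants to reproduce this, the manual offset trick above
generalizes into a small helper.  A minimal sketch (the function name and
bookkeeping are mine, nothing that ships with NumPy), assuming the requested
alignment is a power of two and a multiple of the 8-byte item size:

import ctypes
import numpy as np

def aligned_empty_f8(n, alignment=32):
    # Over-allocate a ctypes buffer, then start the array at the first
    # address inside it that falls on the requested boundary.
    itemsize = 8
    buf = ctypes.create_string_buffer(int(n) * itemsize + alignment)
    addr = ctypes.addressof(buf)
    offset = (-addr) % alignment      # bytes to skip to reach the boundary
    return np.ndarray(int(n), "f8", buf, offset=offset)

With this, aligned_empty_f8(1e7).ctypes.data % 32 should come out as 0, and
the returned array keeps the buffer alive through its base attribute.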
>
> One may argue that DGEMM is CPU-bound and that memory access plays
> little role here, and that is certainly true.  So, let's go with a more
> memory-bound problem, like computing a transcendental function with
> numexpr.  First, with NumPy and numexpr with no MKL support:
>
> In [8]: a = np.linspace(0,1,1e8)
>
> In [9]: %time b = np.sin(a)
> CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
> Wall time: 1.42 s
>
> In [10]: import numexpr as ne
>
> In [12]: %time b = ne.evaluate("sin(a)")
> CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
> Wall time: 0.37 s
>
> This time it is around 4x faster than the regular 'sin' in libc, and about
> the same speed as a memcpy():
>
> In [13]: %time c = a.copy()
> CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
> Wall time: 0.39 s
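
As a rough sanity check on the memory-bandwidth interpretation, the copy
above reads and writes 1e8 float64 values, which puts the effective
bandwidth at a few GB/s:

nbytes = int(1e8) * 8              # 800 MB read plus 800 MB written
print(2 * nbytes / 0.39 / 1e9)     # ~4.1 GB/s effective bandwidth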
>
> Now, with an MKL-aware numexpr and non-AVX alignment:
>
> In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
> Out[8]: '0x7fce435da010'  # 16-byte alignment
>
> In [9]: a = np.ndarray(1e8, "f8", p)
>
> In [10]: a[:] = np.linspace(0,1,1e8)
>
> In [11]: %time b = ne.evaluate("sin(a)")
> CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
> Wall time: 0.15 s
>
> That is, more than 2x faster than a memcpy() in this system, meaning
> that the problem is truly memory-bound.  So now, with an AVX-aligned
> buffer:
>
> In [14]: a2 = a[2:]  # skip the first 16 bytes
>
> In [15]: %time b = ne.evaluate("sin(a2)")
> CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
> Wall time: 0.16 s
>
> Again, times are very close.  Just to make sure, let's use the timeit magic:
>
> In [16]: %timeit b = ne.evaluate("sin(a)")
> 10 loops, best of 3: 159 ms per loop   # unaligned
>
> In [17]: %timeit b = ne.evaluate("sin(a2)")
> 10 loops, best of 3: 154 ms per loop   # aligned
>
> All in all, it is not clear that AVX alignment has an advantage, even
> for memory-bound problems.  But of course, if the Intel people are saying
> that AVX alignment is important, it is because they have use cases that
> back this up.  It is just that I'm having a difficult time finding
> these cases.

Hmm, I think it is the opposite: it is for CPU-bound problems that
alignment would have an effect. I.e., the MOVUPD would be doing some
shuffling etc. to get around the non-alignment, which only matters if
the data is already in cache.

(There are other instructions, like the streaming (non-temporal) stores
and the direct writes and so on, which are much more important for the
non-cached case. At least that's my understanding.)
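
One way to probe that hypothesis from Python would be to repeat the
aligned/unaligned comparison on arrays small enough to stay resident in
cache, so the loads never have to go to DRAM.  A sketch, reusing the
hypothetical aligned_empty_f8 helper from above:

import numpy as np

n = 4096                                  # 32 KB of float64 per array, cache-resident
aligned = aligned_empty_f8(n)             # data pointer on a 32-byte boundary
unaligned = aligned_empty_f8(n + 1)[1:]   # shifted by one element: only 8-byte aligned
for x in (aligned, unaligned):
    x[:] = np.linspace(0, 1, n)

# Then, in IPython, time a cheap vectorizable expression on each:
#   %timeit aligned * aligned + aligned
#   %timeit unaligned * unaligned + unaligned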

Dag Sverre


