[Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

Thu Mar 7 12:47:12 EST 2013

On 3/6/13 7:42 PM, Kurt Smith wrote:
> And regarding performance, doing simple timings shows a 30%-ish
> slowdown for unaligned operations:
>
> In [36]: %timeit packed_arr['b']**2
> 100 loops, best of 3: 2.48 ms per loop
>
> In [37]: %timeit aligned_arr['b']**2
> 1000 loops, best of 3: 1.9 ms per loop

Hmm, that clearly depends on the architecture.  On my machine:

In [1]: import numpy as np

In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)

In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)

In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)

In [6]: baligned = aligned_arr['b']

In [7]: bpacked = packed_arr['b']

In [8]: %timeit baligned**2
1000 loops, best of 3: 1.96 ms per loop

In [9]: %timeit bpacked**2
100 loops, best of 3: 7.84 ms per loop
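As an aside, the layout difference behind these two timings can be inspected directly. The following is a minimal sketch using the same two dtypes as above; the helper name describe() is made up for illustration, and the values in the comments assume a platform where 'i8' needs 8-byte alignment.

```python
import numpy as np

aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

def describe(dt):
    """Return (itemsize, byte offset of field 'b') for a structured dtype."""
    # describe() is a hypothetical helper, not part of NumPy.
    return dt.itemsize, dt.fields['b'][1]

# Packed: 'b' starts right after the 1-byte 'a', so each record is 9 bytes
# and 'b' sits at an odd offset.
print('packed :', describe(packed_dt))
# Aligned: padding is inserted so 'b' starts at a multiple of its alignment.
print('aligned:', describe(aligned_dt))

n = 1000  # small array; only the layout matters here
bpacked = np.ones((n,), dtype=packed_dt)['b']
baligned = np.ones((n,), dtype=aligned_dt)['b']

# NumPy records whether each view satisfies the alignment of its dtype.
print('bpacked.flags.aligned :', bpacked.flags.aligned)
print('baligned.flags.aligned:', baligned.flags.aligned)
```

So every element of the packed column is read from an address that is not a multiple of the 'i8' alignment, which is what the slower timing above is measuring.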

That is, the unaligned column is 4x slower (!).  numexpr allows somewhat 
better results:

In [10]: import numexpr

In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop

In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop

Yes, in this case the unaligned array is actually faster (by as much as
30%).  I think the reason is that numexpr optimizes unaligned access by
copying the data, chunk by chunk, into internal buffers that fit in L1
cache.  Apparently this is very beneficial in this case (not sure why,
though).
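To make the idea concrete, here is a rough Python-level sketch of the copy-into-a-small-buffer approach. It is not numexpr's actual implementation, only an illustration of the technique; the function name chunked_square and the chunk size of 2048 elements (16 KB of 'i8', chosen on the assumption that it fits comfortably in a typical L1 data cache) are made up for this example.

```python
import numpy as np

def chunked_square(col, chunk=2048):
    """Square a 1-D column block by block through an aligned scratch buffer.

    Hypothetical helper: illustrates the blocking idea, nothing more.
    """
    out = np.empty(col.shape, dtype=col.dtype)
    # Freshly allocated, so contiguous and aligned regardless of `col`.
    scratch = np.empty(chunk, dtype=col.dtype)
    for start in range(0, col.shape[0], chunk):
        block = col[start:start + chunk]       # possibly unaligned view
        buf = scratch[:block.shape[0]]
        np.copyto(buf, block)                  # gather into the aligned buffer
        # Compute on the aligned buffer, write into the aligned output.
        np.multiply(buf, buf, out=out[start:start + block.shape[0]])
    return out

packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
packed_arr = np.ones((10**5,), dtype=packed_dt)
packed_arr['b'] = np.arange(10**5)
bpacked = packed_arr['b']

result = chunked_square(bpacked)
print(np.array_equal(result, bpacked**2))
```

The only unaligned reads happen inside the copy; the arithmetic itself always runs on aligned, contiguous memory. Whether that wins in practice depends on the machine, and at the Python level the loop overhead may well hide any gain.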

>
> Whereas summing shows just a 10%-ish slowdown:
>
> In [38]: %timeit packed_arr['b'].sum()
> 1000 loops, best of 3: 1.29 ms per loop
>
> In [39]: %timeit aligned_arr['b'].sum()
> 1000 loops, best of 3: 1.14 ms per loop

On my machine:

In [14]: %timeit baligned.sum()
1000 loops, best of 3: 1.03 ms per loop

In [15]: %timeit bpacked.sum()
100 loops, best of 3: 3.79 ms per loop

Again, the roughly 4x slowdown shows up here.  Using numexpr:

In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop

In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop

Again, the unaligned case is (slightly) better.  In this case numexpr is
a bit slower than NumPy (on the aligned column) because sum() is not
parallelized internally.  Hmm, given that, I'm wondering if some
internal copies to L1 in NumPy could help improve unaligned
performance.  Worth a try?
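One cheap way to probe this before touching NumPy internals is to time the copy-first variants from Python. The sketch below is only that, a probe: the three function names and the chunk size are made up for illustration, the Python loop adds overhead that a C implementation would not have, and no particular outcome is implied.

```python
import timeit
import numpy as np

packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
bpacked = np.ones((10**6,), dtype=packed_dt)['b']

def direct_sum(col):
    # Baseline: reduce the unaligned view as-is.
    return col.sum()

def copy_then_sum(col):
    # One full-size aligned, contiguous copy, then reduce.
    return np.ascontiguousarray(col).sum()

def chunked_sum(col, chunk=2048):
    # Copy small blocks into a reusable aligned buffer and reduce each one.
    # chunk=2048 is an assumed size (16 KB of 'i8'), not a tuned value.
    scratch = np.empty(chunk, dtype=col.dtype)
    total = 0  # Python int, so the running total itself cannot overflow
    for start in range(0, col.shape[0], chunk):
        block = col[start:start + chunk]
        buf = scratch[:block.shape[0]]
        np.copyto(buf, block)
        total += int(buf.sum())
    return total

candidates = (direct_sum, copy_then_sum, chunked_sum)
results = {f.__name__: int(f(bpacked)) for f in candidates}

for f in candidates:
    # Best of 3 runs of 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: f(bpacked), number=10, repeat=3)) / 10
    print('%-14s %8.3f ms  (sum=%d)' % (f.__name__, best * 1e3, results[f.__name__]))
```

If copy_then_sum already beats direct_sum on a given machine, that would suggest buffered copies inside the reduction loop are worth investigating.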

-- 
Francesc Alted