[Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)
Francesc Alted
francesc at continuum.io
Thu Mar 7 12:47:12 EST 2013
On 3/6/13 7:42 PM, Kurt Smith wrote:
> And regarding performance, doing simple timings shows a 30%-ish
> slowdown for unaligned operations:
>
> In [36]: %timeit packed_arr['b']**2
> 100 loops, best of 3: 2.48 ms per loop
>
> In [37]: %timeit aligned_arr['b']**2
> 1000 loops, best of 3: 1.9 ms per loop
Hmm, that clearly depends on the architecture. On my machine:
In [1]: import numpy as np
In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
In [6]: baligned = aligned_arr['b']
In [7]: bpacked = packed_arr['b']
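For reference, the layout difference between the two dtypes above can be checked directly. This is a small sketch (not from the original post): with align=True, 'b' is padded out to an 8-byte offset, so the struct itemsize grows from 9 bytes (packed) to 16 bytes (aligned).

```python
import numpy as np

aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

# dtype.fields[name] is a (field dtype, byte offset) pair
print(aligned_dt.itemsize, aligned_dt.fields['b'][1])  # 16 8
print(packed_dt.itemsize, packed_dt.fields['b'][1])    # 9 1
```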
In [8]: %timeit baligned**2
1000 loops, best of 3: 1.96 ms per loop
In [9]: %timeit bpacked**2
100 loops, best of 3: 7.84 ms per loop
That is, the unaligned column is 4x slower (!). numexpr gives somewhat
better results:
In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop
In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop
Yes, in this case the unaligned array goes faster (by as much as 30%). I
think the reason is that numexpr optimizes the unaligned access by
copying chunks into internal buffers that fit in L1 cache. Apparently
this is very beneficial in this case (not sure why, though).
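The buffering strategy described above can be sketched in pure NumPy. This is a hypothetical illustration of the idea, not numexpr's actual implementation: copy each chunk of the (possibly unaligned) column into a small contiguous buffer before operating on it, so the inner loop runs on aligned data.

```python
import numpy as np

def chunked_square(col, chunk=4096):
    # Hypothetical sketch: square an array chunk by chunk, staging each
    # chunk in a contiguous, aligned scratch buffer first. The chunk
    # size is an assumption meant to fit in L1 cache.
    out = np.empty(col.shape, dtype=col.dtype)
    buf = np.empty(chunk, dtype=col.dtype)
    for start in range(0, len(col), chunk):
        n = min(chunk, len(col) - start)
        buf[:n] = col[start:start + n]       # aligned copy of the chunk
        out[start:start + n] = buf[:n] ** 2  # compute on the buffer
    return out
```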
>
> Whereas summing shows just a 10%-ish slowdown:
>
> In [38]: %timeit packed_arr['b'].sum()
> 1000 loops, best of 3: 1.29 ms per loop
>
> In [39]: %timeit aligned_arr['b'].sum()
> 1000 loops, best of 3: 1.14 ms per loop
On my machine:
In [14]: %timeit baligned.sum()
1000 loops, best of 3: 1.03 ms per loop
In [15]: %timeit bpacked.sum()
100 loops, best of 3: 3.79 ms per loop
Again, the 4x slowdown appears here. Using numexpr:
In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop
In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop
Again, the unaligned case is slightly better. In this case numexpr is a
bit slower than NumPy because sum() is not parallelized internally.
Hmm, given that, I'm wondering whether some internal copies to L1-sized
buffers in NumPy could help improve unaligned performance. Worth a try?
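As a rough experiment for that idea, here is a hypothetical chunked reduction (not from the original thread): accumulate the sum over small contiguous copies of the unaligned column, so each partial sum runs on aligned, cache-resident data.

```python
import numpy as np

def chunked_sum(col, chunk=4096):
    # Hypothetical sketch: sum an unaligned column via aligned copies.
    # The chunk size is an assumption; tune it to the target L1 cache.
    total = col.dtype.type(0)
    buf = np.empty(chunk, dtype=col.dtype)
    for start in range(0, len(col), chunk):
        n = min(chunk, len(col) - start)
        buf[:n] = col[start:start + n]  # aligned copy of the chunk
        total += buf[:n].sum()          # reduce over the buffer
    return total
```

Whether the copy overhead is repaid depends on the architecture, which is consistent with the diverging timings reported earlier in the thread.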
--
Francesc Alted