aligned / unaligned structured dtype behavior (was: GSOC 2013)
On Wed, Mar 6, 2013 at 12:12 PM, Kurt Smith kwmsmith@gmail.com wrote:
On Wed, Mar 6, 2013 at 4:29 AM, Francesc Alted francesc@continuum.io wrote:
I would not rush into this. The example above takes 9 bytes to host the structure, while `aligned=True` takes 16 bytes. I'd rather leave the default as it is; in case performance is critical, you can always copy the unaligned field to a new (homogeneous) array.
Yes, I can absolutely see the case you're making here, and I made my "vote" with the understanding that `aligned=False` will almost certainly stay the default. Adding 'aligned=True' is simple for me to do, so no harm done.
My case is based on what's the least surprising behavior: C structs / all C compilers, the builtin `struct` module, and ctypes `Structure` subclasses all use padding to ensure aligned fields by default. You can turn this off to get packed structures, but the default behavior in these other places is alignment, which is why I was surprised when I first saw that NumPy structured dtypes are packed by default.
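The default-alignment contrast described above can be checked directly. A quick sketch (the sizes shown are typical for 64-bit platforms; exact padding is platform-dependent):

```python
# Compare default alignment across C-like facilities and NumPy
# for a struct of one uint8 field followed by one uint64 field.
import ctypes
import struct
import numpy as np

class S(ctypes.Structure):  # ctypes pads fields by default
    _fields_ = [('a', ctypes.c_uint8), ('b', ctypes.c_uint64)]

print(ctypes.sizeof(S))                              # typically 16: padded
print(struct.calcsize('@BQ'))                        # '@' = native alignment: typically 16
print(np.dtype([('a', 'u1'), ('b', 'u8')]).itemsize)  # 9: NumPy packs by default
print(np.dtype([('a', 'u1'), ('b', 'u8')], align=True).itemsize)  # typically 16
```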
Some surprises with aligned / unaligned arrays:
#
import numpy as np

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)

packed_arr = np.ones((10**6,), dtype=packed_dt)
aligned_arr = np.ones((10**6,), dtype=aligned_dt)

print "all(packed_arr['a'] == aligned_arr['a'])", np.all(packed_arr['a'] == aligned_arr['a'])  # True
print "all(packed_arr['b'] == aligned_arr['b'])", np.all(packed_arr['b'] == aligned_arr['b'])  # True
print "all(packed_arr == aligned_arr)", np.all(packed_arr == aligned_arr)  # False (!!)
#
I can understand what's likely going on under the covers that makes these arrays not compare equal, but I'd expect that if all columns of two structured arrays are everywhere equal, then the arrays themselves would be everywhere equal. Bug?
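One way to get the comparison described above is to compare field by field, ignoring the padding bytes; `fields_equal` here is a hypothetical helper for illustration, not a NumPy function:

```python
import numpy as np

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)
p = np.ones(10, dtype=packed_dt)
q = np.ones(10, dtype=aligned_dt)

def fields_equal(x, y):
    """Elementwise equality of two same-shape structured arrays,
    field by field, so padding and layout differences are ignored."""
    if x.dtype.names != y.dtype.names:
        raise ValueError("field names differ")
    out = np.ones(x.shape, dtype=bool)
    for name in x.dtype.names:
        out &= (x[name] == y[name])
    return out

print(fields_equal(p, q).all())  # True
```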
And regarding performance, doing simple timings shows a 30%ish slowdown for unaligned operations:
In [36]: %timeit packed_arr['b']**2
100 loops, best of 3: 2.48 ms per loop

In [37]: %timeit aligned_arr['b']**2
1000 loops, best of 3: 1.9 ms per loop
Whereas summing shows just a 10%ish slowdown:
In [38]: %timeit packed_arr['b'].sum()
1000 loops, best of 3: 1.29 ms per loop

In [39]: %timeit aligned_arr['b'].sum()
1000 loops, best of 3: 1.14 ms per loop
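For reference, the same comparison can be reproduced outside IPython with the stdlib timeit module (the absolute numbers are machine-dependent, so no expected timings are given):

```python
# Benchmark squaring an unaligned vs. an aligned uint64 field.
import timeit
import numpy as np

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)
bp = np.ones(10**6, dtype=packed_dt)['b']    # unaligned uint64 view, stride 9
ba = np.ones(10**6, dtype=aligned_dt)['b']   # aligned uint64 view, stride 16

t_packed = min(timeit.repeat(lambda: bp ** 2, number=20, repeat=3))
t_aligned = min(timeit.repeat(lambda: ba ** 2, number=20, repeat=3))
print("packed:  %.4f s for 20 runs" % t_packed)
print("aligned: %.4f s for 20 runs" % t_aligned)
```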
On Wed, 2013-03-06 at 12:42 -0600, Kurt Smith wrote:
I can understand what's likely going on under the covers that makes these arrays not compare equal, but I'd expect that if all columns of two structured arrays are everywhere equal, then the arrays themselves would be everywhere equal. Bug?
Yes and no... equality for structured types seems not to be implemented; you get the same (wrong) False even with (packed_arr == packed_arr). But if the types are equivalent and np.equal is simply not implemented, quietly returning False is a bit dangerous, I agree. I'm not sure exactly what the solution is; I think the == operator should probably raise an error instead of swallowing it...
 Sebastian
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
On 3/6/13 7:42 PM, Kurt Smith wrote:
And regarding performance, doing simple timings shows a 30%ish slowdown for unaligned operations:
In [36]: %timeit packed_arr['b']**2 100 loops, best of 3: 2.48 ms per loop
In [37]: %timeit aligned_arr['b']**2 1000 loops, best of 3: 1.9 ms per loop
Hmm, that clearly depends on the architecture. On my machine:
In [1]: import numpy as np
In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
In [6]: baligned = aligned_arr['b']
In [7]: bpacked = packed_arr['b']
In [8]: %timeit baligned**2
1000 loops, best of 3: 1.96 ms per loop

In [9]: %timeit bpacked**2
100 loops, best of 3: 7.84 ms per loop
That is, the unaligned column is 4x slower (!). numexpr gives somewhat better results:
In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop

In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop
Yes, in this case the unaligned array goes faster (by as much as 30%). I think the reason is that numexpr optimizes the unaligned access by copying the different chunks into internal buffers that fit in L1 cache. Apparently this is very beneficial in this case (not sure why, though).
Whereas summing shows just a 10%ish slowdown:
In [38]: %timeit packed_arr['b'].sum()
1000 loops, best of 3: 1.29 ms per loop

In [39]: %timeit aligned_arr['b'].sum()
1000 loops, best of 3: 1.14 ms per loop
On my machine:
In [14]: %timeit baligned.sum()
1000 loops, best of 3: 1.03 ms per loop

In [15]: %timeit bpacked.sum()
100 loops, best of 3: 3.79 ms per loop
Again, the 4x slowdown is here. Using numexpr:
In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop

In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop
Again, the unaligned case is slightly better. In this case numexpr is a bit slower than NumPy because sum() is not parallelized internally. Hmm, given that, I'm wondering if some internal copies to L1 in NumPy could help improve unaligned performance. Worth a try?
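The closing idea here, copying blocks into a small aligned buffer before reducing, can be sketched in pure NumPy; `blocked_sum` is a hypothetical illustration of the technique, not how NumPy or numexpr actually implement it:

```python
import numpy as np

def blocked_sum(x, block=4096):
    """Sum x by first copying each chunk into a small contiguous buffer,
    mimicking numexpr's copy-into-cache blocking for unaligned inputs."""
    buf = np.empty(block, dtype=x.dtype)
    total = x.dtype.type(0)
    for i in range(0, x.size, block):
        chunk = x[i:i + block]
        buf[:chunk.size] = chunk          # aligned, contiguous copy
        total += buf[:chunk.size].sum()   # reduce over aligned data
    return total

packed = np.ones(10**6, dtype=np.dtype([('a', 'i1'), ('b', 'i8')], align=False))
bp = packed['b']                          # unaligned int64 view
print(blocked_sum(bp) == bp.sum())        # True
```

Whether the extra copy pays off depends on how badly strided the input is; for an already-aligned contiguous array it is pure overhead.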
On 3/7/13 6:47 PM, Francesc Alted wrote:
[...] That is, the unaligned column is 4x slower (!). numexpr gives somewhat better results:

In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop

In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop
Just for completeness, here is what Theano gets:
In [18]: import theano
In [20]: a = theano.tensor.vector()
In [22]: f = theano.function([a], a**2)
In [23]: %timeit f(baligned)
100 loops, best of 3: 7.74 ms per loop

In [24]: %timeit f(bpacked)
100 loops, best of 3: 12.6 ms per loop
So yeah, Theano is also slower for the unaligned case (but less than 2x in this case).
[...] Again, the 4x slowdown is here. Using numexpr:

In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop

In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop
And with Theano:
In [26]: f2 = theano.function([a], a.sum())
In [27]: %timeit f2(baligned)
100 loops, best of 3: 2.52 ms per loop

In [28]: %timeit f2(bpacked)
100 loops, best of 3: 7.43 ms per loop
Again, the unaligned case is significantly slower (as much as 3x here!).
Hi,
It is normal that unaligned accesses are slower; the hardware has been optimized for aligned access, so this is a user choice of space vs. speed. We can't get around that. We can only minimize the cost of unaligned access in some cases, not all, and those optimizations depend on the CPU. Newer CPUs have lowered the cost of unaligned access, though.
I'm surprised that Theano worked with the unaligned input. I added some checks to make this raise an error, as we do not support that! Francesc, can you check whether Theano gives the right result? It is possible that someone (maybe me) just copies the input to an aligned ndarray when we receive an unaligned one. That could explain why it worked, but my memory tells me that we raise an error.
As you saw in the numbers, this is a bad example for Theano, as the compiled function is too fast. There is more Theano overhead than computation time in that example. We have recently reduced the overhead, but we can do more to lower it.
Fred
On Thu, Mar 7, 2013 at 12:26 PM, Frédéric Bastien nouiz@nouiz.org wrote:
Hi,
It is normal that unaligned accesses are slower. The hardware has been optimized for aligned access. So this is a user choice: space vs. speed.
The quantitative difference is still important, so this thread is useful for future reference, I think. If reading data into a packed array is 3x faster than reading into an aligned array, but the core computation is 4x slower with a packed array... you get the idea.
I would have benefited years ago from knowing that (1) NumPy structured dtypes are packed by default, and (2) computations with unaligned data can be several times slower than with aligned data. That's strong motivation to always use `aligned=True`, except when memory usage is an issue, or for file IO with packed binary data, etc.
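The file-IO exception suggests a common pattern: keep the packed dtype for the bytes on disk, and pay one up-front `astype` copy to get aligned fields for computation. A sketch, with an in-memory byte string standing in for a real packed binary file:

```python
import numpy as np

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=False)
aligned_dt = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)

# Stand-in for the contents of a packed binary file on disk.
raw = np.ones(1000, dtype=packed_dt).tobytes()

on_disk = np.frombuffer(raw, dtype=packed_dt)  # zero-copy packed view of the bytes
in_mem = on_disk.astype(aligned_dt)            # one up-front copy -> aligned fields

print(on_disk.dtype.itemsize, in_mem.dtype.itemsize)  # 9 16
```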
I agree that documenting this better would be useful to many people.
So if someone wants to summarize this and put it in the docs, I think many people would appreciate it.
Fred
On 3/7/13 7:26 PM, Frédéric Bastien wrote:
Hi,
It is normal that unaligned accesses are slower. The hardware has been optimized for aligned access. So this is a user choice: space vs. speed. We can't get around that.
Well, my benchmarks apparently say that numexpr can get better performance when tackling computations on unaligned arrays (30% faster). This puzzled me a bit yesterday, but after thinking about what is happening, the explanation is now clear to me.
The aligned and unaligned columns are not contiguous, as they have a gap between elements (a consequence of the layout of structured arrays): 8 bytes for the aligned case and 1 byte for the packed one. The hardware of modern machines fetches a complete cache line (typically 64 bytes) whenever an element is accessed, and that means that, even though we only make use of one field in the computations, both fields are brought into cache. So for the aligned array, 16 MB (16 bytes * 1 million elements) are transmitted to the cache, while the packed array only has to transmit 9 MB (9 bytes * 1 million). Of course, transmitting 16 MB is considerably more work than just 9 MB.
Now, the elements land in cache aligned for the aligned case and unaligned for the packed case, and, as you say, unaligned access in cache is pretty slow for the CPU; this is the reason why NumPy can take up to 4x more time to perform the computation. So why does numexpr perform much better for the packed case? Well, it turns out that numexpr has machinery to detect that an array is unaligned, and does an internal copy of every block that is brought into cache to be computed. This block size is between 1024 elements (8 KB for double precision) and 4096 elements when linked with VML support, which means that the copy normally happens at L1 or L2 cache speed, much faster than a memory-to-memory copy. After the copy, numexpr can perform operations on aligned data at full CPU speed. The paradox is that, by doing more copies, you may end up performing faster computations. This is the joy of programming with the memory hierarchy in mind.
This is to say that there is more in the equation than just whether an array is aligned or not. You must take into account how (and how much!) data travels from storage to CPU before making assumptions about the performance of your programs.
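The numbers in this explanation can be read straight off the arrays: the strides of the `b` views are the record sizes, and `nbytes` gives the traffic for a full pass over each array. The last line's result is an assumption about typical 64-bit builds, where the packed view fails NumPy's alignment check:

```python
import numpy as np

aligned = np.ones(10**6, dtype=np.dtype([('a', 'i1'), ('b', 'i8')], align=True))
packed = np.ones(10**6, dtype=np.dtype([('a', 'i1'), ('b', 'i8')], align=False))
ba, bp = aligned['b'], packed['b']

print(ba.strides, bp.strides)                     # (16,) (9,)
print(aligned.nbytes / 1e6, packed.nbytes / 1e6)  # 16.0 9.0 (MB per full pass)
print(ba.flags.aligned, bp.flags.aligned)         # True False on most builds
```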
I'm surprised that Theano worked with the unaligned input. I added some checks to make this raise an error, as we do not support that! Francesc, can you check whether Theano gives the right result? It is possible that someone (maybe me) just copies the input to an aligned ndarray when we receive an unaligned one. That could explain why it worked, but my memory tells me that we raise an error.
It seems to work for me:
In [10]: f = theano.function([a], a**2)
In [11]: f(baligned)
Out[11]: array([ 1., 1., 1., ..., 1., 1., 1.])

In [12]: f(bpacked)
Out[12]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [13]: f2 = theano.function([a], a.sum())
In [14]: f2(baligned)
Out[14]: array(1000000.0)

In [15]: f2(bpacked)
Out[15]: array(1000000.0)
As you saw in the numbers, this is a bad example for Theano, as the compiled function is too fast. There is more Theano overhead than computation time in that example. We have recently reduced the overhead, but we can do more to lower it.
Yeah. I was mainly curious about how different packages handle unaligned arrays.
On Fri, Mar 8, 2013 at 5:22 AM, Francesc Alted francesc@continuum.io wrote:
On 3/7/13 7:26 PM, Frédéric Bastien wrote:
I'm surprised that Theano worked with the unaligned input. I added some checks to make this raise an error, as we do not support that! Francesc, can you check whether Theano gives the right result? It is possible that someone (maybe me) just copies the input to an aligned ndarray when we receive an unaligned one. That could explain why it worked, but my memory tells me that we raise an error.
It seems to work for me:

In [10]: f = theano.function([a], a**2)

In [11]: f(baligned)
Out[11]: array([ 1., 1., 1., ..., 1., 1., 1.])

In [12]: f(bpacked)
Out[12]: array([ 1., 1., 1., ..., 1., 1., 1.])

In [13]: f2 = theano.function([a], a.sum())

In [14]: f2(baligned)
Out[14]: array(1000000.0)

In [15]: f2(bpacked)
Out[15]: array(1000000.0)
I understand what happened. You declare the symbolic variable like this:
a = theano.tensor.vector()
This creates a symbolic variable with dtype floatX, which is float64 by default. baligned and bpacked are of dtype int64.
When a Theano function receives as input an ndarray of the wrong dtype, we try to cast it to the right dtype and check that we don't lose precision. As the inputs are all 1s, there is no loss of precision, so the input is silently accepted and copied. So when we later check the aligned flag, it passes.
If you change the symbolic variable to have a dtype of int64, there won't be a copy, and we see the error:
a = theano.tensor.lvector()
f = theano.function([a], a ** 2)
f(bpacked)
TypeError: ('Bad input argument to theano function at index 0 (0-based)', 'The numpy.ndarray object is not aligned. Theano C code does not support that.', '', 'object shape', (1000000,), 'object strides', (9,))
If I now time this new function, I get:
In [14]: timeit baligned**2
100 loops, best of 3: 7.5 ms per loop

In [15]: timeit bpacked**2
100 loops, best of 3: 8.25 ms per loop

In [16]: timeit f(baligned)
100 loops, best of 3: 7.36 ms per loop
So the Theano overhead was the copy in this case. This is not the first time I have seen this. We added the automatic cast to allow specifying most Python ints/lists/reals as input.
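The silent cast-and-copy described here has a plain-NumPy analogue: casting an unaligned view to a different dtype forces a fresh (and therefore aligned) copy, while a same-dtype `asarray` is a no-op that leaves the unaligned view in place. This is a sketch of the mechanism, not Theano's actual code:

```python
import numpy as np

packed = np.ones(10**6, dtype=np.dtype([('a', 'i1'), ('b', 'i8')], align=False))
bp = packed['b']                          # unaligned int64 view

cast = np.asarray(bp, dtype=np.float64)   # dtype mismatch forces a fresh, aligned copy
same = np.asarray(bp, dtype=np.int64)     # matching dtype: no copy, still unaligned

print(cast.flags.aligned)                 # True: freshly allocated buffer
print(np.shares_memory(same, packed))     # True: no copy was made
```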
Fred
On Thu, Mar 7, 2013 at 11:47 AM, Francesc Alted francesc@continuum.io wrote:
On 3/6/13 7:42 PM, Kurt Smith wrote:
Hmm, that clearly depends on the architecture. On my machine: ... That is, the unaligned column is 4x slower (!). numexpr gives somewhat better results: ... Yes, in this case the unaligned array goes faster (by as much as 30%). I think the reason is that numexpr optimizes the unaligned access by copying the different chunks into internal buffers that fit in L1 cache. Apparently this is very beneficial in this case (not sure why, though).
On my machine: ... Again, the 4x slowdown is here. Using numexpr: ... Again, the unaligned case is slightly better. In this case numexpr is a bit slower than NumPy because sum() is not parallelized internally. Hmm, given that, I'm wondering if some internal copies to L1 in NumPy could help improve unaligned performance. Worth a try?
Very interesting -- thanks for sharing.
 Francesc Alted
participants (4)
- Francesc Alted
- Frédéric Bastien
- Kurt Smith
- Sebastian Berg