[Numpy-discussion] aligned / unaligned structured dtype behavior
Francesc Alted
francesc at continuum.io
Fri Mar 8 05:22:20 EST 2013
On 3/7/13 7:26 PM, Frédéric Bastien wrote:
> Hi,
>
> It is normal that unaligned accesses are slower. The hardware has been
> optimized for aligned access, so this is a user choice: space vs. speed.
> We can't get around that.
Well, my benchmarks apparently say that numexpr can get better
performance when tackling computations on unaligned arrays (30%
faster). This puzzled me a bit yesterday, but after thinking about
what was happening, the explanation is now clear to me.
The aligned and unaligned arrays were not contiguous, as they had a gap
between elements (a consequence of the layout of structured arrays): 8
bytes for the aligned case and 1 byte for the packed one. The hardware
of modern machines fetches a complete cache line (typically 64 bytes)
whenever an element is accessed, which means that, even though we are
only making use of one field in the computations, both fields are
brought into cache. So for the aligned object, 16 MB (16 bytes * 1
million elements) is transmitted to the cache, while the unaligned
object only has to transmit 9 MB (9 bytes * 1 million). Of course,
transmitting 16 MB is considerably more work than transmitting just 9 MB.
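To make the layout concrete, here is a small sketch of how such a pair of
dtypes could be built (the exact field types from the original benchmark
are not shown in this message, so a float64 plus an int8 field is my
assumption):

```python
import numpy as np

# Hypothetical reconstruction of the two layouts discussed above:
# one 8-byte float field plus a 1-byte field, aligned vs. packed.
dt_aligned = np.dtype([('f0', np.float64), ('f1', np.int8)], align=True)
dt_packed = np.dtype([('f0', np.float64), ('f1', np.int8)], align=False)

print(dt_aligned.itemsize)  # 16 bytes per record (8 + 1 + 7 padding)
print(dt_packed.itemsize)   # 9 bytes per record (8 + 1, no padding)

# Taking the float field gives a strided, non-contiguous view; the
# packed one is flagged as unaligned because its 9-byte stride is not
# a multiple of 8.
baligned = np.ones(1_000_000, dtype=dt_aligned)['f0']
bpacked = np.ones(1_000_000, dtype=dt_packed)['f0']
print(baligned.flags.aligned)  # True
print(bpacked.flags.aligned)   # False
```

So iterating over the float field touches 16 bytes of memory per element in the aligned case, but only 9 in the packed one.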
Now, the elements land in cache aligned for the aligned case and
unaligned for the packed case, and as you say, unaligned access in cache
is pretty slow for the CPU, which is why NumPy can take up to 4x more
time to perform the computation. So why is numexpr performing much
better in the packed case? Well, it turns out that numexpr has
machinery to detect that an array is unaligned, and it makes an
internal copy of every block that is brought into cache to be
computed. This block size is between 1024 elements (8 KB for double
precision) and 4096 elements when linked with VML support, which means
that the copy normally happens at L1 or L2 cache speed, much faster
than a memory-to-memory copy. After the copy, numexpr can perform
operations on aligned data at full CPU speed. The paradox is that, by
doing more copies, you may end up performing computations faster.
This is the joy of programming with the memory hierarchy in mind.
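The blocked-copy strategy can be sketched in pure NumPy (this is only an
illustration of the idea, not numexpr's actual implementation; the block
size of 1024 elements matches the figure above):

```python
import numpy as np

def blocked_square(x, block=1024):
    """Square an array block by block, copying each (possibly
    unaligned) block into a small aligned scratch buffer first.

    The scratch buffer is allocated once and fits comfortably in L1
    cache, so the copy is cheap and the arithmetic then runs on
    aligned, contiguous data."""
    out = np.empty(len(x))
    buf = np.empty(block)  # aligned scratch buffer
    for start in range(0, len(x), block):
        stop = min(start + block, len(x))
        chunk = buf[:stop - start]
        chunk[:] = x[start:stop]       # unaligned -> aligned copy
        out[start:stop] = chunk ** 2   # compute on aligned data
    return out
```

Whether this wins over operating on the unaligned data directly depends on the CPU and the cost of its unaligned accesses, which is exactly the trade-off being discussed.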
This is to say that there is more to the equation than just whether an
array is aligned or not. You must take into account how (and how much!)
data travels from storage to CPU before making assumptions about the
performance of your programs.
> We can only minimize the cost of unaligned
> access in some cases, but not all, and those optimizations depend on the
> CPU. Newer CPUs, however, have lowered the cost of unaligned access.
>
> I'm surprised that Theano worked with the unaligned input. I added
> some checks to make this raise an error, as we do not support that!
> Francesc, can you check whether Theano gives the right result? It is
> possible that someone (maybe me) just copies the input to an aligned
> ndarray when we receive an unaligned one. That could explain why it
> worked, but my memory tells me that we raise an error.
It seems to work for me:
In [10]: f = theano.function([a], a**2)
In [11]: f(baligned)
Out[11]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [12]: f(bpacked)
Out[12]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [13]: f2 = theano.function([a], a.sum())
In [14]: f2(baligned)
Out[14]: array(1000000.0)
In [15]: f2(bpacked)
Out[15]: array(1000000.0)
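If Theano (or any library) is indeed copying unaligned input on entry, one
way that could be done in NumPy is with np.require, which copies only when
the requirement is not already met (a sketch of the behavior being
speculated about, not Theano's actual code):

```python
import numpy as np

# An unaligned array, obtained as a strided field of a packed record.
packed = np.ones(10, dtype=np.dtype([('x', np.float64), ('y', np.int8)]))['x']
print(packed.flags.aligned)  # False: 9-byte stride

# np.require returns the input unchanged if it already satisfies the
# requirements, and otherwise makes a conforming copy.
fixed = np.require(packed, requirements=['ALIGNED'])
print(fixed.flags.aligned)   # True
print(fixed is packed)       # False: a copy was made
```

A library that silently does this would give correct results on unaligned input, at the cost of an extra copy, which could explain the behavior observed above.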
>
> As you saw in the numbers, this is a bad example for Theano, as the
> compiled function is too fast. There is more Theano overhead than
> computation time in that example. We have reduced the overhead
> recently, but we can do more to lower it.
Yeah. I was mainly curious about how different packages handle
unaligned arrays.
--
Francesc Alted