[Numpy-discussion] aligned / unaligned structured dtype behavior
Francesc Alted
francesc at continuum.io
Fri Mar 8 05:22:20 EST 2013
On 3/7/13 7:26 PM, Frédéric Bastien wrote:
> Hi,
>
> It is normal that unaligned accesses are slower. The hardware has been
> optimized for aligned access, so this is a user choice: space vs. speed.
> We can't get around that.
Well, my benchmarks apparently say that numexpr can get better
performance when tackling computations on unaligned arrays (30%
faster). This puzzled me a bit yesterday, but after thinking about
what was happening, the explanation is now clear to me.
The aligned and unaligned arrays were not contiguous, as they had a gap
between elements (a consequence of the layout of structured arrays): 8
bytes for the aligned case and 1 byte for the packed one. The hardware
of modern machines fetches a complete cache line (typically 64 bytes)
whenever an element is accessed, which means that, even though we are
only making use of one field in the computations, both fields are
brought into cache. So for the aligned object, 16 MB (16 bytes * 1
million elements) is transmitted to the cache, while the unaligned
object only has to transmit 9 MB (9 bytes * 1 million). Of course,
transmitting 16 MB is considerably more work than transmitting just 9 MB.
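To make the layout concrete, here is a small sketch of how such a pair of
dtypes could be built (the exact field types from the original benchmark
are not shown in this message, so a float64 plus an int8 field is my
assumption):

```python
import numpy as np

# Hypothetical reconstruction of the two layouts discussed above:
# one 8-byte float field plus a 1-byte field, aligned vs. packed.
dt_aligned = np.dtype([('f0', np.float64), ('f1', np.int8)], align=True)
dt_packed = np.dtype([('f0', np.float64), ('f1', np.int8)], align=False)

print(dt_aligned.itemsize)  # 16 bytes per record (8 + 1 + 7 padding)
print(dt_packed.itemsize)   # 9 bytes per record (8 + 1, no padding)

# Taking the float field gives a strided, non-contiguous view; the
# packed one is flagged as unaligned because its 9-byte stride is not
# a multiple of 8.
baligned = np.ones(1_000_000, dtype=dt_aligned)['f0']
bpacked = np.ones(1_000_000, dtype=dt_packed)['f0']
print(baligned.flags.aligned)  # True
print(bpacked.flags.aligned)   # False
```

So iterating over the float field touches 16 bytes of memory per element in the aligned case, but only 9 in the packed one.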
Now, the elements land in cache aligned for the aligned case and
unaligned for the packed case, and as you say, unaligned access in cache
is pretty slow for the CPU, which is why NumPy can take up to 4x more
time to perform the computation. So why is numexpr performing much
better in the packed case? Well, it turns out that numexpr has
machinery to detect that an array is unaligned, and it makes an
internal copy of every block that is brought into cache to be
computed. This block size is between 1024 elements (8 KB for double
precision) and 4096 elements when linked with VML support, which means
that the copy normally happens at L1 or L2 cache speed, much faster
than a memory-to-memory copy. After the copy, numexpr can perform
operations on aligned data at full CPU speed. The paradox is that, by
doing more copies, you may end up performing computations faster.
This is the joy of programming with the memory hierarchy in mind.
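The blocked-copy strategy can be sketched in pure NumPy (this is only an
illustration of the idea, not numexpr's actual implementation; the block
size of 1024 elements matches the figure above):

```python
import numpy as np

def blocked_square(x, block=1024):
    """Square an array block by block, copying each (possibly
    unaligned) block into a small aligned scratch buffer first.

    The scratch buffer is allocated once and fits comfortably in L1
    cache, so the copy is cheap and the arithmetic then runs on
    aligned, contiguous data."""
    out = np.empty(len(x))
    buf = np.empty(block)  # aligned scratch buffer
    for start in range(0, len(x), block):
        stop = min(start + block, len(x))
        chunk = buf[:stop - start]
        chunk[:] = x[start:stop]       # unaligned -> aligned copy
        out[start:stop] = chunk ** 2   # compute on aligned data
    return out
```

Whether this wins over operating on the unaligned data directly depends on the CPU and the cost of its unaligned accesses, which is exactly the trade-off being discussed.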
This is to say that there is more to the equation than just whether an
array is aligned or not. You must take into account how (and how much!)
data travels from storage to CPU before making assumptions about the
performance of your programs.
> We can only minimize the cost of unaligned
> access in some cases, but not all, and those optimizations depend on the
> CPU. Newer CPUs, however, have lowered the cost of unaligned access.
>
> I'm surprised that Theano worked with the unaligned input. I added
> some checks to make this raise an error, as we do not support that!
> Francesc, can you check whether Theano gives the right result? It is
> possible that someone (maybe me) just copies the input to an aligned
> ndarray when we receive an unaligned one. That could explain why it
> worked, but my memory tells me that we raise an error.
It seems to work for me:
In [10]: f = theano.function([a], a**2)
In [11]: f(baligned)
Out[11]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [12]: f(bpacked)
Out[12]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [13]: f2 = theano.function([a], a.sum())
In [14]: f2(baligned)
Out[14]: array(1000000.0)
In [15]: f2(bpacked)
Out[15]: array(1000000.0)
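If Theano (or any library) is indeed copying unaligned input on entry, one
way that could be done in NumPy is with np.require, which copies only when
the requirement is not already met (a sketch of the behavior being
speculated about, not Theano's actual code):

```python
import numpy as np

# An unaligned array, obtained as a strided field of a packed record.
packed = np.ones(10, dtype=np.dtype([('x', np.float64), ('y', np.int8)]))['x']
print(packed.flags.aligned)  # False: 9-byte stride

# np.require returns the input unchanged if it already satisfies the
# requirements, and otherwise makes a conforming copy.
fixed = np.require(packed, requirements=['ALIGNED'])
print(fixed.flags.aligned)   # True
print(fixed is packed)       # False: a copy was made
```

A library that silently does this would give correct results on unaligned input, at the cost of an extra copy, which could explain the behavior observed above.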
>
> As you saw in the numbers, this is a bad example for Theano, as the
> compiled function is too fast. There is more Theano overhead than
> computation time in that example. We have reduced the overhead
> recently, but we can do more to lower it.
Yeah. I was mainly curious about how different packages handle
unaligned arrays.
--
Francesc Alted