[Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed
faltet at gmail.com
Thu Apr 17 12:06:56 EDT 2014
Uh, 15x slower for unaligned access is quite a lot. But Intel (and AMD)
architectures are much more tolerant in this respect (and improving).
For example, with a Xeon(R) CPU E5-2670 (2 years old) I get:
In : import numpy as np
In : shape = (10000, 10000)
In : x_aligned = np.zeros(shape,
In : x_unaligned = np.zeros(shape,
In : %timeit res = x_aligned ** 2
1 loops, best of 3: 289 ms per loop
In : %timeit res = x_unaligned ** 2
1 loops, best of 3: 664 ms per loop
so the added cost in this case is just a bit more than 2x. But you can
also alleviate this overhead if you make a copy that fits in cache before
doing the computations. numexpr does this, and the results are pretty good:
In : import numexpr as ne
In : %timeit res = ne.evaluate('x_aligned ** 2')
10 loops, best of 3: 133 ms per loop
In : %timeit res = ne.evaluate('x_unaligned ** 2')
10 loops, best of 3: 134 ms per loop
i.e. there is not a significant difference between aligned and unaligned
access to data.
I wonder if the same technique could be applied to NumPy.
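
To make the idea concrete, here is a minimal sketch of the
copy-before-compute technique in plain NumPy. It is not how numexpr is
implemented (numexpr does its blocking in compiled code), and the names
make_unaligned and square_blocked, as well as the block size of 4096
elements, are made up for illustration. The Python-level loop adds
overhead of its own, so take it as a demonstration of the idea rather
than a benchmark.

```python
import numpy as np


def make_unaligned(shape, dtype=np.float64):
    """Return a zero-filled array whose data pointer is off by one byte.

    Hypothetical helper: over-allocate a byte buffer, skip the first byte,
    and reinterpret the rest as `dtype`.
    """
    dtype = np.dtype(dtype)
    n = int(np.prod(shape))
    raw = np.zeros(n * dtype.itemsize + 1, dtype=np.uint8)
    return raw[1:].view(dtype).reshape(shape)


def square_blocked(x, block_size=4096):
    """Compute x ** 2 chunk by chunk through a small aligned scratch buffer.

    Each chunk is first copied into `scratch` (freshly allocated, hence
    aligned, and small enough to stay in cache); the arithmetic then only
    ever reads aligned memory. Assumes x is C-contiguous so that
    reshape(-1) is a view.
    """
    flat = x.reshape(-1)
    out = np.empty(flat.shape, dtype=x.dtype)
    scratch = np.empty(block_size, dtype=x.dtype)
    for start in range(0, flat.size, block_size):
        stop = min(start + block_size, flat.size)
        buf = scratch[: stop - start]
        np.copyto(buf, flat[start:stop])            # unaligned -> aligned copy
        np.multiply(buf, buf, out=out[start:stop])  # compute on aligned data
    return out.reshape(x.shape)


if __name__ == "__main__":
    x_unaligned = make_unaligned((1000, 1000))
    x_unaligned[...] = 3.0
    print("input aligned: ", x_unaligned.flags.aligned)
    res = square_blocked(x_unaligned)
    print("output aligned:", res.flags.aligned)
    print("matches x ** 2:", np.array_equal(res, x_unaligned ** 2))
```

The flags.aligned attribute is a quick way to check whether a given
array is affected at all before deciding whether a copy is worthwhile.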
On 17/04/14 16:26, Aron Ahmadia wrote:
> Hmnn, I wasn't being clear :)
> The default malloc on BlueGene/Q only returns 8-byte alignment, but
> the SIMD units need 32-byte alignment for loads, stores, and
> operations or performance suffers. On the /P the required alignment
> was 16 bytes, but malloc only gave you 8, and trying to perform
> vectorized loads/stores generated alignment exceptions on unaligned
> See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and
> https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slide
> 14 for an overview, slide 15 for the effective performance difference
> between the unaligned and aligned code) for some notes on this.
> On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On 17 Apr 2014 15:09, "Aron Ahmadia" <aron at ahmadia.net> wrote:
> > > On the one hand it would be nice to actually know whether
> posix_memalign is important, before making api decisions on this
> > FWIW: On the lightweight IBM cores that the extremely popular
> BlueGene machines were based on, accessing unaligned memory raised
> system faults. The default behavior of these machines was to
> terminate the program if more than 1000 such errors occurred on a
> given process, and an environment variable allowed you to
> terminate the program if *any* unaligned memory access occurred.
> This is because unaligned memory accesses were 15x (or more)
> slower than aligned memory access.
> > The newer /Q chips seem to be a little more forgiving of this,
> but I think one can in general expect allocated memory alignment
> to be an important performance technique for future high
> performance computing architectures.
> Right, this much is true on lots of architectures, and so malloc
> is careful to always return values with sufficient alignment (e.g.
> 8 bytes) to make sure that any standard operation can succeed.
> The question here is whether it will be important to have *even
> more* alignment than malloc gives us by default. A 16 or 32 byte
> wide SIMD instruction might prefer that data have 16 or 32 byte
> alignment, even if normal memory access for the types being
> operated on only requires 4 or 8 byte alignment.
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org <mailto:NumPy-Discussion at scipy.org>