[Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed
Francesc Alted
faltet at gmail.com
Thu Apr 17 14:30:06 EDT 2014
On 17/04/14 19:28, Julian Taylor wrote:
> On 17.04.2014 18:06, Francesc Alted wrote:
>
>> In [4]: x_unaligned = np.zeros(shape,
>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
> On arrays of this size you won't see alignment issues; you are dominated
> by memory bandwidth. If at all, you will only see it if the data fits
> into the cache.
> It's also about being unaligned to SIMD vectors, not unaligned to basic types.
> But it doesn't matter anymore on modern x86 CPUs. I guess for array data
> cache line splits should also not be a big concern.
Yes, that was my point: on x86 CPUs this is not such a big
problem. Still, a factor of 2 is significant, even for CPU-intensive
tasks. For example, computing sin() is affected similarly (sin() is
using SIMD, right?):
In [6]: shape = (10000, 10000)
In [7]: x_aligned = np.zeros(shape,
        dtype=[('x',np.float64),('y',np.int64)])['x']
In [8]: x_unaligned = np.zeros(shape,
        dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
In [9]: %timeit res = np.sin(x_aligned)
1 loops, best of 3: 654 ms per loop
In [10]: %timeit res = np.sin(x_unaligned)
1 loops, best of 3: 1.08 s per loop
And again, numexpr deals with that pretty well (using 8 physical
cores here):
In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
10 loops, best of 3: 149 ms per loop
In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
10 loops, best of 3: 151 ms per loop
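To confirm that the two field views really differ in alignment, one can inspect the array flags and strides. This is a minimal sketch with a much smaller shape than the (10000, 10000) used in the timings; the name n is an assumption for illustration, and the offsets assume NumPy's default packed (align=False) structured dtypes.

```python
import numpy as np

n = 100  # small illustrative size (assumption); the timings use 10000
shape = (n, n)

# 'x' sits at byte offset 0 of a 16-byte record: naturally aligned.
x_aligned = np.zeros(
    shape, dtype=[('x', np.float64), ('y', np.int64)])['x']

# 'x' sits at byte offset 1 of a packed 16-byte record (1 + 8 + 7 bytes),
# so, assuming the buffer itself starts on an 8-byte boundary, every
# float64 begins one byte past such a boundary.
x_unaligned = np.zeros(
    shape,
    dtype=[('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))])['x']

print(x_aligned.flags.aligned, x_aligned.strides)
print(x_unaligned.flags.aligned, x_unaligned.strides)

# Both views are strided (16 bytes between consecutive float64 values),
# and ufuncs return the same values on either one.
same = np.array_equal(np.sin(x_aligned), np.sin(x_unaligned))
```

Note that both views have the same strides, so the timing difference comes from the one-byte offset rather than from a different memory footprint.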
> Aligned allocators are not the only allocator which might be useful in
> numpy. Modern CPUs also support larger pages than 4K (huge pages up to
> 1GB in size) which reduces TLB cache misses. Memory of this type
> typically needs to be allocated with special mmap flags, though newer
> kernel versions can now also provide this memory to transparent
> anonymous pages (normal non-file mmaps).
That's interesting. In which scenarios do you think that could improve
performance?
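For reference, here is a rough sketch of how huge-page-backed memory might be requested from Python and wrapped as a NumPy array. It is only an illustration under assumptions: the helper name alloc_maybe_hugepage is hypothetical, mmap.madvise needs Python 3.8 or later, MADV_HUGEPAGE exists only on Linux, and the call is merely a hint that the kernel is free to ignore.

```python
import mmap
import numpy as np

def alloc_maybe_hugepage(n_items, dtype=np.float64):
    """Hypothetical helper: anonymous mmap buffer, advised to use huge pages.

    n_items must be positive (an empty mapping is not allowed).
    """
    nbytes = n_items * np.dtype(dtype).itemsize
    buf = mmap.mmap(-1, nbytes)  # anonymous, zero-filled mapping
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # Linux-only constant
    if advice is not None and hasattr(buf, "madvise"):
        try:
            buf.madvise(advice)  # a hint only; the kernel may ignore it
        except OSError:
            pass  # transparent huge pages not available on this system
    # The array keeps a reference to the mapping, so it stays alive.
    return np.frombuffer(buf, dtype=dtype)

a = alloc_maybe_hugepage(1 << 18)  # 2 MB of float64
a[:] = 1.0
```

Whether this helps at all depends on the kernel configuration and the access pattern; the sketch falls back to ordinary pages silently.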
>> In [8]: import numexpr as ne
>>
>> In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
>> 10 loops, best of 3: 133 ms per loop
>>
>> In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
>> 10 loops, best of 3: 134 ms per loop
>>
>> i.e. there is not a significant difference between aligned and unaligned
>> access to data.
>>
>> I wonder if the same technique could be applied to NumPy.
>
> you already can do so with relatively simple means:
> http://nbviewer.ipython.org/gist/anonymous/10942132
>
> If you change the blocking function to get a function as input and use
> inplace operations numpy can even beat numexpr (though I used the
> numexpr Ubuntu package which might not be compiled optimally)
> This type of transformation can probably be applied on the AST quite easily.
That's smart. Yeah, I don't see a reason why numexpr would
perform badly on Ubuntu. But I am not getting your performance for
blocked_thread on my AMI Linux vbox:
http://nbviewer.ipython.org/gist/anonymous/11000524
Oh well, threads are always tricky beasts. By the way, apparently the
optimal block size for my machine is something like 1 MB, not 128 KB,
although the difference is not big:
http://nbviewer.ipython.org/gist/anonymous/11002751
(thanks to Stefan Van der Walt for the script).
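For readers without access to the notebooks, the blocking idea can be sketched roughly as follows. This is not the code from the linked gists: the function name blocked_apply and its parameters are assumptions for illustration. It splits the array along its first axis into chunks of about block_bytes, applies a ufunc to each chunk writing directly into the output, and optionally spreads the chunks over threads (NumPy releases the GIL inside ufunc loops).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def blocked_apply(func, x, block_bytes=1 << 20, n_threads=1):
    """Apply a unary ufunc to x in blocks of roughly block_bytes.

    Hypothetical sketch; assumes x has at least one dimension and a
    floating-point dtype, so the result can be stored in x's dtype.
    """
    out = np.empty(x.shape, dtype=x.dtype)
    # Bytes in one "row" (one index along the first axis).
    row_bytes = x.itemsize * int(np.prod(x.shape[1:]))
    rows = max(1, block_bytes // max(1, row_bytes))

    def work(start):
        # Write each block's result straight into the output buffer.
        func(x[start:start + rows], out=out[start:start + rows])

    starts = range(0, x.shape[0], rows)
    if n_threads > 1:
        with ThreadPoolExecutor(n_threads) as pool:
            list(pool.map(work, starts))  # consume to surface exceptions
    else:
        for s in starts:
            work(s)
    return out

x = np.linspace(0.0, 10.0, 5000 * 8).reshape(5000, 8)
res = blocked_apply(np.sin, x, block_bytes=4096, n_threads=4)
```

The block_bytes default of 1 MB follows the observation above; 128 KB (1 << 17) is the other value discussed, and the best choice is machine dependent.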
-- Francesc Alted