[Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

Thu Apr 17 15:19:58 EDT 2014

On 17.04.2014 20:30, Francesc Alted wrote:
> El 17/04/14 19:28, Julian Taylor ha escrit:
>> On 17.04.2014 18:06, Francesc Alted wrote:
>>
>>> In [4]: x_unaligned = np.zeros(shape,
>>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
>> On arrays of this size you won't see alignment issues; you are dominated
>> by memory bandwidth. If at all, you will only see them if the data fits
>> into the cache.
>> It's also about being unaligned to SIMD vectors, not unaligned to basic types.
>> But it doesn't matter anymore on modern x86 CPUs. I guess for array data
>> cache line splits should also not be a big concern.
> 
> Yes, that was my point, that in x86 CPUs this is not such a big 
> problem.  But still a factor of 2 is significant, even for CPU-intensive 
> tasks.  For example, computing sin() is affected similarly (sin() is 
> using SIMD, right?):
> 
> In [6]: shape = (10000, 10000)
> 
> In [7]: x_aligned = np.zeros(shape, 
> dtype=[('x',np.float64),('y',np.int64)])['x']
> 
> In [8]: x_unaligned = np.zeros(shape, 
> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
> 
> In [9]: %timeit res = np.sin(x_aligned)
> 1 loops, best of 3: 654 ms per loop
> 
> In [10]: %timeit res = np.sin(x_unaligned)
> 1 loops, best of 3: 1.08 s per loop
> 
> and again, numexpr can deal with that pretty well (using 8 physical 
> cores here):
> 
> In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
> 10 loops, best of 3: 149 ms per loop
> 
> In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
> 10 loops, best of 3: 151 ms per loop

In this case the unaligned input triggers a loop that calls memcpy for
each strided element to copy the data into an aligned buffer, which is
terrible for performance, even compared to the expensive sin call.
numexpr handles this well because it allows the compiler to replace the
memcpy with inline assembly (a mov instruction).
We could fix that in numpy, though I don't consider it very important,
as you usually have base-type-aligned memory.

(sin does not use SIMD unless you use a vector math library, which
numpy does not support directly yet)
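
To check which case a given array falls into, one can look at its
flags. The sketch below rebuilds the two arrays from the timings above
(with a smaller shape, my choice, to keep it quick) and shows two
standard ways to end up with base-type-aligned data. The names x_fixed
and x_padded are only illustrative, and this is not what numpy does
internally for the unaligned case.

```python
import numpy as np

shape = (1000, 1000)  # smaller than in the thread, only to keep this quick

# Field 'x' starts at byte offset 0 of each 16-byte record.
x_aligned = np.zeros(shape, dtype=[('x', np.float64), ('y', np.int64)])['x']

# Packed record: 'x' starts at byte offset 1, so it is not aligned
# to the float64 base type.
x_unaligned = np.zeros(
    shape,
    dtype=[('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))])['x']

print(x_aligned.flags.aligned)
print(x_unaligned.flags.aligned)

# Option 1: copy the field into a fresh contiguous (and aligned) buffer.
x_fixed = np.ascontiguousarray(x_unaligned)
print(x_fixed.flags.aligned)

# Option 2: let numpy pad the record like a C struct with align=True.
padded = np.dtype(
    [('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))], align=True)
x_padded = np.zeros(shape, dtype=padded)['x']
print(x_padded.flags.aligned)
```

Option 1 costs a copy up front; option 2 costs extra memory per record
for the padding.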

> 
> 
>> Aligned allocators are not the only allocator which might be useful in
>> numpy. Modern CPUs also support larger pages than 4K (huge pages up to
>> 1GB in size) which reduces TLB cache misses. Memory of this type
>> typically needs to be allocated with special mmap flags, though newer
>> kernel versions can now also provide this memory to transparent
>> anonymous pages (normal non-file mmaps).
> 
> That's interesting.  In which scenarios do you think that could improve 
> performance?

It might improve all numpy operations dealing with big arrays.
Big arrays trigger many large temporaries, which means glibc uses mmap,
which means a lot of address space moving between the kernel and userspace.
But I haven't benchmarked it, so it could also be completely irrelevant.

Also, memory fragments really fast, so after a few hours of operation you
often can't allocate any huge pages anymore. You then need to reserve
space for them, which requires special setup of the machines.
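
For experimenting with the transparent variant from Python, something
like the following might do. It is a rough sketch under several
assumptions: Linux with transparent huge pages available, Python 3.8 or
newer (for mmap.madvise), and a 2 MiB huge page size. The helper name
hugepage_array is made up for this example. The madvise call is only a
hint, so the kernel is free to ignore it, and I have not benchmarked
this either.

```python
import mmap
import numpy as np

HUGE_PAGE = 2 * 1024 * 1024  # assumed huge page size (common on x86-64)

def hugepage_array(n_items, dtype=np.float64):
    """Return (array, advised) backed by an anonymous mmap.

    'advised' is True only if the MADV_HUGEPAGE hint was accepted;
    it does not prove that huge pages are actually in use.
    (Hypothetical helper, not part of numpy.)
    """
    nbytes = n_items * np.dtype(dtype).itemsize
    # Round the mapping length up to a multiple of the huge page size.
    length = -(-nbytes // HUGE_PAGE) * HUGE_PAGE
    buf = mmap.mmap(-1, length)  # anonymous, zero-filled mapping

    advised = False
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is not None and hasattr(buf, "madvise"):
        try:
            buf.madvise(advice)
            advised = True
        except OSError:
            pass  # kernel built without THP, or hint rejected

    # The array keeps the mmap alive through its buffer reference.
    arr = np.frombuffer(buf, dtype=dtype, count=n_items)
    return arr, advised

a, advised = hugepage_array(1000)
a[:] = 1.0
print(a.sum(), advised)
```

On platforms without MADV_HUGEPAGE this falls back to a plain anonymous
mapping, so it should still run, just without the hint.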

Another possibility for special allocators is NUMA allocators that
ensure you get memory local to a specific compute node regardless of the
system NUMA policy.
But again, it's probably not very important, as Python has poor thread
scalability anyway. These are just examples of why we should keep our
allocators in numpy flexible and not bind ourselves to what Python does.

> 
>>> In [8]: import numexpr as ne
>>>
>>> In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
>>> 10 loops, best of 3: 133 ms per loop
>>>
>>> In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
>>> 10 loops, best of 3: 134 ms per loop
>>>
>>> i.e. there is not a significant difference between aligned and unaligned
>>> access to data.
>>>
>>> I wonder if the same technique could be applied to NumPy.
>>
>> you already can do so with relatively simple means:
>> http://nbviewer.ipython.org/gist/anonymous/10942132
>>
>> If you change the blocking function to take a function as input and use
>> in-place operations, numpy can even beat numexpr (though I used the
>> numexpr Ubuntu package, which might not be compiled optimally).
>> This type of transformation can probably be applied on the AST quite easily.
> 
> That's smart.  Yeah, I don't see a reason why numexpr would be 
> performing badly on Ubuntu.  But I am not getting your performance for 
> blocked_thread on my AMI linux vbox:
> 
> http://nbviewer.ipython.org/gist/anonymous/11000524
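
For readers without the notebooks at hand, the blocking idea (take a
function as input, apply it chunk by chunk with in-place operations)
can be sketched roughly as below. This is not the notebook code and it
leaves out the threading; blocked_apply, expr, and the default block
size are names and values made up for this illustration.

```python
import numpy as np

def blocked_apply(func, x, block=64 * 1024):
    """Apply func(chunk, out) over flat chunks of x and return the result.

    func must write its result into 'out' in place. Working on small
    chunks keeps the intermediate data in cache instead of streaming
    full-size temporaries through memory.
    """
    x = np.ascontiguousarray(x)      # copies only if x is not contiguous
    out = np.empty_like(x)
    xf = x.reshape(-1)               # flat views, no copies
    of = out.reshape(-1)
    for start in range(0, xf.size, block):
        stop = start + block
        func(xf[start:stop], of[start:stop])
    return out

def expr(chunk, out):
    # sqrt(x) + 1 without full-size temporaries
    np.sqrt(chunk, out=out)
    np.add(out, 1.0, out=out)

x = np.arange(1000000, dtype=np.float64)
res = blocked_apply(expr, x)
print(np.allclose(res, np.sqrt(x) + 1.0))
```

The best block size depends on the cache sizes of the machine, so the
default here is only a starting point.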

My numexpr amd64 package does not make use of SIMD, e.g. for sqrt, which
is vectorized in numpy:

numexpr:
  1.30 │ 4638:   sqrtss (%r14),%xmm0
  0.01 │         ucomis %xmm0,%xmm0
  0.00 │       ↓ jp     11ec4
  4.14 │ 4646:   movss  %xmm0,(%r15,%r12,1)
       │         add    %rbp,%r14
       │         add    $0x4,%r12
(unrolled a couple times)

vs numpy:
 83.25 │190:   sqrtps (%rbx,%r12,4),%xmm0
  0.52 │       movaps %xmm0,0x0(%rbp,%r12,4)
 14.63 │       add    $0x4,%r12
  1.60 │       cmp    %rdx,%r12
       │     ↑ jb     190

(note the ps vs ss suffix, packed vs scalar)