[Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

Francesc Alted faltet at gmail.com
Fri Apr 18 07:39:22 EDT 2014

On 17/04/14 21:19, Julian Taylor wrote:
> On 17.04.2014 20:30, Francesc Alted wrote:
>> On 17/04/14 19:28, Julian Taylor wrote:
>>> On 17.04.2014 18:06, Francesc Alted wrote:
>>>> In [4]: x_unaligned = np.zeros(shape,
>>>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
>>> on arrays of this size you won't see alignment issues you are dominated
>>> by memory bandwidth. If at all you will only see it if the data fits
>>> into the cache.
>>> Its also about unaligned to simd vectors not unaligned to basic types.
>>> But it doesn't matter anymore on modern x86 cpus. I guess for array data
>>> cache line splits should also not be a big concern.
>> Yes, that was my point, that in x86 CPUs this is not such a big
>> problem.  But still a factor of 2 is significant, even for CPU-intensive
>> tasks.  For example, computing sin() is affected similarly (sin() is
>> using SIMD, right?):
>> In [6]: shape = (10000, 10000)
>> In [7]: x_aligned = np.zeros(shape,
>> dtype=[('x',np.float64),('y',np.int64)])['x']
>> In [8]: x_unaligned = np.zeros(shape,
>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
>> In [9]: %timeit res = np.sin(x_aligned)
>> 1 loops, best of 3: 654 ms per loop
>> In [10]: %timeit res = np.sin(x_unaligned)
>> 1 loops, best of 3: 1.08 s per loop
>> and again, numexpr can deal with that pretty well (using 8 physical
>> cores here):
>> In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
>> 10 loops, best of 3: 149 ms per loop
>> In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
>> 10 loops, best of 3: 151 ms per loop
> in this case the unaligned triggers a strided memcpy calling loop to
> copy the data into a aligned buffer which is terrible for performance,
> even compared to the expensive sin call.
> numexpr handles this well as it allows the compiler to replace the
> memcpy with inline assembly (a mov instruction).
> We could fix that in numpy, though I don't consider it very important,
> you usually always have base type aligned memory.

Well, that *could* be important for evaluating conditions in structured 
arrays, as it is pretty easy to end up with unaligned 'columns'. But 
apparently this does not affect numpy very much:

In [23]: na_aligned = np.fromiter((("", i, i*2) for i in xrange(N)), 

In [24]: na_unaligned = np.fromiter((("", i, i*2) for i in xrange(N)), 

In [25]: %time sum((r['f1'] for r in na_aligned[na_aligned['f2'] > 10]))
CPU times: user 10.2 s, sys: 93 ms, total: 10.3 s
Wall time: 10.3 s
Out[25]: 49999994999985

In [26]: %time sum((r['f1'] for r in na_unaligned[na_unaligned['f2'] > 10]))
CPU times: user 10.2 s, sys: 82 ms, total: 10.3 s
Wall time: 10.3 s
Out[26]: 49999994999985

probably because the bottleneck is somewhere else.  So yeah, it is 
probably not worth worrying about.
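
For anyone who wants to check whether a given 'column' is actually 
unaligned, here is a minimal sketch using the same two dtypes as in the 
sin() timings quoted above. The names aligned_dt, packed_dt and x_fixed 
are only illustrative, and the array is much smaller than the one that 
was timed:

```python
import numpy as np

n = 1000  # small on purpose; the timings above used (10000, 10000)

# Same two record layouts as in the quoted benchmark.
aligned_dt = np.dtype([('x', np.float64), ('y', np.int64)])
packed_dt = np.dtype([('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))])

x_aligned = np.zeros(n, dtype=aligned_dt)['x']
x_unaligned = np.zeros(n, dtype=packed_dt)['x']

# fields[name] is a (dtype, byte offset) pair; 'x' starts 1 byte into the record.
print("offset of 'x':", packed_dt.fields['x'][1], "itemsize:", packed_dt.itemsize)

# flags.aligned reports whether the data pointer and strides suit the dtype.
print("aligned view:  ", x_aligned.flags.aligned)
print("unaligned view:", x_unaligned.flags.aligned)

# An explicit copy produces a fresh, contiguous and aligned buffer.
x_fixed = x_unaligned.copy()
print("after copy():  ", x_fixed.flags.aligned)
```

The copy costs one pass over the data, so it only pays off when the 
column is reused several times afterwards.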

> (sin is not a SIMD using function unless you use a vector math library
> not supported by numpy directly yet)

Ah, so MKL uses SIMD for computing sin(), but numpy on its own does 
not.  But you later said that numpy's sqrt *is* making use of SIMD.  
I wonder why.

>>> Aligned allocators are not the only allocator which might be useful in
>>> numpy. Modern CPUs also support larger pages than 4K (huge pages up to
>>> 1GB in size) which reduces TLB cache misses. Memory of this type
>>> typically needs to be allocated with special mmap flags, though newer
>>> kernel versions can now also provide this memory to transparent
>>> anonymous pages (normal non-file mmaps).
>> That's interesting.  In which scenarios do you think that could improve
>> performance?
> it might improve all numpy operations dealing with big arrays.
> big arrays trigger many large temporaries meaning glibc uses mmap
> meaning lots of moving of address space between the kernel and userspace.
> but I haven't benchmarked it, so it could also be completely irrelevant.

I was curious about this, and apparently the speedup that large pages 
typically bring is around 5%:


not a big deal, but it is something.
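
As a side note, on Linux one can see whether the kernel hands out 
transparent huge pages at all by reading a sysfs file. The sketch below 
assumes the usual /sys/kernel/mm/transparent_hugepage/enabled location 
and the 'always [madvise] never' format, where the bracketed word is the 
active mode. The helper names parse_thp_mode and current_thp_mode are 
made up for this example:

```python
from pathlib import Path

# Assumed location of the transparent huge page setting on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed (active) mode from e.g. 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path=THP_ENABLED):
    """Return the active mode, or None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        # Not Linux, or a kernel built without transparent huge pages.
        return None


if __name__ == "__main__":
    print("transparent huge pages:", current_thp_mode())
```

This only reports the system policy; it says nothing about whether a 
particular numpy array ended up backed by huge pages.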

> Also memory fragments really fast, so after a few hours of operation you
> often can't allocate any huge pages anymore, so you need to reserve
> space for them which requires special setup of machines.
> Another possibility for special allocators are numa allocators that
> ensure you get memory local to a specific compute node regardless of the
> system numa policy.
> But again its probably not very important as python has poor thread
> scalability anyway, these are just examples for keeping flexibility of
> our allocators in numpy and not binding us to what python does.


> That's smart.  Yeah, I don't see a reason why numexpr would be
> performing badly on Ubuntu.  But I am not getting your performance for
> blocked_thread on my AMI linux vbox:
> http://nbviewer.ipython.org/gist/anonymous/11000524
> my numexpr amd64 package does not make use of SIMD e.g. sqrt which is
> vectorized in numpy:
> numexpr:
>    1.30 │ 4638:   sqrtss (%r14),%xmm0
>    0.01 │         ucomis %xmm0,%xmm0
>    0.00 │       ↓ jp     11ec4
>    4.14 │ 4646:   movss  %xmm0,(%r15,%r12,1)
>         │         add    %rbp,%r14
>         │         add    $0x4,%r12
> (unrolled a couple times)
> vs numpy:
>   83.25 │190:   sqrtps (%rbx,%r12,4),%xmm0
>    0.52 │       movaps %xmm0,0x0(%rbp,%r12,4)
>   14.63 │       add    $0x4,%r12
>    1.60 │       cmp    %rdx,%r12
>         │     ↑ jb     190
> (note the ps vs ss suffix, packed vs scalar)

Yup, I can reproduce that:

In [4]: a = np.random.rand(int(1e8))

In [5]: %timeit np.sqrt(a)
1 loops, best of 3: 558 ms per loop

In [6]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 249 ms per loop

In [7]: ne.set_num_threads(1)
Out[7]: 8

In [8]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 924 ms per loop

So yes, the non-SIMD version of sqrt in numexpr is considerably slower 
than the SIMD one in NumPy.  Of course, a numexpr compiled with MKL 
support can achieve performance similar to numpy's in single-thread 
mode:

In [4]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 191 ms per loop

In [5]: ne.set_num_threads(1)
Out[5]: 8

In [6]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 565 ms per loop

So sqrt in numpy runs at practically the same speed as the one in MKL. 
Again, I wonder why :)
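
In case somebody wants to repeat the comparison outside IPython, here is 
a rough script version of the sessions above. It is only a sketch: the 
best_of and bench_sqrt helpers are names invented here, numexpr is 
treated as optional, and the default array is smaller than the 1e8 
elements used for the timings. Note that ne.set_num_threads() returns the 
previous thread count (that is the 8 in the outputs above), which makes 
it easy to restore:

```python
import timeit

import numpy as np

try:
    import numexpr as ne  # optional; skipped when not installed
except ImportError:
    ne = None


def best_of(func, repeat=3):
    """Best wall-clock time of `repeat` single runs, like %timeit's 'best of 3'."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


def bench_sqrt(n=10**6):
    """Time sqrt over n random doubles with numpy and, if available, numexpr."""
    a = np.random.rand(n)
    results = {"numpy": best_of(lambda: np.sqrt(a))}
    if ne is not None:
        # Pass `a` explicitly: numexpr would otherwise look it up in the
        # lambda's own frame, where it is not defined.
        run = lambda: ne.evaluate("sqrt(a)", local_dict={"a": a})
        results["numexpr (all threads)"] = best_of(run)
        previous = ne.set_num_threads(1)  # returns the old setting
        try:
            results["numexpr (1 thread)"] = best_of(run)
        finally:
            ne.set_num_threads(previous)
    return results


if __name__ == "__main__":
    for name, seconds in bench_sqrt().items():
        print("%-22s %.1f ms" % (name, seconds * 1e3))
```

The absolute numbers will of course depend on the machine and on how 
numexpr was built (with or without MKL).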

Francesc Alted
