[Numpy-discussion] performance of numpy.array()

Julian Taylor jtaylor.debian at googlemail.com
Wed Apr 29 14:13:59 EDT 2015


On 29.04.2015 17:50, Robert Kern wrote:
> On Wed, Apr 29, 2015 at 4:05 PM, simona bellavista <afylot at gmail.com
> <mailto:afylot at gmail.com>> wrote:
>>
>> I work on two distinct scientific clusters. I have run the same python
> code on the two clusters and I have noticed that one is faster by an
> order of magnitude than the other (1min vs 10min, this is important
> because I run this function many times).
>>
>> I have investigated with a profiler and I have found that the cause
> (same code and same data) is the function numpy.array, which is being
> called 10^5 times. On cluster A it takes 2 s in total, whereas
> on cluster B it takes ~6 min.  For what regards the other functions,
> they are generally faster on cluster A. I understand that the clusters
> are quite different, both as hardware and installed libraries. It
> strikes me that on this particular function the performance is so
> different. I would have thought that this is due to a difference in the
> available memory, but actually by looking with `top` the memory seems to
> be used only at 0.1% on cluster B. In theory numpy is compiled with
> atlas on cluster B, and on cluster A it is not clear, because
> numpy.__config__.show() returns NOT AVAILABLE for anything.
>>
>> Does anybody have any insight on that, and whether I can improve the
> performance on cluster B?
> 
> Check to see if you have the "Transparent Hugepages" (THP) Linux kernel
> feature enabled on each cluster. You may want to try turning it off. I
> have recently run into a problem with a large-memory multicore machine
> with THP for programs that had many large numpy.array() memory
> allocations. Usually, THP helps memory-hungry applications (you can
> Google for the reasons), but it does require defragmenting the memory
> space to get contiguous hugepages. The system can get into a state where
> the memory space is so fragmented that trying to get each new
> hugepage requires a lot of extra work to create the contiguous memory
> regions. In my case, a perfectly well-performing program would suddenly
> slow down immensely during its memory-allocation-intensive actions.
> When I turned THP off, it started working normally again.
> 
> If you have root, try using `perf top` to see what C functions in user
> space and kernel space are taking up the most time in your process. If
> you see anything like `do_page_fault()`, this, or a similar issue, is
> your problem.
> 

This issue has nothing to do with THP; it is a change in numpy.array() in
numpy 1.9, which made it as fast as vstack, whereas before it was really
slow.
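
If you want to check which side of that change a given install is on,
a quick micro-benchmark along these lines should show it (the array
sizes here are made up, purely illustrative):

import timeit
import numpy as np

# a list of equally sized 1-D arrays, the kind of input np.array()
# turns into a 2-D array
chunks = [np.random.rand(1000) for _ in range(1000)]

t_array = timeit.timeit(lambda: np.array(chunks), number=20)
t_vstack = timeit.timeit(lambda: np.vstack(chunks), number=20)
print("np.array:  %.3f s" % t_array)
print("np.vstack: %.3f s" % t_vstack)

On 1.9 the two timings should be comparable; on older versions
np.array() comes out far slower.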

But the memory compaction is indeed awful, especially in the backport
Red Hat did for their Enterprise Linux.

Typically it is enough to disable only the automatic defragmentation on
allocation, not THP entirely, e.g. via
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
(on Red Hat backports it is a different path)
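
A small sketch to check the current settings from Python, assuming the
mainline sysfs paths (the Red Hat backport lives elsewhere, as noted
above):

for name in ("enabled", "defrag"):
    path = "/sys/kernel/mm/transparent_hugepage/" + name
    try:
        with open(path) as f:
            # the active choice is shown in brackets,
            # e.g. "always [madvise] never"
            print("%s: %s" % (name, f.read().strip()))
    except IOError:
        print("%s: not found at %s" % (name, path))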

You still have khugepaged running defrags at times of low load and in a
limited fashion. You can also manually trigger a defrag by writing to:
/proc/sys/vm/compact_memory
Though khugepaged, which runs only occasionally, should already do a
good job.
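
If you do want to force a compaction pass from a script, something like
this should work (requires root, path as on mainline kernels):

# trigger a one-off, system-wide memory compaction
with open("/proc/sys/vm/compact_memory", "w") as f:
    f.write("1")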



