On Wed, Apr 29, 2015 at 4:05 PM, simona bellavista wrote:
I work on two distinct scientific clusters. I have run the same Python
code on the two clusters, and I have noticed that one is faster than the other by an order of magnitude (1 min vs 10 min; this is important because I run this function many times).
I have investigated with a profiler and found that the cause (same code
and same data) is the function numpy.array, which is called 10^5 times. On cluster A it takes 2 s in total, whereas on cluster B it takes ~6 min. The other functions are generally faster on cluster A as well. I understand that the clusters are quite different, both in hardware and in installed libraries, but it strikes me that the performance of this particular function differs so much. I would have thought it was due to a difference in available memory, but according to `top` only about 0.1% of the memory is in use on cluster B. In theory numpy is compiled against ATLAS on cluster B; on cluster A it is not clear, because numpy.__config__.show() returns NOT AVAILABLE for everything.
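(For reference, the call pattern can be reproduced in isolation with something like the sketch below; the input list is illustrative, not my actual data:)

    import timeit
    import numpy as np

    # Illustrative input; the real data and its size will differ.
    row = [0.0] * 100

    # Time 10^5 calls to numpy.array, mirroring the profiled call count.
    t = timeit.timeit(lambda: np.array(row), number=10**5)
    print("10^5 numpy.array calls: %.2f s" % t)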
Does anybody have any insight into this, and into whether I can improve the
performance on cluster B?

Check to see if you have the "Transparent Hugepages" (THP) Linux kernel feature enabled on each cluster. You may want to try turning it off. I have recently run into a problem with a large-memory multicore machine with THP for programs that had many large numpy.array() memory allocations. Usually, THP helps memory-hungry applications (you can Google for the reasons), but it does require defragmenting the memory space to get contiguous hugepages. The system can get into a state where the memory space is so fragmented that getting each new hugepage requires a lot of extra work to create the contiguous memory regions. In my case, a perfectly well-performing program would suddenly slow down immensely during its memory-allocation-intensive actions. When I turned THP off, it started working normally again.

If you have root, try using `perf top` to see what C functions in user space and kernel space are taking up the most time in your process. If you see anything like `do_page_fault()`, this, or a similar issue, is your problem.
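To check the current THP state, a minimal sketch (assuming the usual mainline sysfs paths; some distros, e.g. RHEL 6, put these files elsewhere):

    # Print the Transparent Hugepages settings. The active value is the
    # one shown in brackets, e.g. "always [madvise] never".
    for name in ("enabled", "defrag"):
        path = "/sys/kernel/mm/transparent_hugepage/" + name
        try:
            with open(path) as f:
                print("%s: %s" % (name, f.read().strip()))
        except IOError:
            print("%s: no THP sysfs entry at %s" % (name, path))

Turning it off requires root, e.g. by writing "never" to the "enabled" file.

-- Robert Kern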