
I work on two distinct scientific clusters. I have run the same Python code on both clusters and noticed that one is faster than the other by an order of magnitude (1 min vs 10 min; this matters because I run this function many times).
I have investigated with a profiler and found that the cause (same code and same data) is the function numpy.array, which is called 10^5 times. On cluster A it takes 2 s in total, whereas on cluster B it takes ~6 min. As for the other functions, they are generally faster on cluster A. I understand that the clusters are quite different, both in hardware and in installed libraries. Still, it strikes me that the performance of this particular function is so different. I would have thought that this was due to a difference in available memory, but looking with `top`, memory usage is only about 0.1% on cluster B. In theory numpy is compiled with ATLAS on cluster B; on cluster A it is not clear, because numpy.__config__.show() returns NOT AVAILABLE for everything.
Does anybody have any insight into this, and can I improve the performance on cluster B?

In reply to simona bellavista <afylot@gmail.com> (2015-04-29 17:05 GMT+02:00):
Compile numpy yourself so that you know the limitations and benefits of the libraries it depends on.
Otherwise, have you checked which versions of numpy are installed, i.e. are they the same version?
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
-- Kind regards Nick
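One quick way to answer the version question is to run the same small script on both clusters and compare the output. The following is only a sketch; the helper name `describe_numpy` is made up for illustration and is not part of numpy.

```python
import sys

import numpy as np


def describe_numpy():
    """Return the Python and numpy versions as a small dict."""
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "numpy": np.__version__,
    }


if __name__ == "__main__":
    info = describe_numpy()
    print("python:", info["python"])
    print("numpy: ", info["numpy"])
    # Prints the BLAS/LAPACK libraries numpy reports it was built against
    np.__config__.show()
```

If the two clusters report different numpy versions, that is worth ruling out before looking at hardware or BLAS differences.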

In reply to Nick Papior Andersen <nickpapior@gmail.com> (2015-04-29 17:18 GMT+02:00):
On cluster A it is numpy 1.9.0 and on cluster B 1.8.2.

In reply to simona bellavista <afylot@gmail.com> (2015-04-29 17:40 GMT+02:00):
You could try installing your own numpy to check whether that resolves the problem.
-- Kind regards Nick

In reply to Nick Papior Andersen (Wed, 2015-04-29 at 17:41 +0200):
There was a major improvement to np.array in some cases.
You can probably work around this by using np.concatenate instead of np.array in your case. That depends on the use case, but I would guess you have code doing:
np.array([arr1, arr2, arr3])
or similar. If your use case is different, you may be out of luck and only an upgrade would help.
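A minimal sketch of the suggested workaround, assuming the slow call stacks a list of equally shaped arrays. The names `arr1`, `arr2`, `arr3` come from the message above; the sample values and the other names are made up for illustration.

```python
import numpy as np

# Three equally shaped 1-D arrays standing in for arr1, arr2, arr3
arr1 = np.arange(4.0)
arr2 = np.arange(4.0) + 10
arr3 = np.arange(4.0) + 20
arrays = [arr1, arr2, arr3]

# The pattern guessed at above: stack a list of arrays via np.array
stacked = np.array(arrays)

# Workaround: give each array a leading axis of length 1, then join
# them along that axis; this yields the same (3, 4) result
joined = np.concatenate([a[np.newaxis] for a in arrays], axis=0)

assert stacked.shape == joined.shape == (3, 4)
assert np.array_equal(stacked, joined)
```

For 1-D inputs, np.vstack(arrays) gives the same result. Whether either form is actually faster should be measured on the numpy version in question.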

In reply to Sebastian Berg <sebastian@sipsolutions.net> (2015-04-29 17:47 GMT+02:00):
I have seen a big improvement in performance with numpy 1.9.2 and Python 2.7.8: numpy.array takes 5 s instead of 300 s.
On the other hand, I have also tried numpy 1.9.2 and 1.9.0 with Python 3.4 and the results are terrible: numpy.array takes 20 s, and the other routines are slowed down as well, for example concatenate, astype, copy and uniform. Most of all, the sort function of numpy.ndarray is slower by a factor of at least 10.
On the other cluster I am using Python 3.3 with numpy 1.9.0 and it works very well (though I think that is also because of the hardware). I was trying to install Python 3.3 on this cluster, but because of other issues (a compile-time error in the h5py library and a runtime bug in the dill library) I cannot test it right now.
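To compare interpreters and numpy versions routine by routine, a small timing script run unchanged in each environment can help separate per-call cost from everything else in the program. This is only a sketch: the array sizes, the helper `time_call` and the choice of `np.sort` on a flattened copy are assumptions, not the original workload.

```python
import timeit

import numpy as np


def time_call(func, repeat=3, number=100):
    """Best wall-clock time in seconds for `number` calls of func."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


rng = np.random.RandomState(0)
rows = [rng.uniform(size=1000) for _ in range(10)]
block = np.array(rows)

# One entry per routine named in the message above
benchmarks = {
    "array": lambda: np.array(rows),
    "concatenate": lambda: np.concatenate(rows),
    "astype": lambda: block.astype(np.float32),
    "copy": lambda: block.copy(),
    "uniform": lambda: rng.uniform(size=1000),
    # np.sort works on a copy, so every call sorts the same unsorted data
    "sort": lambda: np.sort(block, axis=None),
}

results = {name: time_call(func) for name, func in benchmarks.items()}

for name in sorted(results):
    print("%-12s %.6f s" % (name, results[name]))
```

Taking the minimum over several repeats reduces noise from other jobs on a shared node, although it cannot remove it.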

In reply to simona bellavista <afylot@gmail.com> (Thu, Apr 30, 2015 at 10:03 AM):
I have had good luck with Continuum's Miniconda Python distributions on Linux.
http://conda.pydata.org/miniconda.html
The `conda` command makes it very easy to create specific testing environments for Python 2 and 3 with many different packages. Everything is precompiled, so you won't have to worry about system library differences between the two clusters.
Hope that helps.
Ryan

In reply to simona bellavista (29.04.2015 17:40):
numpy 1.9 makes array(list) similar in performance to vstack; in 1.8 it is very slow.

In reply to simona bellavista <afylot@gmail.com> (Wed, Apr 29, 2015 at 4:05 PM):
Check to see if you have the "Transparent Hugepages" (THP) Linux kernel feature enabled on each cluster. You may want to try turning it off. I have recently run into a problem on a large-memory multicore machine with THP, with programs that made many large numpy.array() memory allocations. Usually, THP helps memory-hungry applications (you can Google for the reasons), but it does require defragmenting the memory space to get contiguous hugepages. The system can get into a state where the memory space is so fragmented that getting each new hugepage requires a lot of extra work to create the contiguous memory regions. In my case, a perfectly well-performing program would suddenly slow down immensely during its memory-allocation-intensive actions. When I turned THP off, it started working normally again.
If you have root, try using `perf top` to see which C functions in user space and kernel space are taking up the most time in your process. If you see anything like `do_page_fault()`, then this, or a similar issue, is your problem.
-- Robert Kern
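Checking whether THP is enabled does not need root: on many Linux systems the current setting can be read from sysfs, where the active choice is shown in square brackets (for example `[always] madvise never`). The sketch below assumes the directory /sys/kernel/mm/transparent_hugepage, which may differ on some distributions. The function names are made up for illustration.

```python
import os

# Assumed location of the THP knobs; pass a different `base` if needed
THP_DIR = "/sys/kernel/mm/transparent_hugepage"


def active_choice(text):
    """Return the bracketed token, e.g. 'always' from '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(name, base=THP_DIR):
    """Read one THP knob ('enabled' or 'defrag'); None if it cannot be read."""
    path = os.path.join(base, name)
    try:
        with open(path) as handle:
            return active_choice(handle.read())
    except (IOError, OSError):
        return None


if __name__ == "__main__":
    for knob in ("enabled", "defrag"):
        print(knob, "->", read_thp_setting(knob))
```

Running this on both clusters shows whether their THP settings differ at all.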

In reply to Robert Kern (29.04.2015 17:50):
This issue has nothing to do with THP; it is a change to array in numpy 1.9. It is now as fast as vstack, while before it was really, really slow.
But the memory compaction is indeed awful, especially the backport Red Hat did for their Enterprise Linux. Typically it is enough to disable only the automatic defragmentation on allocation, not THP as a whole, e.g. via
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
(on the Red Hat backports it is a different path). You still have khugepaged running defrags at times of low load and in a limited fashion. You can also manually trigger a defrag by writing to:
/proc/sys/vm/compact_memory
Though khugepaged, which runs only occasionally, should already do a good job.
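To see whether compaction is active during a slow section, one option is to snapshot the kernel's counters before and after it. The sketch below assumes a Linux kernel that exposes `compact_*` and `thp_*` counters in /proc/vmstat; the exact counter names vary between kernel versions, and the helper names are made up for illustration.

```python
def parse_vmstat(text):
    """Parse 'name value' lines into a dict of ints, skipping malformed lines."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def compaction_counters(path="/proc/vmstat"):
    """Return counters whose names start with 'compact_' or 'thp_'.

    Returns an empty dict if the file cannot be read (e.g. not on Linux).
    """
    try:
        with open(path) as handle:
            counters = parse_vmstat(handle.read())
    except (IOError, OSError):
        return {}
    return {name: value for name, value in counters.items()
            if name.startswith(("compact_", "thp_"))}


if __name__ == "__main__":
    for name, value in sorted(compaction_counters().items()):
        print(name, value)
```

A large jump in a counter such as `compact_stall` across the slow section would point towards compaction, while unchanged counters would point back to the numpy version difference.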
participants (6)
- Julian Taylor
- Nick Papior Andersen
- Robert Kern
- Ryan Nelson
- Sebastian Berg
- simona bellavista