Comparing NumPy/IDL Performance
Hi all,

Several colleagues and I have recently started work on a Python library for solar physics (http://www.sunpy.org/), in order to provide an alternative to the current mainstay for solar physics (http://www.lmsal.com/solarsoft/), which is written in IDL.

One of the first steps we have taken is to create a Python port (https://github.com/sunpy/sunpy/blob/master/benchmarks/time_test3.py) of a popular benchmark for IDL (time_test3) which measures performance for a variety of (primarily matrix) operations. In our initial attempt, however, Python performs significantly worse than IDL on several of the tests. I have attached a graph which shows the results for one machine: the x-axis is the test # being compared, and the y-axis is the time it took to complete the test, in milliseconds. While it is possible that this is simply due to limitations in Python/NumPy, I suspect that it is due at least in part to our lack of familiarity with NumPy and SciPy.

So my question is: does anyone see any places where we are doing things very inefficiently in Python?

In order to ensure a fair comparison between IDL and Python, there are some things (e.g. the style of timing and output) which we have deliberately chosen to do a certain way. In other cases, however, it is likely that we just didn't know a better method.

Any feedback or suggestions people have would be greatly appreciated. Unfortunately, due to the proprietary nature of IDL, we cannot share the original version of time_test3, but hopefully the comments in time_test3.py will be clear enough.

Thanks!
Keith
hi Keith,

I do not think that your primary concern should be with this kind of speed test at this stage:

1/ rest assured that this sort of test has been performed in other contexts, and you can always do some hard work on high-level computing languages like IDL and Python to improve performance;

2/ "premature optimization is the root of all evil" (Knuth);

3/ I believe that your primary motivation is to provide an alternative library to a piece of proprietary software. If this is so, then your effort is most welcome, and I would suggest first porting an interesting but small piece of the IDL solar physics library, and then studying the path to speed improvements on such a concrete use case.

As for your Python time_test3: if it is a benchmark code proprietary to the IDL codebase, it is no wonder IDL performs well on it! :)

At any rate, I would suggest simplifying your code with ipython:

    In [1]: import numpy as np
    In [2]: a = np.zeros([512, 512], dtype=np.uint8)
    In [3]: a[200:250, 200:250] = 10
    In [4]: from scipy import ndimage
    In [5]: %timeit ndimage.filters.median_filter(a, size=(5, 5))
    10 loops, best of 3: 98 ms per loop

I am not sure what unit your vertical axis is in....

best,
Johann

On 09/26/2011 04:19 PM, Keith Hughitt wrote:
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Mon, Sep 26, 2011 at 3:19 PM, Keith Hughitt
So my question is, does anyone see any places where we are doing things very inefficiently in Python?
Looking at the plot there are five stand-out tests: 1, 2, 3, 6 and 21. Tests 1, 2 and 3 are testing Python itself (no NumPy or SciPy), but are things you should be avoiding when using NumPy anyway (don't use loops, use vectorised calculations, etc.).

This is test 6:

    #Test 6 - Shift 512 by 512 byte and store
    nrep = 300 * scale_factor
    for i in range(nrep):
        c = np.roll(np.roll(b, 10, axis=0), 10, axis=1)  #pylint: disable=W0612
    timer.log('Shift 512 by 512 byte and store, %d times.' % nrep)

The precise contents of b are determined by the previous tests (is that deliberate? it makes testing it in isolation hard). I'm unsure what you are trying to do and whether this is the best way.

This is test 21, which is just calling a SciPy function repeatedly. Questions about this might be better directed to the scipy mailing list -- also check what version of SciPy etc. you have.

    n = 2**(17 * scale_factor)
    a = np.arange(n, dtype=np.float32)
    ...
    #Test 21 - Smooth 512 by 512 byte array, 5x5 boxcar
    for i in range(nrep):
        b = scipy.ndimage.filters.median_filter(a, size=(5, 5))
    timer.log('Smooth 512 by 512 byte array, 5x5 boxcar, %d times' % nrep)

After that, tests 10, 15 and 18 stand out. Test 10 is another use of roll, so whatever advice you get on test 6 may apply. Test 10:

    #Test 10 - Shift 512 x 512 array
    nrep = 60 * scale_factor
    for i in range(nrep):
        c = np.roll(np.roll(b, 10, axis=0), 10, axis=1)
    #for i in range(nrep): c = d.rotate(
    timer.log('Shift 512 x 512 array, %d times' % nrep)

Test 15 is a loop-based version of 16, where Python wins. Test 18 is a loop-based version of 19 (log), where the difference is small.

So in terms of NumPy speed, your question just seems to be about numpy.roll and how else one might achieve this result?

Peter
On Mon, Sep 26, 2011 at 8:19 AM, Keith Hughitt
The first three tests are of Python loops over python lists, so I'm not much surprised at the results. Number 6 uses numpy roll, which is not implemented in a particularly efficient way, so could use some improvement. I haven't looked at the rest of the results, but I suspect they are similar. So in some cases I think the benchmark isn't particularly useful, but in a few others numpy could be improved. It would be interesting to see which features are actually widely used in IDL code and weight them accordingly. In general, for loops are to be avoided, but if some numpy routine is a bottleneck we should fix it. Chuck
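To make the roll discussion concrete, here is a hypothetical slice-and-concatenate shift (my own sketch, not code from the thread or from SunPy). It produces the same result as the double np.roll used in tests 6 and 10, and depending on the NumPy version it may avoid some of roll's generic indexing overhead:

```python
import numpy as np

def roll2d(a, s0, s1):
    """Circular 2-D shift equivalent to
    np.roll(np.roll(a, s0, axis=0), s1, axis=1),
    built from plain slicing and concatenation."""
    s0 %= a.shape[0]
    s1 %= a.shape[1]
    if s0:  # move the last s0 rows to the front
        a = np.concatenate((a[-s0:], a[:-s0]), axis=0)
    if s1:  # move the last s1 columns to the front
        a = np.concatenate((a[:, -s1:], a[:, :-s1]), axis=1)
    return a

b = np.arange(512 * 512, dtype=np.uint16).reshape(512, 512)
assert np.array_equal(roll2d(b, 10, 10),
                      np.roll(np.roll(b, 10, axis=0), 10, axis=1))
```

Whether this actually beats np.roll is worth timing on your own machine; it copies the same amount of data, just with less indexing machinery per call.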
Hello Keith,

While I also echo Johann's points about the arbitrariness and non-utility of benchmarking, I'll briefly comment on just a few tests to help out with getting things into idiomatic Python/NumPy.

Tests 1 and 2 are fairly pointless (an empty for loop and an empty procedure) and won't actually influence the running time of well-written, non-pathological code.

Test 3:

    #Test 3 - Add 200000 scalar ints
    nrep = 2000000 * scale_factor
    for i in range(nrep):
        a = i + 1

Well, Python looping is slow... one doesn't write such loops in idiomatic code if the underlying intent can be re-cast into array operations in NumPy. But here the test is on such a simple operation that it's not clear how to recast it in a way that would remain reasonable. Ideally you'd test something like:

    i = numpy.arange(200000)
    for j in range(scale_factor):
        a = i + 1

but that sort of changes what the test is testing.

Finally, test 21:

    #Test 21 - Smooth 512 by 512 byte array, 5x5 boxcar
    for i in range(nrep):
        b = scipy.ndimage.filters.median_filter(a, size=(5, 5))
    timer.log('Smooth 512 by 512 byte array, 5x5 boxcar, %d times' % nrep)

A median filter is definitely NOT a boxcar filter! You want uniform_filter:

    In [4]: a = numpy.empty((1000,1000))
    In [5]: timeit scipy.ndimage.filters.median_filter(a, size=(5, 5))
    10 loops, best of 3: 93.2 ms per loop
    In [6]: timeit scipy.ndimage.filters.uniform_filter(a, size=(5, 5))
    10 loops, best of 3: 27.7 ms per loop

Zach

On Sep 26, 2011, at 10:19 AM, Keith Hughitt wrote:
On Mon, Sep 26, 2011 at 8:24 AM, Zachary Pincus
Test 3:

    #Test 3 - Add 200000 scalar ints
    nrep = 2000000 * scale_factor
    for i in range(nrep):
        a = i + 1
well, python looping is slow... one doesn't do such loops in idiomatic code if the underlying intent can be re-cast into array operations in numpy.
Also, in this particular case, what you're mostly measuring is how much time it takes to allocate a giant list of integers by calling 'range'. Using 'xrange' instead speeds things up by a factor of two:

    def f():
        nrep = 2000000
        for i in range(nrep):
            a = i + 1

    def g():
        nrep = 2000000
        for i in xrange(nrep):
            a = i + 1

    In [8]: timeit f()
    10 loops, best of 3: 138 ms per loop
    In [9]: timeit g()
    10 loops, best of 3: 72.1 ms per loop

Usually I don't worry about the difference between xrange and range -- it doesn't really matter for small loops or loops that are doing more work inside each iteration, and that's every loop I actually write in practice :-). And if I really did need to write a loop like this (lots of iterations with a small amount of work in each, and speed is critical) then I'd use Cython. But you might as well get in the habit of using 'xrange'; it won't hurt and occasionally will help.

-- Nathaniel
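As a self-contained illustration of the loop-versus-array point above (my own sketch; the function names are hypothetical, and note that in modern Python 3 range became lazy like the old xrange, so that particular tip no longer applies):

```python
import numpy as np

def loop_add(n):
    # Pure-Python flavour of test 3: one scalar addition per iteration,
    # with interpreter overhead paid on every pass through the loop.
    a = 0
    for i in range(n):
        a = i + 1
    return a

def vector_add(n):
    # Vectorized equivalent: a single array operation does all n additions.
    return np.arange(n) + 1

# Both compute the same values; the array version amortizes the
# interpreter overhead over the whole batch.
assert loop_add(200000) == 200000
assert vector_add(200000)[-1] == 200000
```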
One minor thing: you should use xrange rather than range, although it will probably only make a difference for the empty loop ;)
Otherwise, from what I can see, tests where numpy is really much worse are:
- 1, 2, 3, 15, 18: Not numpy but Python related: for loops are not efficient
- 6, 10: Maybe numpy.roll is indeed not efficiently implemented
- 21: Same for this scipy function
-=- Olivier
2011/9/26 Keith Hughitt
Thank you all for the comments and suggestions.

First off, I would like to say that I entirely agree with people's suggestions about the lack of objectiveness in the test design, and the caveat about optimizing early. The main reason we put together the Python version of the benchmark was as a quick "sanity check" to make sure that there are no major show-stoppers before we began work on the library. We also wanted to put together something to show other people who are firmly in the IDL camp that this is a viable option.

We did in fact put together another short test-suite (test_testr.py & time_testr.pro, https://github.com/sunpy/sunpy/tree/master/benchmarks) which consists of operations that are frequently used by us, but it also tests only a very small portion of the kinds of things our library will eventually do.

That said, I made a few small changes to the original benchmark, based on people's feedback, and put together a new plot. The changes made include:

1. Using xrange instead of range
2. Using a uniform filter instead of a median filter
3. Fixing a typo for tests 2 & 3 which resulted in slower Python results

Again, note that some of the tests are testing non-numpy functionality. Several of the results still stand out, but overall the results are much more reasonable than before.

Cheers,
Keith
I think the remaining delta between the integer and float "boxcar" smoothing is that the integer version (test 21) still uses median_filter(), while the float one (test 22) is using uniform_filter(), which is a boxcar. Other than that and the slow roll() implementation in numpy, things look pretty solid, yes? Zach On Sep 29, 2011, at 12:11 PM, Keith Hughitt wrote:
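A quick self-contained check of the median-versus-boxcar distinction Zach describes (my own sketch, written against the top-level scipy.ndimage names rather than the scipy.ndimage.filters path used in the thread):

```python
import numpy as np
from scipy import ndimage

# Same setup as test 21: a 512 x 512 byte array with a bright block.
a = np.zeros((512, 512), dtype=np.uint8)
a[200:250, 200:250] = 10

med = ndimage.median_filter(a, size=5)   # rank filter (what test 21 timed)
box = ndimage.uniform_filter(a, size=5)  # true 5x5 boxcar average

# Deep inside the constant block, both the median and the mean of a
# 5x5 neighbourhood are exactly 10, and both filters preserve dtype.
assert med[225, 225] == 10 and box[225, 225] == 10
assert med.dtype == np.uint8 and box.dtype == np.uint8
```

The two filters only agree on constant regions like this one; on real data the median is a rank statistic and the boxcar an average, which is why timing one as a stand-in for the other is misleading.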
Ah. Thanks for catching that!
Otherwise though I think everything looks pretty good.
Thanks all,
Keith
On Thu, Sep 29, 2011 at 12:18 PM, Zachary Pincus
Just want to point to some excellent material that was recently presented at the course Advanced Scientific Programming in Python (https://python.g-node.org/wiki/) at St Andrews. Day 3 was titled "The Quest for Speed" (see https://python.g-node.org/wiki/schedule) and might interest you as well.
Regards,
David
On 29 September 2011 20:46, Keith Hughitt
participants (8)
- Charles R Harris
- David Verelst
- Johann Cohen-Tanugi
- Keith Hughitt
- Nathaniel Smith
- Olivier Delalleau
- Peter
- Zachary Pincus