numpy speed question
Hello all,
I have a little question about the speed of numpy vs IDL 7.0. I did a very simple check by computing just a cosine in a loop. I was quite surprised to see an order of magnitude of difference between numpy and IDL; I would have thought that for such a basic function the speed would be approximately the same.
I suppose that some of the difference may come from the default data type: 64 bits in numpy and 32 bits in IDL. Is there a way to change the numpy default data type (without recompiling)?
And I'm not an expert at all; maybe there is a better explanation, like better use of the several CPU cores by IDL?
I'm working with Windows 7 64 bits on a Core i7.
Any hint is welcome. Thanks.
Here is the IDL code:

Julian1 = SYSTIME( /JULIAN , /UTC )
for j=0,9999 do begin
  for i=0,999 do begin
    a = cos(2*!pi*i/100.)
  endfor
endfor
Julian2 = SYSTIME( /JULIAN , /UTC )
print, (Julian2-Julian1)*86400.0
end
Result:

% Compiled module: $MAIN$.
2.9999837
The Python code:

from numpy import *
from time import time

time1 = time()
for j in range(10000):
    for i in range(1000):
        a = cos(2*pi*i/100.)
time2 = time()
print time2 - time1
Result:

In [2]: run python_test_speed.py
24.1809999943
Using math.cos instead of numpy.cos should be much faster for scalar arguments. I believe this is a known issue with numpy.
On Thu, Nov 25, 2010 at 11:13 AM, Jean-Luc Menut <jeanluc.menut@free.fr> wrote:
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
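The scalar-call overhead behind this advice is easy to measure directly. A minimal sketch, in modern Python 3 syntax rather than the thread's Python 2 (absolute timings are machine-dependent):

```python
import math
import timeit

import numpy as np

# math.cos operates directly on a C double; np.cos first wraps the
# scalar in a zero-dimensional array, which adds per-call overhead.
t_math = timeit.timeit("math.cos(0.5)", globals={"math": math}, number=100_000)
t_np = timeit.timeit("np.cos(0.5)", globals={"np": np}, number=100_000)

print("math.cos: %.4f s, np.cos: %.4f s" % (t_math, t_np))
```

Both return the same value; only the per-call cost differs.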
Jean-Luc Menut <jeanluc.menut <at> free.fr> writes:
I have a little question about the speed of numpy vs IDL 7.0.
Here is the IDL result:

% Compiled module: $MAIN$.
2.9999837
The Python code:

from numpy import *
from time import time

time1 = time()
for j in range(10000):
    for i in range(1000):
        a = cos(2*pi*i/100.)
time2 = time()
print time2 - time1
Result:

In [2]: run python_test_speed.py
24.1809999943
Whilst you've imported everything from numpy, you're not really using numpy: you're still using a slow Python (double) loop. The power of numpy comes from vectorising your code, i.e. applying functions to arrays of data.
The example below demonstrates an 80-fold increase in speed by vectorising the calculation:
def method1():
    a = empty([1000, 10000])
    for j in range(10000):
        for i in range(1000):
            a[i,j] = cos(2*pi*i/100.)
    return a

def method2():
    ij = np.repeat((2*pi*np.arange(1000)/100.)[:,None], 10000, axis=1)
    return np.cos(ij)
In [46]: timeit method1()
1 loops, best of 3: 47.9 s per loop
In [47]: timeit method2()
1 loops, best of 3: 589 ms per loop
In [48]: allclose(method1(), method2())
Out[48]: True
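A further refinement, not from the thread: since every column of the result is identical, both the np.repeat copy and the cosine of all 10 million elements can be avoided with broadcasting. A sketch (method3 is my name, not the thread's):

```python
import numpy as np

def method3(n_i=1000, n_j=10000):
    # Each column of the 1000 x 10000 result is the same 1000 cosines,
    # so compute them once and broadcast a read-only view of the column
    # vector instead of materialising the full argument array.
    phase = 2 * np.pi * np.arange(n_i) / 100.0
    return np.broadcast_to(np.cos(phase)[:, None], (n_i, n_j))
```

This computes 1000 cosines instead of 10 million, at the cost of returning a read-only view.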
Hi,
25/11/10 @ 11:13 (+0100), thus spake Jean-Luc Menut:
I suppose that some of the difference may come from the default data type: 64 bits in numpy and 32 bits in IDL. Is there a way to change the numpy default data type (without recompiling)?
This is probably not the issue.
And I'm not an expert at all; maybe there is a better explanation, like better use of the several CPU cores by IDL?
I'm not an expert either, but the basic idea you have to get is that "for" loops in Python are slow, and numpy is not going to change this. Instead, numpy allows you to work with "vectors" and "arrays" so that you don't need to put loops in your code. So you have to change the way you think about things; it takes a little while to get used to at first.
Cheers,
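To make the shift in thinking concrete, here is a small sketch (mine, not from the thread) of the same computation written both ways; the two functions return the same values:

```python
import numpy as np

def loop_version(n):
    # One Python-level call to np.cos per element: slow.
    return [np.cos(2 * np.pi * i / 100.0) for i in range(n)]

def vector_version(n):
    # One call on a whole array: the loop runs in compiled code.
    return np.cos(2 * np.pi * np.arange(n) / 100.0)
```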
On 25/11/2010 11:51, Ernest Adrogué wrote:
I'm not an expert either, but the basic idea you have to get is that "for" loops in Python are slow, and numpy is not going to change this. Instead, numpy allows you to work with "vectors" and "arrays" so that you don't need to put loops in your code. So you have to change the way you think about things; it takes a little while to get used to at first.
Yes, I know, but IDL shares this characteristic with numpy, and sometimes you cannot avoid loops. Anyway, it was just a test to compare the speed of the cosine function in IDL and numpy.
On 11/25/2010 5:55 AM, Jean-Luc Menut wrote:
it was just a test to compare the speed of the cosine function in IDL and numpy
The point others are trying to make is that you *instead* tested the speed of creation of a certain object type. To test the *function* speeds, feed both large arrays.
>>> type(0.5)
<type 'float'>
>>> type(math.cos(0.5))
<type 'float'>
>>> type(np.cos(0.5))
<type 'numpy.float64'>
hth, Alan Isaac
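Following that suggestion, a sketch (not from the thread) that feeds both implementations a large array; here numpy's single vectorised call wins by a wide margin, the opposite of the scalar case:

```python
import math
import timeit

import numpy as np

x = 2 * np.pi * np.arange(100_000) / 100.0

# One vectorised call over the whole array...
t_vec = timeit.timeit(lambda: np.cos(x), number=5)

# ...versus 100,000 scalar math.cos calls in a Python-level loop.
t_loop = timeit.timeit(lambda: [math.cos(v) for v in x], number=5)

print("np.cos(array): %.4f s, math.cos loop: %.4f s" % (t_vec, t_loop))
```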
On Thu, Nov 25, 2010 at 7:55 PM, Jean-Luc Menut <jeanluc.menut@free.fr> wrote:
Yes, I know, but IDL shares this characteristic with numpy, and sometimes you cannot avoid loops. Anyway, it was just a test to compare the speed of the cosine function in IDL and numpy.
No, you compared IDL looping and Python looping; you did not even use numpy. Loops are slow in Python and will remain so in the near future. OTOH, there are many ways to deal with this issue in Python compared to IDL (Cython being a fairly popular one).
David
On Thu, Nov 25, 2010 at 4:13 AM, Jean-Luc Menut <jeanluc.menut@free.fr> wrote:
The vectorised numpy version already blows these results away.
Here is what I get using the IDL version (with IDL v7.1):
IDL> .r test_idl
% Compiled module: $MAIN$.
4.0000185
I[10]: time run test_python
43.305727005
and using a Cythonized version:
from math import pi

cdef extern from "math.h":
    float cos(float)

cpdef float myloop(int n1, int n2, float n3):
    cdef float a
    cdef int i, j
    for j in range(n1):
        for i in range(n2):
            a = cos(2*pi*i/n3)
Compiling with the setup.py file (python setup.py build_ext --inplace) and importing the function into IPython:
from mycython import myloop
I[6]: timeit myloop(10000, 1000, 100.0)
1 loops, best of 3: 2.91 s per loop
Although this was mentioned earlier, it's worth emphasizing that if you need to use functions such as cosine with scalar arguments, you should use math.cos(), not numpy.cos(). The numpy versions of these functions are optimized for handling array arguments and are much slower than the math versions for scalar arguments.
Bruce Sherwood
On 26/11/2010 17:48, Bruce Sherwood wrote:
Although this was mentioned earlier, it's worth emphasizing that if you need to use functions such as cosine with scalar arguments, you should use math.cos(), not numpy.cos(). The numpy versions of these functions are optimized for handling array arguments and are much slower than the math versions for scalar arguments.
Yes, I understand that. I just want to stress that it was not a benchmark (nor a criticism), but a test to know whether it was worth translating IDL code directly into Python/numpy before trying to optimize it (I know Python better than IDL). I expected approximately the same speed for both, was surprised by the result, and wanted to know if there was an obvious reason besides the lack of optimization for scalars.
On Thursday 25 November 2010 11:13:49, Jean-Luc Menut wrote:
And I'm not an expert at all; maybe there is a better explanation, like better use of the several CPU cores by IDL?
As others have already pointed out, you should make sure that you use numpy.cos with arrays in order to get good performance.
I don't know whether IDL is using multiple cores or not, but if you are looking for ultimate performance, you can always use Numexpr, which makes use of multiple cores. For example, on a machine with 8 cores (w/ hyperthreading), we have:
from math import pi
import numpy as np
import numexpr as ne

i = np.arange(1e6)

%timeit np.cos(2*pi*i/100.)
10 loops, best of 3: 85.2 ms per loop
%timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 8.28 ms per loop
If you don't have a machine with a lot of cores but still want good performance, you can link Numexpr against Intel's VML (Vector Math Library). For example, using Numexpr+VML with only one core (on another machine):
%timeit np.cos(2*pi*i/100.)
10 loops, best of 3: 66.7 ms per loop
ne.set_vml_num_threads(1)
%timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 9.1 ms per loop
which also gives a pretty good speedup. Curiously, Numexpr+VML is not that good at using multiple cores in this case:
ne.set_vml_num_threads(2)
%timeit ne.evaluate("cos(2*pi*i/100.)")
10 loops, best of 3: 14.7 ms per loop
I don't really know why Numexpr+VML takes more time with 2 threads than with only one, but it is probably because Numexpr requires better fine-tuning in combination with VML :/
participants (9)

- Alan G Isaac
- Bruce Sherwood
- Dave Hirschfeld
- David Cournapeau
- Ernest Adrogué
- Francesc Alted
- Gökhan Sever
- Jean-Luc Menut
- Sebastian Walter