Speed bottlenecks on simple tasks - suggested improvement
Hello,

First a quick summary of my problem; at the end I include the basic changes I am suggesting to the source (they may benefit others).

I am ages behind the times and am still using Numeric with Python 2.2.3. The main reason it has taken so long to upgrade is that NumPy kills performance on several of my tests.

I am sorry if this topic has been discussed before. I tried searching the mailing list and Google, and all I found were comments to the effect that such is life when you use NumPy for small arrays.

In my case I have several thousand lines of code whose data structures rely heavily on Numeric arrays, but it is unpredictable whether the problem at hand will result in large or small arrays. Furthermore, once the vectorized operations complete, the values may be assigned to scalars and used in simple math or loops. I am fairly sure the core of my problem is that 'float64' objects start propagating all over the program's data structures (not in arrays), and they are considerably slower for just about everything when compared to the native Python float.

Conclusion: it is not practical for me to do a massive restructuring of code to improve speed on simple things like "a[0] < 4" (assuming "a" is an array), which is about 10 times slower than "b < 4" (assuming "b" is a float).

I finally decided to track down the problem, and I started by getting Python 2.6 from source and profiling it on one of my cases. By far the biggest bottleneck came out to be PyString_FromFormatV, which is the function that assembles the message string for the Python error raised when "multiarray" calls PyObject_GetAttrString and fails to find the attribute. This function seems to get called way too often from NumPy. The real cost of looking up an attribute that does not exist is not the failed lookup itself, but building a string to set a Python error. In other words, something as simple as "a[0] < 3.5" internally results in a call to set a Python error.

I downloaded the NumPy code (for Python 2.6) and tracked down all the calls like this,

ret = PyObject_GetAttrString(obj, "__array_priority__");

and changed them to

if (PyList_CheckExact(obj) || (Py_None == obj) ||
    PyTuple_CheckExact(obj) || PyFloat_CheckExact(obj) ||
    PyInt_CheckExact(obj) || PyString_CheckExact(obj) ||
    PyUnicode_CheckExact(obj)) {
    //Avoid expensive calls when I am sure the attribute
    //does not exist
    ret = NULL;
}
else {
    ret = PyObject_GetAttrString(obj, "__array_priority__");
}

(I think I found about 7 spots.)

I also noticed (not as bad in my case) that calls to PyObject_GetBuffer also resulted in Python errors being set, making code unnecessarily slower.

With this change, something like this,

for i in xrange(1000000):
    if a[1] < 35.0:
        pass

went down from 0.8 seconds to 0.38 seconds.

A bogus test like this,

for i in xrange(1000000):
    a = array([1., 2., 3.])

went down from 8.5 seconds to 2.5 seconds.

Altogether, these simple changes got me halfway to the speed I used to get with Numeric, and I could not see any slowdown in any of my cases that benefit from heavy array manipulation. I am out of ideas on how to improve further, though.

A few questions:

- Is there any interest in me providing the exact details of the code I changed?

- I managed to compile NumPy through setup.py, but I am not sure how to force it to generate pdb files with my Visual Studio compiler. I need the pdb files so that I can run my profiler on NumPy. Does anybody have experience with this? (Visual Studio)

- The core of my problems, I think, boils down to things like s = a[0] assigning a float64 to s as opposed to a native float. Is there any way to hack the code so that it extracts a native float instead? (Probably crazy talk, but I thought I'd ask :) .) I'd prefer not to use s = a.item(0) because I would have to change too much code and it is not even that much faster. For example,

for i in xrange(1000000):
    if a.item(1) < 35.0:
        pass

takes 0.23 seconds (as opposed to 0.38 seconds with my suggested changes).

I apologize again if this topic has already been discussed.

Regards,

Raul
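[For reference, a minimal sketch of this kind of micro-benchmark using the timeit module. It is illustrative only: the array contents and iteration count are assumptions, not the exact tests quoted above, but it shows the float64-scalar vs native-float comparison gap being described.]

import timeit

setup = "from numpy import array; a = array([1., 2., 3.]); b = 2.0"

# comparison that pulls a float64 scalar out of the array each iteration
t_array = timeit.timeit("a[1] < 35.0", setup=setup, number=1000000)

# the same comparison on a native Python float
t_float = timeit.timeit("b < 35.0", setup=setup, number=1000000)

print("a[1] < 35.0 :", t_array)
print("b < 35.0    :", t_float)
print("ratio       :", t_array / t_float)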
On 12/2/2012 5:28 PM, Raul Cota wrote:
- I managed to compile NumPy through setup.py, but I am not sure how to force it to generate pdb files with my Visual Studio compiler. I need the pdb files so that I can run my profiler on NumPy. Does anybody have experience with this? (Visual Studio)
Change the compiler and linker flags in Python\Lib\distutils\msvc9compiler.py to:

self.compile_options = ['/nologo', '/Ox', '/MD', '/W3', '/DNDEBUG', '/Zi']
self.ldflags_shared = ['/DLL', '/nologo', '/INCREMENTAL:YES', '/DEBUG']

Then rebuild numpy.

Christoph
Thanks Christoph. It seemed to work. Will do profile runs today/tomorrow and see what comes out.

Raul
Raul,

This is *fantastic work*. While many optimizations were done 6 years ago as people started to convert their code, that kind of report has trailed off in the last few years. I have not seen this kind of speed comparison for some time --- but I think it's definitely beneficial. NumPy still has quite a bit that can be optimized.

I think your example is really great. Perhaps it's worth making a C-API macro out of the short-cut to the attribute string so it can be used by others. It would be interesting to see where your other slow-downs are. I would be interested to see if the slow math of float64 is hurting you. It would be possible, for example, to do a simple subclass of the ndarray that overloads a[<integer>] to be the same as array.item(<integer>). The latter syntax returns Python objects (i.e. floats) instead of array scalars. Also, it would not be too difficult to add fast-math paths for int64, float32, and float64 scalars (so they don't go through ufuncs but do scalar math like the float and int objects in Python).

A related thing we've been working on lately which might help you is Numba, which can speed up functions that have code like "a[0] < 4": http://numba.pydata.org. Numba will translate the expression a[0] < 4 to a machine-code address lookup and math operation, which is *much* faster when a is a NumPy array. Presently this requires you to wrap your function in a decorator:

from numba import autojit

@autojit
def function_to_speed_up(...):
    pass

In the near future (2-4 weeks), numba will grow the experimental ability to basically replace all your function calls with @autojit versions in a Python function. I would love to see something like this work:

python -m numba filename.py

to get an effective autojit on all the filename.py functions (and optionally on all Python modules it imports). The autojit works out of the box today --- you can get Numba from PyPI (or inside of the completely free Anaconda CE) to try it out.

Best,

-Travis
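[A minimal sketch of the subclass idea described above. It assumes only that integer indexing can defer to item(); the class name and the handling of non-integer indices are made up for illustration and are not an agreed-upon design.]

import numpy as np

class FastScalarArray(np.ndarray):
    def __getitem__(self, index):
        if isinstance(index, int):
            # item() returns a native Python float/int instead of a
            # NumPy scalar such as float64
            return self.item(index)
        return np.ndarray.__getitem__(self, index)

a = np.array([1., 2., 3.]).view(FastScalarArray)
print(type(a[0]))   # built-in float rather than numpy.float64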
On 02/12/2012 8:31 PM, Travis Oliphant wrote:
Raul,
This is *fantastic work*. While many optimizations were done 6 years ago as people started to convert their code, that kind of report has trailed off in the last few years. I have not seen this kind of speed-comparison for some time --- but I think it's definitely beneficial.
I'll clean it up a bit as a macro and comment it.
NumPy still has quite a bit that can be optimized. I think your example is really great. Perhaps it's worth making a C-API macro out of the short-cut to the attribute string so it can be used by others. It would be interesting to see where your other slow-downs are. I would be interested to see if the slow-math of float64 is hurting you. It would be possible, for example, to do a simple subclass of the ndarray that overloads a[<integer>] to be the same as array.item(<integer>). The latter syntax returns python objects (i.e. floats) instead of array scalars.
Also, it would not be too difficult to add fast-math paths for int64, float32, and float64 scalars (so they don't go through ufuncs but do scalar-math like the float and int objects in Python).
Thanks. I'll dig a bit more into the code.
A related thing we've been working on lately which might help you is Numba which might help speed up functions that have code like: "a[0] < 4" : http://numba.pydata.org.
Numba will translate the expression a[0] < 4 to a machine-code address-lookup and math operation which is *much* faster when a is a NumPy array. Presently this requires you to wrap your function call in a decorator:
from numba import autojit
@autojit
def function_to_speed_up(...):
    pass
In the near future (2-4 weeks), numba will grow the experimental ability to basically replace all your function calls with @autojit versions in a Python function. I would love to see something like this work:
python -m numba filename.py
To get an effective autojit on all the filename.py functions (and optionally on all python modules it imports). The autojit works out of the box today --- you can get Numba from PyPI (or inside of the completely free Anaconda CE) to try it out.
This looks very interesting. Will check it out.
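[For concreteness, a self-contained example of the decorator usage quoted above. It assumes the numba autojit API of that era as described in Travis's message (later numba releases use numba.jit instead); the function and values below are made up purely for illustration.]

import numpy as np
from numba import autojit

@autojit
def count_below(a, threshold):
    # plain Python loop over array elements; numba attempts to compile it
    # to machine code so the per-element scalar overhead disappears
    n = 0
    for i in range(a.shape[0]):
        if a[i] < threshold:
            n += 1
    return n

a = np.array([1., 2., 3.])
print(count_below(a, 35.0))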
On Mon, Dec 3, 2012 at 1:28 AM, Raul Cota wrote:
I finally decided to track down the problem, and I started by getting Python 2.6 from source and profiling it on one of my cases. By far the biggest bottleneck came out to be PyString_FromFormatV, which is the function that assembles the message string for the Python error raised when "multiarray" calls PyObject_GetAttrString and fails to find the attribute. This function seems to get called way too often from NumPy. The real cost of looking up an attribute that does not exist is not the failed lookup itself, but building a string to set a Python error. In other words, something as simple as "a[0] < 3.5" internally results in a call to set a Python error.
I downloaded NumPy code (for Python 2.6) and tracked down all the calls like this,
ret = PyObject_GetAttrString(obj, "__array_priority__");
and changed to

if (PyList_CheckExact(obj) || (Py_None == obj) ||
    PyTuple_CheckExact(obj) || PyFloat_CheckExact(obj) ||
    PyInt_CheckExact(obj) || PyString_CheckExact(obj) ||
    PyUnicode_CheckExact(obj)) {
    //Avoid expensive calls when I am sure the attribute
    //does not exist
    ret = NULL;
}
else {
    ret = PyObject_GetAttrString(obj, "__array_priority__");
}

(I think I found about 7 spots.)
If the problem is the exception construction, then maybe this would work about as well?

if (PyObject_HasAttrString(obj, "__array_priority__")) {
    ret = PyObject_GetAttrString(obj, "__array_priority__");
}
else {
    ret = NULL;
}

If so then it would be an easier and more reliable way to accomplish this.
I also noticed (not as bad in my case) that calls to PyObject_GetBuffer also resulted in Python errors being set, making code unnecessarily slower.

With this change, something like this,

for i in xrange(1000000):
    if a[1] < 35.0:
        pass

went down from 0.8 seconds to 0.38 seconds.
Huh, why is PyObject_GetBuffer even getting called in this case?
A bogus test like this,

for i in xrange(1000000):
    a = array([1., 2., 3.])

went down from 8.5 seconds to 2.5 seconds.
I can see why we'd call PyObject_GetBuffer in this case, but not why it would take 2/3rds of the total run-time...
- The core of my problems, I think, boils down to things like s = a[0] assigning a float64 to s as opposed to a native float. Is there any way to hack the code so that it extracts a native float instead? (Probably crazy talk, but I thought I'd ask :) .) I'd prefer not to use s = a.item(0) because I would have to change too much code and it is not even that much faster. For example,

for i in xrange(1000000):
    if a.item(1) < 35.0:
        pass

takes 0.23 seconds (as opposed to 0.38 seconds with my suggested changes).
I'm confused here -- first you say that your problems would be fixed if a[0] gave you a native float, but then you say that a.item(0) (which is basically a[0] that gives a native float) is still too slow? (OTOH a 40% speedup is pretty good, even if it is just a microbenchmark :-).) Array scalars are definitely pretty slow:

In [9]: timeit a[0]
1000000 loops, best of 3: 151 ns per loop

In [10]: timeit a.item(0)
10000000 loops, best of 3: 169 ns per loop

In [11]: timeit a[0] < 35.0
1000000 loops, best of 3: 989 ns per loop

In [12]: timeit a.item(0) < 35.0
1000000 loops, best of 3: 233 ns per loop

It is probably possible to make numpy scalars faster... I'm not even sure why they go through the ufunc machinery, like Travis said, since they don't even follow the ufunc rules:

In [3]: np.array(2) * [1, 2, 3]  # 0-dim array coerces and broadcasts
Out[3]: array([2, 4, 6])

In [4]: np.array(2)[()] * [1, 2, 3]  # scalar acts like python integer
Out[4]: [1, 2, 3, 1, 2, 3]

But you may want to experiment a bit more to make sure this is actually the problem. IME guesses about speed problems are almost always wrong (even when I take this rule into account and only guess when I'm *really* sure).

-n
On Mon, Dec 3, 2012 at 6:14 AM, Nathaniel Smith wrote:
It is probably possible to make numpy scalars faster... I'm not even sure why they go through the ufunc machinery, like Travis said, since they don't even follow the ufunc rules:
In [3]: np.array(2) * [1, 2, 3]  # 0-dim array coerces and broadcasts
Out[3]: array([2, 4, 6])

In [4]: np.array(2)[()] * [1, 2, 3]  # scalar acts like python integer
Out[4]: [1, 2, 3, 1, 2, 3]
I thought it still behaves like a numpy "animal"
>>> np.array(-2)[()] ** [1, 2, 3]
array([-2, 4, -8])
>>> np.array(-2)[()] ** 0.5
nan

>>> np.array(-2).item() ** [1, 2, 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'int' and 'list'
>>> np.array(-2).item() ** 0.5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: negative number cannot be raised to a fractional power

>>> np.array(0)[()] ** (-1)
inf
>>> np.array(0).item() ** (-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: 0.0 cannot be raised to a negative power
and similar. I often try to avoid Python scalars to avoid "surprising" behavior, and try to work defensively, or have fixed bugs by switching to np.power(...) (for example in the distributions).

Josef
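[A small sketch of the contrast Josef describes, with illustrative values only: the NumPy scalar (and np.power) quietly produces nan where, on Python 2 as in the session above, the native float raises.]

import numpy as np

print(np.float64(-2.0) ** 0.5)    # nan (NumPy scalar; possibly a RuntimeWarning, no exception)
print(np.power(-2.0, 0.5))        # nan as well; np.power gives the NumPy behavior
                                  # regardless of input type
# (-2.0) ** 0.5 raises ValueError on Python 2 (as quoted above);
# on Python 3 it returns a complex number instead.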
On 03/12/2012 4:14 AM, Nathaniel Smith wrote:
If the problem is the exception construction, then maybe this would work about as well?
if (PyObject_HasAttrString(obj, "__array_priority__")) {
    ret = PyObject_GetAttrString(obj, "__array_priority__");
}
else {
    ret = NULL;
}
If so then it would be an easier and more reliable way to accomplish this.
I did think of that one, but at least in Python 2.6 the implementation is just a wrapper around PyObject_GetAttrString that clears the error:

"""
PyObject_HasAttrString(PyObject *v, const char *name)
{
    PyObject *res = PyObject_GetAttrString(v, name);
    if (res != NULL) {
        Py_DECREF(res);
        return 1;
    }
    PyErr_Clear();
    return 0;
}
"""

so it is just as bad when it fails, and a waste when it succeeds (it will end up finding the attribute twice).

In my opinion, Python's source code should offer a version of PyObject_GetAttrString that does not raise an error, but that is a completely different topic.
I also noticed (not as bad in my case) that calls to PyObject_GetBuffer also resulted in Python errors being set, making code unnecessarily slower.

With this change, something like this,

for i in xrange(1000000):
    if a[1] < 35.0:
        pass

went down from 0.8 seconds to 0.38 seconds.

Huh, why is PyObject_GetBuffer even getting called in this case?
Sorry for being misleading in an already long and confusing email. PyObject_GetBuffer is not getting called when doing an "if" comparison. This call showed up in my profiler as a time-consuming task that raised Python errors unnecessarily (not nearly as often as PyObject_GetAttrString), but since I was already there I decided to look into it as well. The point I was trying to make was that I did both changes (avoiding PyObject_GetBuffer and PyObject_GetAttrString) when I came up with the times.
A bogus test like this,

for i in xrange(1000000):
    a = array([1., 2., 3.])

went down from 8.5 seconds to 2.5 seconds.

I can see why we'd call PyObject_GetBuffer in this case, but not why it would take 2/3rds of the total run-time...
Same scenario. This total time includes both changes (avoiding PyObject_GetBuffer and PyObject_GetAttrString). If memory serves, PyObject_GetBuffer gets called once for every 9 calls to PyObject_GetAttrString in this scenario.
- The core of my problems, I think, boils down to things like s = a[0] assigning a float64 to s as opposed to a native float. Is there any way to hack the code so that it extracts a native float instead? (Probably crazy talk, but I thought I'd ask :) .) I'd prefer not to use s = a.item(0) because I would have to change too much code and it is not even that much faster. For example,

for i in xrange(1000000):
    if a.item(1) < 35.0:
        pass

takes 0.23 seconds (as opposed to 0.38 seconds with my suggested changes).

I'm confused here -- first you say that your problems would be fixed if a[0] gave you a native float, but then you say that a.item(0) (which is basically a[0] that gives a native float) is still too slow?
Don't get me wrong, I am confused too when it gets beyond my suggested changes :) . My "theory" for saying that a.item(1) is not the same as a[1] returning a float was that perhaps the overhead of the dot operator is too big.

At the end of the day, I do want to profile NumPy and find out if there is anything I can do to speed things up. To bring things more into context, I don't really care about speeding up a bogus loop with if statements. My bottom line is:

- I am focusing on two cases from our software that take 141.8 seconds and 40 seconds respectively using Numeric and Python 2.2.3.

- These cases now take 229 seconds and 62 seconds respectively using NumPy and Python 2.6. This is quite a slowdown, taking into account that Python code using only native objects is quite a bit faster in Python 2.6 vs Python 2.2.

Both cases (like most of our software) use array operations as much as possible and revert to scalar operations when it is not practical to do otherwise. I am not saying it is impossible to optimize even more, it is just not practical.

I ran the profiler on Python 2.6 and found the bottlenecks I reported in this email. Both of my cases are now running at 170 and 50 seconds respectively. In other words, I am "almost" back to where I want to be. The improvement is huge, but in my opinion it is still uncomfortably far from what it used to be with Numeric, and I worry that there may be other spots in our software that are affected in a more meaningful way that I just have not noticed.
But you may want to experiment a bit more to make sure this is actually the problem. IME guesses about speed problems are almost always wrong (even when I take this rule into account and only guess when I'm *really* sure).
I agree 100% about the pitfalls of guessing. Thanks to Christoph's suggestion I should be able to profile NumPy now. Thanks for your comments, Raul
Raul,

Thanks for doing this work -- both the profiling and actual suggestions for how to improve the code -- whoo hoo!

In general, it seems that numpy performance for scalars and very small arrays (i.e. (2,), (3,), maybe (3,3) -- the kind of thing that you'd use to hold a coordinate point or the like, not small as in "fits in cache") is pretty slow. In principle, a basic array scalar operation could be as fast as a numpy native numeric type, and it would be great if small array operations were, too.

It may be that the route to those performance improvements is special-case code, which is ugly, but I think it could really be worth it for the common types and operations.

I'm really out of my depth for suggesting (or contributing) actual solutions, but +1 for the idea!

-Chris

NOTE: Here's an example of what I'm talking about -- say you are scaling an (x,y) point by a (s_x, s_y) scale factor:

def numpy_version(point, scale):
    return point * scale

def tuple_version(point, scale):
    return (point[0] * scale[0], point[1] * scale[1])

In [36]: point_arr, scale_arr
Out[36]: (array([ 3., 5.]), array([ 2., 3.]))

In [37]: timeit tuple_version(point, scale)
1000000 loops, best of 3: 397 ns per loop

In [38]: timeit numpy_version(point_arr, scale_arr)
100000 loops, best of 3: 2.32 us per loop

It would be great if numpy could get closer to tuple performance for this sort of thing...

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception

Chris.Barker@noaa.gov
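[A script form of the same comparison, for anyone who wants to rerun it outside IPython. This is only a sketch: the tuple values for point and scale are assumed to mirror point_arr and scale_arr, and absolute numbers will of course differ by machine.]

import timeit

setup = """
import numpy as np

point = (3.0, 5.0)
scale = (2.0, 3.0)
point_arr = np.array([3., 5.])
scale_arr = np.array([2., 3.])

def numpy_version(point, scale):
    return point * scale

def tuple_version(point, scale):
    return (point[0] * scale[0], point[1] * scale[1])
"""

print("tuple :", timeit.timeit("tuple_version(point, scale)", setup=setup, number=100000))
print("numpy :", timeit.timeit("numpy_version(point_arr, scale_arr)", setup=setup, number=100000))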
Chris,

Thanks for the feedback. FYI, the minor changes I talked about give different performance enhancements depending on the scenario, e.g.,

1) Array * Array

point = array([2.0, 3.0])
scale = array([2.4, 0.9])
retVal = point * scale
# The line above runs 1.1 times faster with my new code
# (but it runs 3 times faster in Numeric in Python 2.2)
# i.e. pretty meaningless, but still far from old Numeric

2) Array * Tuple (item by item)

point = array([2.0, 3.0])
scale = (2.4, 0.9)
retVal = point[0] < scale[0], point[1] < scale[1]
# The line above runs 1.8 times faster with my new code
# (but it runs 6.8 times faster in Numeric in Python 2.2)
# i.e. a pretty decent speed up, but quite far from old Numeric

I am not saying that I would ever do something exactly like (2) in my code, nor am I saying that the changes in NumPy vs Numeric are not beneficial. My point is that performance on small problems is fairly far from what it used to be in Numeric, particularly when dealing with scalars, and it is problematic, at least to me.

I am currently looking around to see if there are practical ways to speed things up without slowing anything else down. Will keep you posted.

regards,

Raul
participants (6)

- Chris Barker - NOAA Federal
- Christoph Gohlke
- josef.pktd@gmail.com
- Nathaniel Smith
- Raul Cota
- Travis Oliphant