[Numpy-discussion] Speed bottlenecks on simple tasks - suggested improvement
Raul Cota
raul at virtualmaterials.com
Mon Dec 3 11:26:39 EST 2012
On 03/12/2012 4:14 AM, Nathaniel Smith wrote:
> On Mon, Dec 3, 2012 at 1:28 AM, Raul Cota <raul at virtualmaterials.com> wrote:
>> I finally decided to track down the problem and I started by getting
>> Python 2.6 from source and profiling it in one of my cases. By far the
>> biggest bottleneck came out to be PyString_FromFormatV which is a
>> function to assemble a string for a Python error caused by a failure to
>> find an attribute when "multiarray" calls PyObject_GetAttrString. This
>> function seems to get called way too often from NumPy. The real
>> bottleneck of trying to find the attribute when it does not exist is not
>> that it fails to find it, but that it builds a string to set a Python
>> error. In other words, something as simple as "a[0] < 3.5" internally
>> results in a call to set a Python error.
>>
>> I downloaded NumPy code (for Python 2.6) and tracked down all the calls
>> like this,
>>
>> ret = PyObject_GetAttrString(obj, "__array_priority__");
>>
>> and changed to
>> if (PyList_CheckExact(obj) || (Py_None == obj) ||
>>     PyTuple_CheckExact(obj) ||
>>     PyFloat_CheckExact(obj) ||
>>     PyInt_CheckExact(obj) ||
>>     PyString_CheckExact(obj) ||
>>     PyUnicode_CheckExact(obj)) {
>>     /* Avoid expensive calls when I am sure the attribute
>>        does not exist */
>>     ret = NULL;
>> }
>> else {
>>     ret = PyObject_GetAttrString(obj, "__array_priority__");
>> }
>>
>> (I think I found about 7 spots.)
> If the problem is the exception construction, then maybe this would
> work about as well?
>
> if (PyObject_HasAttrString(obj, "__array_priority__")) {
>     ret = PyObject_GetAttrString(obj, "__array_priority__");
> } else {
>     ret = NULL;
> }
>
> If so then it would be an easier and more reliable way to accomplish this.
I did think of that one, but at least in Python 2.6 the implementation is
just a wrapper around PyObject_GetAttrString that clears the error:
"""
PyObject_HasAttrString(PyObject *v, const char *name)
{
    PyObject *res = PyObject_GetAttrString(v, name);
    if (res != NULL) {
        Py_DECREF(res);
        return 1;
    }
    PyErr_Clear();
    return 0;
}
"""
so it is just as bad when it fails and a waste when it succeeds (it will
end up finding it twice).
In my opinion, Python's C API should offer a version of
PyObject_GetAttrString that does not raise an error, but that is a
completely different topic.
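The same trade-off shows up at the Python level. A minimal sketch of the
double-lookup problem (the function names and the WithPriority class are
mine, purely illustrative, not NumPy code):

```python
# hasattr() performs the very same attribute lookup as getattr(), so the
# check-then-fetch pattern below searches the type twice when the
# attribute exists, and still pays for a raised-and-cleared
# AttributeError internally when it does not:
def priority_double_lookup(obj):
    if hasattr(obj, "__array_priority__"):
        return obj.__array_priority__
    return None

# A three-argument getattr() does a single lookup either way (whether it
# also avoids constructing the exception object internally is an
# interpreter implementation detail):
def priority_single_lookup(obj):
    return getattr(obj, "__array_priority__", None)

class WithPriority(object):
    __array_priority__ = 10.0
```

Both functions agree on the result; the second just avoids doing the
lookup twice on the success path.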
>> I also noticed (not as bad in my case) that calls to PyObject_GetBuffer
>> also resulted in Python errors being set thus unnecessarily slower code.
>>
>> With this change, something like this,
>> for i in xrange(1000000):
>>     if a[1] < 35.0:
>>         pass
>>
>> went down from 0.8 seconds to 0.38 seconds.
> Huh, why is PyObject_GetBuffer even getting called in this case?
Sorry for being misleading in an already long and confusing email.
PyObject_GetBuffer is not getting called during the "if" test. This call
showed up in my profiler as a time-consuming task that raised Python
errors unnecessarily (not nearly as often as PyObject_GetAttrString),
but since I was already there I decided to look into it as well.
The point I was trying to make was that I did both changes (avoiding
PyObject_GetBuffer and PyObject_GetAttrString) when I came up with the
times.
>> A bogus test like this,
>> for i in xrange(1000000):
>>     a = array([1., 2., 3.])
>>
>> went down from 8.5 seconds to 2.5 seconds.
> I can see why we'd call PyObject_GetBuffer in this case, but not why
> it would take 2/3rds of the total run-time...
Same scenario. This total time includes both changes (avoiding
PyObject_GetBuffer, PyObject_GetAttrString).
If memory serves, I believe PyObject_GetBuffer gets called once for
every nine calls to PyObject_GetAttrString in this scenario.
>> - The core of my problems, I think, boils down to things like this:
>> s = a[0]
>> assigning a float64 into s as opposed to a native float.
>> Is there any way to hack code to change it to extract a native float
>> instead ? (probably crazy talk, but I thought I'd ask :) ).
>> I'd prefer to not use s = a.item(0) because I would have to change too
>> much code and it is not even that much faster. For example,
>> for i in xrange(1000000):
>>     if a.item(1) < 35.0:
>>         pass
>> is 0.23 seconds (as opposed to 0.38 seconds with my suggested changes)
> I'm confused here -- first you say that your problems would be fixed
> if a[0] gave you a native float, but then you say that a.item(0)
> (which is basically a[0] that gives a native float) is still too slow?
Don't get me wrong, I am confused too once it gets beyond my suggested
changes :) . My "theory" for why a.item(1) is not the same as a[1]
returning a native float was that perhaps the overhead of the dot
operator is too big.
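That theory can at least be tested in isolation with a pure-Python
stand-in (FakeArray and the timings are mine, purely illustrative):
pre-binding the method removes the per-call attribute lookup, so the gap
between the two timings approximates the cost of the dot itself.

```python
import timeit

class FakeArray(object):
    """Pure-Python stand-in for an object exposing an item() accessor."""
    def item(self, i):
        return 2.0  # pretend element access returning a native float

a = FakeArray()
item = a.item  # bind the method once, outside the loop

# a.item(1) pays for an attribute lookup on every call; item(1) does not
t_dot = timeit.timeit(lambda: a.item(1), number=100000)
t_bound = timeit.timeit(lambda: item(1), number=100000)
```

If the bound-method loop is noticeably faster, the dot operator's lookup
is a real part of the per-element cost.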
At the end of the day, I do want to profile NumPy and find out if there
is anything I can do to speed things up.
To bring things more into context, I don't really care to speed up a
bogus loop with if statements.
My bottom line is,
- I am focusing on two cases from our software that take 141.8 seconds
and 40 seconds respectively using Numeric and Python 2.2.3.
- These cases now take 229 seconds and 62 seconds respectively using
NumPy and Python 2.6. This is quite a slowdown, taking into account
that Python code using only native objects is quite a bit faster in
Python 2.6 vs. Python 2.2.
Both cases (like most of our software) use array operations as much as
possible and revert down to scalar operations when it is not practical
to do otherwise. I am not saying it is impossible to optimize even
more, it is just not practical.
I ran the profiler on Python 2.6 and I found the bottlenecks I reported
in this email. Both of my cases are now running at 170 and 50 seconds
respectively. In other words, I am "almost" back to where I want to be.
The improvement is huge, but in my opinion it is still uncomfortably far
from what it used to be with Numeric, and I worry that there may be
other spots in our software that are affected in a more meaningful way
than I have noticed.
> (OTOH a 40% speedup is pretty good, even if it is just a
> microbenchmark :-).) Array scalars are definitely pretty slow:
>
> In [9]: timeit a[0]
> 1000000 loops, best of 3: 151 ns per loop
>
> In [10]: timeit a.item(0)
> 10000000 loops, best of 3: 169 ns per loop
>
> In [11]: timeit a[0] < 35.0
> 1000000 loops, best of 3: 989 ns per loop
>
> In [12]: timeit a.item(0) < 35.0
> 1000000 loops, best of 3: 233 ns per loop
>
> It is probably possible to make numpy scalars faster... I'm not even
> sure why they go through the ufunc machinery, like Travis said, since
> they don't even follow the ufunc rules:
>
> In [3]: np.array(2) * [1, 2, 3] # 0-dim array coerces and broadcasts
> Out[3]: array([2, 4, 6])
>
> In [4]: np.array(2)[()] * [1, 2, 3] # scalar acts like python integer
> Out[4]: [1, 2, 3, 1, 2, 3]
>
> But you may want to experiment a bit more to make sure this is
> actually the problem. IME guesses about speed problems are almost
> always wrong (even when I take this rule into account and only guess
> when I'm *really* sure).
I agree 100% about the pitfalls of guessing.
Thanks to Christoph's suggestion I should be able to profile NumPy now.
Thanks for your comments,
Raul
> -n
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion