[Numpy-discussion] Speed bottlenecks on simple tasks - suggested improvement

Raul Cota raul at virtualmaterials.com
Mon Dec 3 11:26:39 EST 2012


On 03/12/2012 4:14 AM, Nathaniel Smith wrote:
> On Mon, Dec 3, 2012 at 1:28 AM, Raul Cota <raul at virtualmaterials.com> wrote:
>> I finally decided to track down the problem and I started by getting
>> Python 2.6 from source and profiling it in one of my cases. By far the
>> biggest bottleneck came out to be PyString_FromFormatV which is a
>> function to assemble a string for a Python error caused by a failure to
>> find an attribute when "multiarray" calls PyObject_GetAttrString. This
>> function seems to get called way too often from NumPy. The real
>> bottleneck of trying to find the attribute when it does not exist is not
>> that it fails to find it, but that it builds a string to set a Python
>> error. In other words, something as simple as "a[0] < 3.5" internally
>> results in a call to set a Python error.
>>
>> I downloaded NumPy code (for Python 2.6) and tracked down all the calls
>> like this,
>>
>>    ret = PyObject_GetAttrString(obj, "__array_priority__");
>>
>> and changed to
>>       if (PyList_CheckExact(obj) ||  (Py_None == obj) ||
>>           PyTuple_CheckExact(obj) ||
>>           PyFloat_CheckExact(obj) ||
>>           PyInt_CheckExact(obj) ||
>>           PyString_CheckExact(obj) ||
>>           PyUnicode_CheckExact(obj)){
>>           //Avoid expensive calls when I am sure the attribute
>>           //does not exist
>>           ret = NULL;
>>       }
>>       else{
>>           ret = PyObject_GetAttrString(obj, "__array_priority__");
>>       }
>>
>> ( I think I found about 7 spots )
> If the problem is the exception construction, then maybe this would
> work about as well?
>
> if (PyObject_HasAttrString(obj, "__array_priority__")) {
>      ret = PyObject_GetAttrString(obj, "__array_priority__");
> } else {
>      ret = NULL;
> }
>
> If so then it would be an easier and more reliable way to accomplish this.

I did think of that one, but at least in Python 2.6 the implementation is 
just a wrapper around PyObject_GetAttrString that clears the error:

"""
PyObject_HasAttrString(PyObject *v, const char *name)
{
     PyObject *res = PyObject_GetAttrString(v, name);
     if (res != NULL) {
         Py_DECREF(res);
         return 1;
     }
     PyErr_Clear();
     return 0;
}
"""

so it is just as slow when the attribute is missing, and a waste when it 
succeeds (the lookup is performed twice).
In my opinion, Python's C API should offer a version of 
PyObject_GetAttrString that does not raise an error, but that is a 
completely different topic.
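For what it's worth, the single-lookup pattern I have in mind can be
sketched at the Python level (the helper name here is hypothetical; the
real fix would live in CPython's C API):

```python
def getattr_or_none(obj, name):
    # Hypothetical helper: one attribute lookup, with the AttributeError
    # (and its costly message string) raised and swallowed only once --
    # unlike a hasattr()+getattr() pair, which performs the lookup twice.
    try:
        return getattr(obj, name)
    except AttributeError:
        return None

class NoPriority(object):
    """Stand-in for an object without __array_priority__."""
    pass

print(getattr_or_none(NoPriority(), "__array_priority__"))  # None
print(getattr_or_none(3.5, "__class__") is float)           # True
```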


>> I also noticed (not as bad in my case) that calls to PyObject_GetBuffer
>> also resulted in Python errors being set thus unnecessarily slower code.
>>
>> With this change, something like this,
>>       for i in xrange(1000000):
>>           if a[1] < 35.0:
>>               pass
>>
>> went down from 0.8 seconds to 0.38 seconds.
> Huh, why is PyObject_GetBuffer even getting called in this case?

Sorry for being misleading in an already long and confusing email.
PyObject_GetBuffer is not called by the "if" test. It showed up in my 
profiler as a time-consuming call that raised Python errors unnecessarily 
(not nearly as often as PyObject_GetAttrString), but since I was already 
there I decided to look into it as well.


The point I was trying to make was that I had made both changes (avoiding 
PyObject_GetBuffer and PyObject_GetAttrString) when I came up with the times.


>> A bogus test like this,
>>       for i in xrange(1000000):
>>           a = array([1., 2., 3.])
>>
>> went down from 8.5 seconds to 2.5 seconds.
> I can see why we'd call PyObject_GetBuffer in this case, but not why
> it would take 2/3rds of the total run-time...

Same scenario. This total time includes both changes (avoiding 
PyObject_GetBuffer, PyObject_GetAttrString).
If memory serves, I believe PyObject_GetBuffer gets called once for 
every nine calls to PyObject_GetAttrString in this scenario.


>> - I think the core of my problems boils down to things like this:
>> s = a[0]
>> which assigns a float64 to s as opposed to a native float?
>> Is there any way to hack the code to make it extract a native float
>> instead? (probably crazy talk, but I thought I'd ask :) ).
>> I'd prefer to not use s = a.item(0) because I would have to change too
>> much code and it is not even that much faster. For example,
>>       for i in xrange(1000000):
>>           if a.item(1) < 35.0:
>>               pass
>> is 0.23 seconds (as opposed to 0.38 seconds with my suggested changes)
> I'm confused here -- first you say that your problems would be fixed
> if a[0] gave you a native float, but then you say that a.item(0)
> (which is basically a[0] that gives a native float) is still too slow?

Don't get me wrong, I am confused too when it gets beyond my suggested 
changes :) . My "theory" for why a.item(1) is not the same as a[1] 
returning a native float was that perhaps the overhead of the attribute 
lookup (the dot operator) is too big.
At the end of the day, I do want to profile NumPy and find out if there 
is anything I can do to speed things up.
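For reference, the distinction I mean between the array scalar that
indexing returns and the native float that .item() returns can be seen
directly (sketched with current NumPy; `np` is the usual import alias):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Indexing returns a NumPy array scalar (np.float64), not a plain float,
# while .item() unboxes to a native Python float.
print(type(a[0]))       # <class 'numpy.float64'>
print(type(a.item(0)))  # <class 'float'>

# np.float64 subclasses Python's float, so comparisons still work, but
# the array scalar carries extra machinery that makes it slower to create.
assert isinstance(a[0], float)
assert type(a.item(0)) is float
```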

To bring things more into context, I don't really care to speed up a 
bogus loop with if statements.
My bottom line is,
- I am focusing on two cases from our software that take 141.8 seconds 
and 40 seconds respectively using Numeric and Python 2.2.3.
- These cases now take 229 seconds and 62 seconds respectively using 
NumPy and Python 2.6. This is quite a slowdown, taking into account that 
Python code using only native objects is quite a bit faster in Python 2.6 
vs. Python 2.2.

Both cases (like most of our software) use array operations as much as 
possible and revert to scalar operations only when it is not practical 
to do otherwise. I am not saying it is impossible to optimize even 
more; it is just not practical.

I ran the profiler on Python 2.6 and I found the bottlenecks I reported 
in this email. Both of my cases are now running at 170 and 50 seconds 
respectively. In other words, I am "almost" back to where I want to be.

The improvement is huge, but in my opinion it is still uncomfortably far 
from what it used to be with Numeric, and I worry that there may be other 
spots in our software affected in a more meaningful way that I just have 
not noticed.



> (OTOH a 40% speedup is pretty good, even if it is just a
> microbenchmark :-).) Array scalars are definitely pretty slow:
>
> In [9]: timeit a[0]
> 1000000 loops, best of 3: 151 ns per loop
>
> In [10]: timeit a.item(0)
> 10000000 loops, best of 3: 169 ns per loop
>
> In [11]: timeit a[0] < 35.0
> 1000000 loops, best of 3: 989 ns per loop
>
> In [12]: timeit a.item(0) < 35.0
> 1000000 loops, best of 3: 233 ns per loop
>
> It is probably possible to make numpy scalars faster... I'm not even
> sure why they go through the ufunc machinery, like Travis said, since
> they don't even follow the ufunc rules:
>
> In [3]: np.array(2) * [1, 2, 3]  # 0-dim array coerces and broadcasts
> Out[3]: array([2, 4, 6])
>
> In [4]: np.array(2)[()] * [1, 2, 3]  # scalar acts like python integer
> Out[4]: [1, 2, 3, 1, 2, 3]
>
> But you may want to experiment a bit more to make sure this is
> actually the problem. IME guesses about speed problems are almost
> always wrong (even when I take this rule into account and only guess
> when I'm *really* sure).

I agree 100% about the pitfalls of guessing.
Thanks to Christoph's suggestion I should be able to profile NumPy now.
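To keep the guessing honest, the microbenchmark quoted above is easy to
re-run with timeit (absolute numbers will differ per machine and NumPy
version; the relative gap between the array-scalar comparison and
.item() is what matters):

```python
import timeit
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Time the two forms of the scalar comparison from the quoted benchmark.
t_scalar = timeit.timeit("a[0] < 35.0", globals={"a": a}, number=100000)
t_item = timeit.timeit("a.item(0) < 35.0", globals={"a": a}, number=100000)

print("a[0] < 35.0      : %.4f s" % t_scalar)
print("a.item(0) < 35.0 : %.4f s" % t_item)
```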



Thanks for your comments,

Raul






> -n
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion



