[Numpy-discussion] Speed bottlenecks on simple tasks - suggested improvement

Nathaniel Smith njs at pobox.com
Mon Dec 3 06:14:13 EST 2012


On Mon, Dec 3, 2012 at 1:28 AM, Raul Cota <raul at virtualmaterials.com> wrote:
> I finally decided to track down the problem and I started by getting
> Python 2.6 from source and profiling it in one of my cases. By far the
> biggest bottleneck came out to be PyString_FromFormatV which is a
> function to assemble a string for a Python error caused by a failure to
> find an attribute when "multiarray" calls PyObject_GetAttrString. This
> function seems to get called way too often from NumPy. The real
> bottleneck of trying to find the attribute when it does not exist is not
> that it fails to find it, but that it builds a string to set a Python
> error. In other words, something as simple as "a[0] < 3.5" internally
> results in a call to set a Python error.
>
> I downloaded the NumPy source (for Python 2.6) and tracked down all
> the calls like this,
>
>   ret = PyObject_GetAttrString(obj, "__array_priority__");
>
> and changed them to
>      if (PyList_CheckExact(obj) ||  (Py_None == obj) ||
>          PyTuple_CheckExact(obj) ||
>          PyFloat_CheckExact(obj) ||
>          PyInt_CheckExact(obj) ||
>          PyString_CheckExact(obj) ||
>          PyUnicode_CheckExact(obj)){
>          //Avoid expensive calls when I am sure the attribute
>          //does not exist
>          ret = NULL;
>      }
>      else{
>          ret = PyObject_GetAttrString(obj, "__array_priority__");
>      }
>
> (I think I found about 7 spots)
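
As a rough Python-level illustration of the cost being described above
(an untested sketch, not from the profile; exact numbers will vary), a
failed attribute lookup has to construct the AttributeError and format
its message, while an exact type check never touches the error
machinery at all:

import timeit

setup = """
class A(object):
    pass
a = A()
"""

# getattr() with a default still builds (and then discards) the
# AttributeError, message string included
print timeit.timeit("getattr(a, '__array_priority__', None)", setup=setup)

# an exact type check never enters the error machinery
print timeit.timeit("type(a) is float", setup=setup)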

If the problem is the exception construction, then maybe this would
work about as well?

if (PyObject_HasAttrString(obj, "__array_priority__")) {
    ret = PyObject_GetAttrString(obj, "__array_priority__");
} else {
    ret = NULL;
}

If so, then it would be an easier and more reliable way to accomplish this.
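
Though it would be worth checking whether the hasattr path actually
short-circuits the error construction rather than just building the
exception and swallowing it. A quick, untested probe from the Python
level -- if the two timings come out nearly identical, hasattr() is
paying for the exception too:

import timeit

setup = """
class A(object):
    pass
a = A()
"""

# hasattr() on a missing attribute
print timeit.timeit("hasattr(a, '__array_priority__')", setup=setup)

# failed lookup that constructs and then discards the AttributeError
print timeit.timeit("getattr(a, '__array_priority__', None)", setup=setup)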

> I also noticed (not as bad in my case) that calls to PyObject_GetBuffer
> also resulted in Python errors being set, and thus in unnecessarily
> slower code.
>
> With this change, something like this,
>      for i in xrange(1000000):
>          if a[1] < 35.0:
>              pass
>
> went down from 0.8 seconds to 0.38 seconds.

Huh, why is PyObject_GetBuffer even getting called in this case?

> A bogus test like this,
>      for i in xrange(1000000):
>          a = array([1., 2., 3.])
>
> went down from 8.5 seconds to 2.5 seconds.

I can see why we'd call PyObject_GetBuffer in this case, but not why
it would take 2/3rds of the total run-time...

> - The core of my problems, I think, boils down to things like this:
> s = a[0]
> which assigns a float64 to s as opposed to a native float.
> Is there any way to hack the code to extract a native float
> instead? (probably crazy talk, but I thought I'd ask :) ).
> I'd prefer not to use s = a.item(0) because I would have to change too
> much code and it is not even that much faster. For example,
>      for i in xrange(1000000):
>          if a.item(1) < 35.0:
>              pass
> is 0.23 seconds (as opposed to 0.38 seconds with my suggested changes)

I'm confused here -- first you say that your problems would be fixed
if a[0] gave you a native float, but then you say that a.item(0)
(which is basically a[0] that gives a native float) is still too slow?
(OTOH a 40% speedup is pretty good, even if it is just a
microbenchmark :-).) Array scalars are definitely pretty slow:

In [9]: timeit a[0]
1000000 loops, best of 3: 151 ns per loop

In [10]: timeit a.item(0)
10000000 loops, best of 3: 169 ns per loop

In [11]: timeit a[0] < 35.0
1000000 loops, best of 3: 989 ns per loop

In [12]: timeit a.item(0) < 35.0
1000000 loops, best of 3: 233 ns per loop

It is probably possible to make numpy scalars faster... I'm not even
sure why they go through the ufunc machinery, like Travis said, since
they don't even follow the ufunc rules:

In [3]: np.array(2) * [1, 2, 3]  # 0-dim array coerces and broadcasts
Out[3]: array([2, 4, 6])

In [4]: np.array(2)[()] * [1, 2, 3]  # scalar acts like python integer
Out[4]: [1, 2, 3, 1, 2, 3]
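
The difference is visible in the types; a quick, untested check:

import numpy as np

# indexing a 0-dim array with () unwraps it into an array scalar
print type(np.array(2))      # numpy.ndarray (a 0-dim array)
print type(np.array(2)[()])  # an array scalar, e.g. numpy.int64 on
                             # most 64-bit builds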

But you may want to experiment a bit more to make sure this is
actually the problem. IME guesses about speed problems are almost
always wrong (even when I take this rule into account and only guess
when I'm *really* sure).
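
One cheap way to run that experiment is to time the pieces in
isolation and see whether the numbers add up -- something like this
(rough sketch; numbers will obviously vary by machine and build):

import timeit

setup = "from numpy import array; a = array([1., 2., 3.])"

# index alone, index plus compare, and the .item() variants -- the
# deltas between them show where the time actually goes
for stmt in ("a[1]",
             "a[1] < 35.0",
             "a.item(1)",
             "a.item(1) < 35.0"):
    print stmt, timeit.timeit(stmt, setup=setup)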

-n


