[Numpy-discussion] Numpy speed ups to simple tasks - final findings and suggestions

Fri Jan 4 16:33:25 EST 2013

On Fri, Jan 4, 2013 at 6:50 AM, Raul Cota <raul at virtualmaterials.com> wrote:
>
> On 02/01/2013 7:56 AM, Nathaniel Smith wrote:
>> But, it's almost certainly possible to optimize numpy's float64 (and
>> friends), so that they are themselves (almost) as fast as the native
>> python objects. And that would help all the code that uses them, not
>> just the ones where regular python floats could be substituted
>> instead. Have you tried profiling, say, float64 * float64 to figure
>> out where the bottlenecks are?
>
> Seems to be split between
> - (primarily) the memory allocation/deallocation of the float64 that is
> created from the operation float64 * float64. This is the reason why float64
> * Pyfloat got improved with one of my changes because PyFloat was being
> internally converted into a float64 before doing the multiplication.
>
> - the rest of the time is the actual multiplication path way.

Running a quick profile on Linux x86-64 of
  x = np.float64(5.5)
  for i in xrange(n):
     x * x
I find that ~50% of the total CPU time is inside feclearexcept(), the
function which resets the floating point error checking registers --
and most of this is inside a single instruction, stmxcsr ("store sse
control register"). It's possible that this is different on Windows
(esp. since our fpe exception handling apparently doesn't work on
Windows[1]), but the total times you measure for both
PyFloat*PyFloat and Float64*Float64 match mine almost exactly, so most
likely we have similar CPUs that are doing a similar amount of work in
both cases.
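
For anyone who wants to compare the two cases on their own machine, here is a minimal timing sketch. It is not the profile described above: it only measures wall-clock time per multiplication, so it will not show feclearexcept() or stmxcsr directly (that needs a native profiler). The helper name time_multiply and the iteration count are my own choices for illustration.

```python
import timeit

import numpy as np


def time_multiply(x, n=200_000):
    """Return the seconds taken by n evaluations of x * x."""
    return timeit.timeit("x * x", globals={"x": x}, number=n)


# Same value, once as a native Python float and once as a numpy scalar.
py_time = time_multiply(5.5)
np_time = time_multiply(np.float64(5.5))

print(f"float   * float  : {py_time:.4f} s")
print(f"float64 * float64: {np_time:.4f} s")
print(f"ratio            : {np_time / py_time:.1f}x")
```

The absolute numbers depend heavily on the CPU, the numpy version and the Python version, so only the ratio is worth comparing.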

The way we implement floating point error checking is basically:
    PyUFunc_clearfperr()
    <do the floating point operation>
    if (PyUFunc_getfperror() & BAD_STUFF) {
        <raise a warning or whatever>
    }
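
The reason for the clear step is that the hardware status flags are sticky: once set, they stay set until someone resets them. The toy model below illustrates that in pure Python. It is only a sketch under my own assumptions: ToyFlags, toy_multiply and checked_multiply are hypothetical names, and nothing here touches the real FPU registers or numpy's C code.

```python
import math


class ToyFlags:
    """Stand-in for the sticky floating point status flags (illustrative only)."""

    def __init__(self):
        self.overflow = False

    def clear(self):
        self.overflow = False


def toy_multiply(flags, a, b):
    """Multiply two floats and set the sticky overflow flag if the result overflowed."""
    result = a * b
    if math.isinf(result) and not (math.isinf(a) or math.isinf(b)):
        flags.overflow = True  # stays set until clear() is called
    return result


def checked_multiply(flags, a, b, clear_first=True):
    """Clear (optionally), do the operation, then read the flag."""
    if clear_first:
        flags.clear()
    result = toy_multiply(flags, a, b)
    return result, flags.overflow


flags = ToyFlags()
toy_multiply(flags, 1e308, 10.0)  # an earlier, unchecked operation overflows

# Without clearing first, the harmless 2.0 * 3.0 is blamed for the old overflow.
_, stale = checked_multiply(flags, 2.0, 3.0, clear_first=False)
# With clearing first, the flag reflects only this operation.
_, clean = checked_multiply(flags, 2.0, 3.0, clear_first=True)

print("flag seen without clearing:", stale)
print("flag seen with clearing:   ", clean)
```

So one clear per checked operation is needed for correct attribution; anything beyond that is overhead.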

Some points that you may find interesting though:

- The way we define these functions, both PyUFunc_clearfperr() and
PyUFunc_getfperror() clear the flags. However, for PyUFunc_getfperror,
this is just pointless, since the next checked operation starts by
calling PyUFunc_clearfperr() anyway. We could simply remove this, and
expect to see a ~25% speedup in Float64*Float64 without any downside.

- Numpy's default behaviour is to always check for and warn on floating
point errors. This seems like it's probably the correct default.
However, if you aren't worried about this for your use case, you could
disable these warnings with np.seterr(all="ignore"). (And you'll get
similar error-checking to what PyFloat does.) At the moment, that
won't speed anything up. But we could easily then fix it so that the
PyUFunc_clearfperr/PyUFunc_getfperror code checks for whether errors
are ignored, and disables itself. This together with the previous
change should get you a ~50% speedup in Float64*Float64, without
having to change any of numpy's semantics.

- Bizarrely, Numpy still checks the floating point flags on integer
operations, at least for integer scalars. So 50% of the time in
Int64*Int64 is also spent in fiddling with floating point exception
flags. That's also some low-hanging fruit right there... (to be fair,
this isn't *quite* as trivial to fix as it could be, because the
integer overflow checking code sets the floating point unit's
"overflow" flag to signal a problem, and we'd need to pull this out to
a thread-local variable or something before disabling the floating
point checks entirely in integer code. But still, not a huge problem.)
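
To make the seterr point above concrete, here is a small sketch of the user-visible behaviour. It uses np.errstate (the context-manager form of np.seterr) so the settings are restored afterwards. The helper count_warnings is a hypothetical name of mine, and the sketch only shows the semantics, not any speed difference.

```python
import math
import warnings

import numpy as np


def count_warnings(func):
    """Call func() and return how many warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return len(caught)


big = np.float64(1e308)


def overflow_with_warn():
    # Explicitly request warnings so the result does not depend on global state.
    with np.errstate(all="warn"):
        return big * big


def overflow_with_ignore():
    with np.errstate(all="ignore"):
        return big * big


n_warn = count_warnings(overflow_with_warn)
n_ignored = count_warnings(overflow_with_ignore)

# A plain Python float multiplication overflows to inf silently.
py_result = 1e308 * 1e308

print("warnings with errstate(all='warn'):  ", n_warn)
print("warnings with errstate(all='ignore'):", n_ignored)
print("plain float result is inf:", math.isinf(py_result))
```

With errors ignored, the float64 result is still inf; only the reporting changes, which is why skipping the flag handling in that mode would not alter semantics.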

-n

[1] https://github.com/numpy/numpy/issues/2350