[Numpy-discussion] Numpy speed ups to simple tasks - final findings and suggestions

Raul Cota raul at virtualmaterials.com
Sat Jan 5 01:09:17 EST 2013


On 04/01/2013 5:44 PM, Nathaniel Smith wrote:
> On Fri, Jan 4, 2013 at 11:36 PM, Raul Cota <raul at virtualmaterials.com> wrote:
>> On 04/01/2013 2:33 PM, Nathaniel Smith wrote:
>>> On Fri, Jan 4, 2013 at 6:50 AM, Raul Cota <raul at virtualmaterials.com> wrote:
>>>> On 02/01/2013 7:56 AM, Nathaniel Smith wrote:
>>>>> But, it's almost certainly possible to optimize numpy's float64 (and
>>>>> friends), so that they are themselves (almost) as fast as the native
>>>>> python objects. And that would help all the code that uses them, not
>>>>> just the ones where regular python floats could be substituted
>>>>> instead. Have you tried profiling, say, float64 * float64 to figure
>>>>> out where the bottlenecks are?
>>>> Seems to be split between:
>>>> - (primarily) the memory allocation/deallocation of the float64 that is
>>>> created by the operation float64 * float64. This is why float64 * PyFloat
>>>> got improved with one of my changes: the PyFloat was being internally
>>>> converted into a float64 before doing the multiplication.
>>>>
>>>> - the rest of the time is the actual multiplication pathway.
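
(The conversion in question is visible from pure Python: mixing a float64
with a plain Python float yields a float64, so numpy must build a
temporary float64 from the PyFloat first. A minimal sketch:)

    import numpy as np

    x = np.float64(5.5)
    # The Python float 5.5 is converted to a temporary float64 before
    # the multiplication, and the result is a float64 as well.
    print type(x * 5.5)   # <type 'numpy.float64'>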
>>> Running a quick profile on Linux x86-64 of
>>>     x = np.float64(5.5)
>>>     for i in xrange(n):
>>>        x * x
>>> I find that ~50% of the total CPU time is inside feclearexcept(), the
>>> function that clears the floating-point exception flags --
>>> and most of this is inside a single instruction, stmxcsr ("store SSE
>>> control/status register").
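
(For context: scalar math touches the floating-point status register at
all so that it can honor numpy's error-handling state. A minimal sketch of
the behavior this bookkeeping enables, assuming a numpy where float64
scalar ops check fp errors:)

    import numpy as np

    np.seterr(over='raise')
    x = np.float64(1e308)
    try:
        x * x  # overflows; the fp status flags record it
    except FloatingPointError:
        print "overflow detected via the fp status flags"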
>> I find it strange that you don't see a bottleneck in the allocation of a
>> float64.
>>
>> Is it easy for you to profile this?
>>
>> x = np.float64(5.5)
>> y = 5.5
>> for i in xrange(n):
>>       x * y
>>
>> numpy internally converts y into a temporary float64 and then discards
>> it; I seem to remember this is a bit over two times slower than x * x.
> Yeah, it seems to be dramatically slower. Using IPython's handy interface
> to the timeit[1] library:
>
> In [1]: x = np.float64(5.5)
>
> In [2]: y = 5.5
>
> In [3]: timeit x * y
> 1000000 loops, best of 3: 725 ns per loop
>
> In [4]: timeit x * x
> 1000000 loops, best of 3: 283 ns per loop

I haven't been using timeit because the bulk of what we are doing 
involves comparing against Python 2.2 and Numeric, and timeit did not 
exist back then. I can't wait to finally officially upgrade our main product.
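
(A hand-rolled alternative that works even on Python 2.2, where timeit is
unavailable -- the helper name and repeat counts here are illustrative,
not from the thread. A minimal sketch:)

    import time

    def per_call(func, n=1000000, repeats=3):
        # Best-of-N manual timing loop; on Python 3 use range() and
        # time.perf_counter() instead of xrange() and time.clock().
        best = None
        for _ in xrange(repeats):
            start = time.clock()
            for _ in xrange(n):
                func()
            elapsed = time.clock() - start
            if best is None or elapsed < best:
                best = elapsed
        return best / n  # seconds per call on the best run

    import numpy as np
    x = np.float64(5.5)
    print per_call(lambda: x * x)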


> But we already figured out how to (mostly) fix this part, right?

Correct


Cheers,

Raul



> I was
> curious about the float64 * float64 case, because that's the one that
> was still slow after those first two patches. (And, yes, as you say,
> when I run x * y in the profiler there's a huge amount of overhead
> in PyArray_GetPriority and object allocation/deallocation.)
>
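
(For context: PyArray_GetPriority is the C helper that looks up an
operand's __array_priority__ attribute to decide which type should handle
a mixed operation, and that lookup happens on every scalar op. A minimal
sketch of what the attribute controls; the subclass name is made up:)

    import numpy as np

    class HighPriority(np.ndarray):
        # PyArray_GetPriority consults this attribute on mixed-type
        # operations, which is part of the per-op overhead profiled above.
        __array_priority__ = 100.0

    a = np.arange(3.0)
    b = a.view(HighPriority)
    print type(a * b)   # HighPriority wins due to its higher priority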
>> I will try your suggestions on
>>
>> PyUFunc_clearfperr/PyUFunc_getfperror
>>
>> and see what I get. I haven't gotten around to putting together a pull
>> request for the previous stuff. If these changes are worthwhile, would
>> it be OK if I also create one for this?
> First, to be clear, it's always OK to do a pull request -- the worst
> that can happen is that we all look it over carefully, decide that
> it's the wrong approach, and don't merge. In my earlier email I just
> wanted to give you some clear suggestions on a good way to get
> started; we wouldn't have kicked you out or something if you did
> it differently :-)
>
> And, yes, assuming my analysis so far is correct we would definitely
> be interested in major speedups that have no other user-visible
> effects... ;-)
>
> -n
>
> [1] http://docs.python.org/2/library/timeit.html



