[Numpy-discussion] optimizing power() for complex and real cases

Wed Feb 15 10:03:04 EST 2006

David M. Cooke wrote:

>[changed subject to reflect this better]
>
>Tim Hochberg <tim.hochberg at cox.net> writes:
>  
>
>>David M. Cooke wrote:
>>    
>>
>>>Tim Hochberg <tim.hochberg at cox.net> writes:
>>>      
>>>
>>>>David M. Cooke wrote:
>>>> 2, Don't distinguish between scalars and arrays -- that just makes
>>>>things harder to explain.
>>>>        
>>>>
>>>Makes the optimizations better, though.
>>>
>>>
>>>      
>>>
>>Ah, Because you can hoist all the checks for what type of optimization
>>to do, if any, out of the core loop, right? That's a good point. Still
>>I'm not keen on a**b having different performance *and* different
>>results depending on whether b is a scalar or matrix. The first thing
>>to do is to measure how much overhead doing the optimization element
>>by element is going to add. Assuming that it's signifigant that leaves
>>us with the familiar dilema: fast, simple or general purpose; pick any
>>two.
>>
>>1. Do what I've proposed: optimize things at the c_pow level. This is
>>general purpose and relatively simple to implement (since we can steal
>>most of the code from complexobject.c). It may have a signifigant
>>speed penalty versus 2 though:
>>
>>2. Do what you've proposed: optimize things at the ufunc level. This
>>fast and relatively simple to implement. It's more limited in scope
>>and a bit harder to explain than 2.
>>
>>3. Do both. This is straightforward, but adds a bunch of extra code
>>paths with all the attendant required testing and possibility for
>>bugs. So, fast, general purpose, but not simple.
>>    
>>
>
>Start with #1, then try #2. The problem with #2 is that you still have
>to include #1: if you're doing x**y when y is an array, then you have
>do if (y==2) etc. checks in your inner loop anyways. In that case, you
>might as well do it in nc_pow. At that point, it may be better to move
>the #1 optimization to the level of x.__pow__ (see below).
>  
>
OK.

>>>The speed comes from the inlining of how to do the power _outside_ the
>>>inner loop. The reason x**2, etc. are slower currently is there is a
>>>function call in the inner loop. Your's and mine C library's pow()
>>>function mostly likely does something like I have above, for a single
>>>case: pow(x, 2.0) is calculated as x*x. However, each time through it
>>>has do decide _how_ to do it.
>>>
>>>
>>>      
>>>
>>Part of our difference in perspective comes from the fact that I've
>>just been staring at the guts of complex power. In this case you
>>always have function calls at present, even for s*s. (At least I'm
>>fairly certain that doesn't get inlined although I haven't checked).
>>Since much of the work I do is with complex matrices,  it's
>>appropriate that I focus on this.
>>    
>>
>
>Ah, ok, now things are clicking. Complex power is going to be harder,
>because making sure that going from x**2.001 to x**2 doesn't do some
>funny complex branch cut stuff (I work in reals all the time :-)
>  
>
We're always dealing with the principle branch though, so probably we 
can just ignore any branch cut issues. We'll see I suppose.

>For the real numbers, these type of optimizations *are* a big win, and
>don't have the same type of continuity problems. I'll put them into numpy soon.
>  
>
Complex power is something like thirty times slower than s*s, so there 
is some room for optimization there. Some peril too though, as you note.

>  
>
>>Have you measured the effect of a function call on the speed here, or
>>is that just an educated guess. If it's an educated guess, it's
>>probably worth determining how much of speed hit the function call
>>actually causes.  I was going to try to get a handle on this by
>>comparing multiplication of Complex numbers (which requires a function
>>call plus more math), with multiplication of Floats which does not.
>>Perversly, the Complex multiplication came out marginally faster,
>>which is hard to explain whichever way you look at it.
>>
>>    
>>
>>>>>timeit.Timer("a*b", "from numpy import arange; a =
>>>>>          
>>>>>
>>arange(10000)+0j; b = arange(10000)+0j").time
>>it(10000)
>>3.2974959107959876
>>    
>>
>>>>>timeit.Timer("a*b", "from numpy import arange; a = arange(10000);
>>>>>b
>>>>>          
>>>>>
>>= arange(10000)").timeit(100
>>00)
>>3.4541194481425919
>>    
>>
>
>You're not multiplying floats in the last one: you're multiplying
>integers. You either need to use a = arange(10000.0), or a =
>arange(10000.0, dtype=float) (to be more specific).
>  
>
Doh! Now that's embarassing. Well, when I actually measure float 
multiplication, it's between two and ten times as fast. For small arrays 
(N=1000) the difference is relatively small (3.5x), I assume because the 
setup overhead starts to dominate. For midsized array (N=10,000) the 
difference is larger (10x). For large arrays (N=100,000) the difference 
becomes small (2x). Presumably the memory is no longer fitting in the 
cache and I'm  having memory bandwidth issues.

>Your integer numbers are about 3x better than mine, though (difference
>in architecture, maybe? I'm on an Athlon64).
>  
>
I'm on a P4.

>  
>
>>>That's why I limited the optimization to scalar exponents: array
>>>exponents would mean it's about as slow as the pow() call, even if the
>>>checks were inlined into the loop. It would probably be even slower
>>>for the non-optimized case, as you'd check for the special exponents,
>>>then call pow() if it fails (which would likely recheck the exponents).
>>>
>>>
>>>      
>>>
>>Again, here I'm thinking of the complex case. In that case at least, I
>>don't think that the non-optimized case would take a noticeable speed
>>hit. I would put it into pow itself, which already special cases a==0
>>and b==0. For float pow it might, but that's already slow, so I doubt
>>that it would make much difference.
>>    
>>
>
>It does make a bit of difference with float pow: the general case
>slows down a bit.
>  
>
OK.  I was hoping that the difference would not be noticeable. I suspect 
that in the complex pow case, that will be the case since complex pow is 
so slow to begin with and since it already is doing some testing on the 
exponent.

>>>Maybe a simple way to add this is to rewrite x.__pow__() as something
>>>like the C equivalent of
>>>
>>>def __pow__(self, p):
>>>   if p is not a scalar:
>>>      return power(self, p)
>>>   elif p == 1:
>>>      return p
>>>   elif p == 2:
>>>      return square(self)
>>>   elif p == 3:
>>>      return cube(self)
>>>   elif p == 4:
>>>      return power_4(self)
>>>   elif p == 0:
>>>      return ones(self.shape, dtype=self.dtype)
>>>   elif p == -1:
>>>      return 1.0/self
>>>   elif p == 0.5:
>>>      return sqrt(self)
>>>
>>>and add ufuncs square, cube, power_4 (etc.).
>>>      
>>>
>>It sounds like we need to benchmark some stuff and see what we come up
>>with. One approach would be for each of us to implement this for one
>>time (say float) and see how the approaches compare speed wise. That's
>>not entirely fair as my approach will do much better at complex than
>>float I believe, but it's certainly easier.
>>    
>>
>
>The way the ufuncs are templated, we can split out the complex
>routines easily enough.
>
>Here's what I propose:
>
>- add a square() ufunc, where square(x) == x*x (but faster of course)
>- I'll fiddle with the floats
>- you fiddle with the complex numbers :-)
>
>I've created a new branch in svn, at
>http://svn.scipy.org/svn/numpy/branches/power_optimization
>to do this fiddling. The changes below I mention are all checked in as
>revision 2104 (http://projects.scipy.org/scipy/numpy/changeset/2104).
>
>I've added a square() ufunc to the power_optimization branch because
>I'd argue that it's probably *the* most common use of **. I've
>implemented it, and it's as fast as a*a for reals, and runs in 2/3 the
>time as a*a for complex (which makes sense: squaring a complex numbers
>has 3 real multiplications, while multiplying has 4 in the (simple)
>scheme [1]).
>
>At least with square(), there's no argument about continuity, as it
>only squares :-).
>  
>
Actually that's not entirely true,  This get's back to the odd 
inaccuracy that started this thread:

 >>> array([1234567j])**2
array([ -1.52415568e+12+0.00018665j])

If you special case this the extraneous imaginary value will vanish, but 
raising things to the 2.000001 power or 1.999999 power will still be off 
a similar amount. I played with this a bunch though and I couldn't come 
up with a plausible way for this to make things break. I suspect I could 
come up with an implausible one though.

[some time passes while I sleep and otherwise try to live a normal life....]

>The next step I'd suggest is special-casing x.__pow__, like I suggest
>above. We could just test for integer scalar exponents (0, 1, 2), and
>just special-case those (returning ones(), x.copy(), square(x)), and
>leave all the rest to power().
>  
>
As I've been thinking about this some more, I think the correct thing to 
do is not to mess with  the power ufuncs at all. Rather in x.__pow__ 
(since I don't know that there's anywhere else to do it), after the 
above checks check the types of the array and in the cases where the 
first argument is a float or complex and the second argument is some 
sort of integer array. This would be dispatched to some other helper 
function instead of the normal pow_ufunc. In other words, optimize:

A**2, A**2.0, A**(2.0+0j), etc

and

A**array([1,2,3])

but not

A**array[1.0, 2.0, 3.0]

I think that this takes care of the optimization slowing down power for 
general floats and optimizes the only array-array case that really matter.

>I've also checked in code to the power_optimization branch that
>special cases power(x, <scalar exponent>), or anytime the basic ufunc
>gets called with a stride of 0 for the exponent. It doesn't do complex
>x, so no problems on your side, but it's a good chunk faster for this
>case than what we've got now. One reason I'm also looking at adding
>square() is because my optimization of power() makes x**2 run (only)
>1.5 slower than x*x (and I can't for the life of me see where that 0.5
>is coming from! It should be 1.0 like square()!).
>  
>
I just checked out your branch and I'll fiddle with the complex stuff as 
I've got time. I've got relatives in town this week, so my extra cycles 
just dropped precipitously.

>[1] which brings up another point. Would using the 3-multiplication
>version for complex multiplication be good? There might be some
>effects with cancellation errors due to the extra subtractions...
>  
>
I'm inclined to leave this be for now. Both because I'm unsure of the 
rounding issues and because I'm not sure it would actually be faster. It 
has one less multiplication, but several more additions, so it would 
depend on the relative speed add/sub with multiplication and how things 
end up getting scheduled in the FP pipeline. At some point it's probably 
worth trying; if it turns out to be signifigantly faster we can think 
about rounding then. If it's not faster then no need to think.

-tim