# [Numpy-discussion] Re: indexing problem

Tim Hochberg tim.hochberg at cox.net
Tue Feb 14 08:58:03 EST 2006

```David M. Cooke wrote:

>Tim Hochberg <tim.hochberg at cox.net> writes:
>
>
>
>>David M. Cooke wrote:
>>
>>
>>
>>>Gary Ruben <gruben at bigpond.net.au> writes:
>>>
>>>
>>>
>>>
>>>
>>>>Tim Hochberg wrote:
>>>><snip>
>>>>
>>>>
>>>>
>>>>
>>>>>However, I'm not convinced this is a good idea for numpy. This would
>>>>>introduce a discontinuity in a**b that could cause problems in some
>>>>>cases. If, for instance, one were running an iterative solver of
>>>>>some sort (something I've been known to do), and b was a free
>>>>>variable, it could get stuck at b = 2 since things would go
>>>>>nonmonotonic there.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>I don't quite understand the problem here. Tim says Python special
>>>>cases integer powers but then talks about the problem when b is a
>>>>floating type. I think special casing x**2 and maybe even x**3 when
>>>>the power is an integer is still a good idea.
>>>>
>>>>
>>>>
>>>>
>>>Well, what I had done with Numeric did special case x**0, x**1,
>>>x**(-1), x**0.5, x**2, x**3, x**4, and x**5, and only when the
>>>exponent was a scalar (so x**y where y was an array wouldn't be). I
>>>think this is very useful, as I don't want to microoptimize my code to
>>>x*x instead of x**2. The reason for just scalar exponents was so
>>>choosing how to do the power was lifted out of the inner loop. With
>>>that, x**2 was as fast as x*x.
>>>
>>>
>>>
>>>
>>This is getting harder to object to since, try as I might I can't get
>>a**b to go nonmontonic in the vicinity of b==2. I run out of floating
>>point resolution before the slight shift due to special casing at 2
>>results in nonmonoticity. I suspect that I could manage it with enough
>>work, but it would require some unlikely function of a**b. I'm not
>>sure if I'm really on board with this, but let me float a slightly
>>modified proposal anyway:
>>
>>   1. numpy.power stays as it is now. That way in the rare case that
>>someone runs into trouble they can drop back to power. Alternatively
>>there could be rawpower and power where rawpower has the current
>>behaviour. While the name rawpower sounds cool/cheesy, power is used
>>infrequently enough that I doubt it matters whether it has these
>>special case optimazations.
>>
>>
>
>+1
>
>
>
>>  2, Don't distinguish between scalars and arrays -- that just makes
>>things harder to explain.
>>
>>
>
>Makes the optimizations better, though.
>
>
Ah, Because you can hoist all the checks for what type of optimization
to do, if any, out of the core loop, right? That's a good point. Still
I'm not keen on a**b having different performance *and* different
results depending on whether b is a scalar or matrix. The first thing to
do is to measure how much overhead doing the optimization element  by
element is going to add. Assuming that it's signifigant that leaves us
with the familiar dilema: fast, simple or general purpose; pick any two.

1. Do what I've proposed: optimize things at the c_pow level. This is
general purpose and relatively simple to implement (since we can steal
most of the code from complexobject.c). It may have a signifigant speed
penalty versus 2 though:

2. Do what you've proposed: optimize things at the ufunc level. This
fast and relatively simple to implement. It's more limited in scope and
a bit harder to explain than 2.

3. Do both. This is straightforward, but adds a bunch of extra code
paths with all the attendant required testing and possibility for bugs.
So, fast, general purpose, but not simple.

>
>
>>  3. Python itself special cases all integral powers between -100 and
>>100. Beg/borrow/steal their code. This makes it easier to explain
>>since all smallish integer powers are just automagically faster.
>>
>>  4. Is the performance advantage of special casing a**0.5
>>signifigant? If so use the above trick to special case all half
>>integral  and integral powers between -N and N. Since sqrt probably
>>chews up some time the cutoff. The cutoff probably shifts somewhat if
>>we're optimizing half integral as well as integral powers. Perhaps N
>>would be 32 or 64.
>>
>>The net result of this is that a**b would be computed using a
>>combination of repeated multiplication and sqrt for  real integral and
>>half integral values of b between -N and N. That seems simpler to
>>explain and somewhat more useful as well.
>>
>>It sounds like a fun project although I'm not certain yet that it's a
>>good idea.
>>
>>
>
>Basically, my Numeric code looked like this:
>
>#define POWER_UFUNC3(prefix, basetype, exptype, outtype)                \
>static void prefix##_power(char **args, int *dimensions,                \
>                           int *steps, void *func) {                    \
>    int i, cis1=steps, cis2=steps, cos=steps, n=dimensions; \
>    int is1=cis1/sizeof(basetype);                                      \
>    int is2=cis2/sizeof(exptype);                                       \
>    int os=cos/sizeof(outtype);                                         \
>    basetype *i1 = (basetype *)(args);                               \
>    exptype *i2=(exptype *)(args);                                   \
>    outtype *op=(outtype *)(args);                                   \
>    if (is2 == 0) {                                                     \
>        exptype exponent = i2;                                       \
>        if (POWER_equal(exponent, 0.0)) {                               \
>            for (i = 0; i < n; i++, op += os) {                         \
>                POWER_one((*op))                                        \
>            }                                                           \
>        } else if (POWER_equal(exponent, 1.0)) {                        \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                *op = *i1;                                              \
>            }                                                           \
>        } else if (POWER_equal(exponent, 2.0)) {                        \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                POWER_square((*op),(*i1))                               \
>            }                                                           \
>        } else if (POWER_equal(exponent, -1.0)) {                       \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                POWER_inverse((*op),(*i1))                              \
>            }                                                           \
>        } else if (POWER_equal(exponent, 3.0)) {                        \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                POWER_cube((*op),(*i1))                                 \
>            }                                                           \
>        } else if (POWER_equal(exponent, 4.0)) {                        \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                POWER_fourth((*op),(*i1))                               \
>            }                                                           \
>        } else if (POWER_equal(exponent, 0.5)) {                        \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                POWER_sqrt((*op),(*i1))                                 \
>            }                                                           \
>        } else {                                                        \
>            for (i = 0; i < n; i++, i1 += is1, op += os) {              \
>                POWER_pow((*op), (*i1), (exponent))                     \
>            }                                                           \
>        }                                                               \
>    } else {                                                            \
>        for (i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {                 \
>            POWER_pow((*op), (*i1), (*i2))                              \
>        }                                                               \
>    }                                                                   \
>}
>#define POWER_UFUNC(prefix, type) POWER_UFUNC3(prefix, type, type, type)
>
>#define FTYPE float
>#define POWER_equal(x,y)   x == y
>#define POWER_one(o)       o = 1.0;
>#define POWER_square(o,x)  o = x*x;
>#define POWER_inverse(o,x) o = 1.0 / x;
>#define POWER_cube(o,x)    FTYPE y=x; o = y*y*y;
>#define POWER_fourth(o,x)  FTYPE y=x, s = y*y; o = s * s;
>#define POWER_sqrt(o,x)    o = sqrt(x);
>#define POWER_pow(o,x,n)   o = pow(x, n);
>POWER_UFUNC(FLOAT, float)
>POWER_UFUNC3(FLOATD, float, double, float)
>
>plus similiar definitions for float, double, complex float, and
>complex double. Using the POWER_square, etc. macros means the complex
>
>The speed comes from the inlining of how to do the power _outside_ the
>inner loop. The reason x**2, etc. are slower currently is there is a
>function call in the inner loop. Your's and mine C library's pow()
>function mostly likely does something like I have above, for a single
>case: pow(x, 2.0) is calculated as x*x. However, each time through it
>has do decide _how_ to do it.
>
>
Part of our difference in perspective comes from the fact that I've just
been staring at the guts of complex power. In this case you always have
function calls at present, even for s*s. (At least I'm fairly certain
that doesn't get inlined although I haven't checked). Since much of the
work I do is with complex matrices,  it's appropriate that I focus on this.

Have you measured the effect of a function call on the speed here, or is
that just an educated guess. If it's an educated guess, it's probably
worth determining how much of speed hit the function call actually
causes.  I was going to try to get a handle on this by comparing
multiplication of Complex numbers (which requires a function call plus
more math), with multiplication of Floats which does not. Perversly, the
Complex multiplication came out marginally faster, which is hard to
explain whichever way you look at it.

>>> timeit.Timer("a*b", "from numpy import arange; a =
arange(10000)+0j; b = arange(10000)+0j").time
it(10000)
3.2974959107959876
>>> timeit.Timer("a*b", "from numpy import arange; a = arange(10000); b
= arange(10000)").timeit(100
00)
3.4541194481425919

>That's why I limited the optimization to scalar exponents: array
>exponents would mean it's about as slow as the pow() call, even if the
>checks were inlined into the loop. It would probably be even slower
>for the non-optimized case, as you'd check for the special exponents,
>then call pow() if it fails (which would likely recheck the exponents).
>
>
Again, here I'm thinking of the complex case. In that case at least, I
don't think that the non-optimized case would take a noticeable speed
hit. I would put it into pow itself, which already special cases a==0
and b==0. For float pow it might, but that's already slow, so I doubt
that it would make much difference.

>Maybe a simple way to add this is to rewrite x.__pow__() as something
>like the C equivalent of
>
>def __pow__(self, p):
>    if p is not a scalar:
>       return power(self, p)
>    elif p == 1:
>       return p
>    elif p == 2:
>       return square(self)
>    elif p == 3:
>       return cube(self)
>    elif p == 4:
>       return power_4(self)
>    elif p == 0:
>       return ones(self.shape, dtype=self.dtype)
>    elif p == -1:
>       return 1.0/self
>    elif p == 0.5:
>       return sqrt(self)
>
>and add ufuncs square, cube, power_4 (etc.).
>
>

It sounds like we need to benchmark some stuff and see what we come up
with. One approach would be for each of us to implement this for one
time (say float) and see how the approaches compare speed wise. That's
not entirely fair as my approach will do much better at complex than
float I believe, but it's certainly easier.

regards,

-tim

```