[Numpy-discussion] RE: default axis for numarray

eric jones eric at enthought.com
Tue Jun 11 15:01:02 EDT 2002


> From: cbarker at localhost.localdomain
> [mailto:cbarker at localhost.localdomain]
> 
> eric jones wrote:
> > The default axis choice influences how people choose to lay out their
> > data in arrays.  If the default is to sum down columns, then users lay
> > out their data so that this is the order of computation.
> 
> This is absolutely true. I definitely choose my data layout so that the
> various rank-reducing operators do what I want. Another reason to have
> consistency. So I don't really care which way is default, so the default
> might as well be the better-performing option.
> 
> Of course, compatibility with previous versions is helpful too...arrrgg!
> 
> What kind of a performance difference are we talking here anyway?

Guess I ought to test instead of just saying it is so...

I ran the following test of summing 200 sets of 10000
numbers.  I expected a speed-up of about 2... I didn't get it.  They
are pretty much the same speed on my machine (more later).

C:\WINDOWS\system32>python
ActivePython 2.2.1 Build 222 (ActiveState Corp.) based on
Python 2.2.1 (#34, Apr 15 2002, 09:51:39) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from Numeric import *
>>> import time
>>> a = ones((10000,200),Float) * arange(10000)[:,NewAxis]
>>> b = ones((200,10000),Float) * arange(10000)[NewAxis,:]
>>> t1 = time.clock();x=sum(a,axis=0);t2 = time.clock();print t2-t1
0.0772411018719
>>> t1 = time.clock();x=sum(b,axis=-1);t2 = time.clock();print t2-t1
0.079615705348

I also tried FFT, and did see a difference -- a speed-up of 1.5+:

>>> q = ones((1024,1024),Float)
>>> t1 = time.clock();x = FFT.fft(q,axis=0);t2 = time.clock();print t2-t1
0.907373143793
>>> t1 = time.clock();x = FFT.fft(q,axis=-1);t2 = time.clock();print t2-t1
0.581641800843
>>> .907/.581
1.5611015490533564

Same in SciPy:

>>> from scipy import *
>>> a = ones((1024,1024),Float)
>>> import time
>>> t1 = time.clock(); q = fft(a,axis=0); t2 = time.clock();print t2-t1
0.870259488287
>>> t1 = time.clock(); q = fft(a,axis=-1); t2 = time.clock();print t2-t1
0.489512214541
>>> t1 = time.clock(); q = fft(a,axis=0); t2 = time.clock();print t2-t1
0.849266317367
>>> .849/.489
1.7361963190184049

So why is sum() the same speed for both cases? I don't know.  I wrote
a quick C program that loops much the way Numeric's loops do, and I saw
about a factor-of-4 improvement by summing rows instead of columns:

C:\home\eric\wrk\axis_speed>gcc -O2 axis.c
C:\home\eric\wrk\axis_speed>a
summing rows (sec): 0.040000
summing columns (sec): 0.160000
pass

These numbers are more like what I expected to see in the Numeric tests,
but they are strange when compared to the Numeric timings -- the row sum
is twice as fast as Numeric while the column sum is twice as slow.
Because all the work is done in C and we're summing reasonably long
arrays, the Numeric and C versions should be roughly the same speed.
I can understand why summing rows is twice as fast in my C routine --
the Numeric loop code is not going to win awards for being optimal.
What I don't understand is why the column summation is twice as slow in
my C code as in Numeric.  This should not be.  I've posted it below in
case someone can enlighten me.

I think, in general, you should see a speed-up of 1.5+ when summing
over the "faster" axis.  This holds true for fft in Python and for my
sum in C.  As to why I don't see it in Numeric's sum(), I'm not sure.
It is certainly true that non-strided access makes the best use of
the cache and is *usually* faster.

eric


--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    double *a, *sum1, *sum2;
    int i, j, si, sj, ind, I, J;
    int small=200, big=10000;
    time_t t1, t2;
    
    I = small;
    J = big;
    si = big;
    sj = 1;
    a = (double*)malloc(I*J*sizeof(double));
    sum1 = (double*)malloc(small*sizeof(double));
    sum2 = (double*)malloc(small*sizeof(double));
    
    //set memory
    for(i = 0; i < I; i++)
    {
        sum1[i] = 0;
        sum2[i] = 0;
        ind = si * i;
        for(j = 0; j < J; j++)
        {
            a[ind] = (double)j;            
            ind += sj;            
        }
    }
    
    t1 = clock();
    for(i = 0; i < I; i++)
    {
        sum1[i] = 0;
        ind = si * i;
        for(j = 0; j < J; j++)
        {
            sum1[i] += a[ind];            
            ind += sj;            
        }
    }
    t2 = clock();
    printf("summing rows (sec): %f\n", (t2-t1)/(float)CLOCKS_PER_SEC);

    
    I = big;
    J = small;
    sj = big;
    si = 1;
    t1 = clock();

    //set memory
    for(i = 0; i < I; i++)
    {        
        ind = si * i;
        for(j = 0; j < J; j++)
        {
            a[ind] = (double)i;            
            ind += sj;            
        }
    }

    for(j = 0; j < J; j++)
    {
        sum2[j] = 0;
        ind = sj * j;
        for(i = 0; i < I; i++)
        {
            sum2[j] += a[ind];               
            ind += si;            
        }
    }
    t2 = clock();
    printf("summing columns (sec): %f\n", (t2-t1)/(float)CLOCKS_PER_SEC);
    
    for (i=0; i < small; i++)
    {
        if(sum1[i] != sum2[i])
            printf("failure %d, %f %f\n", i, sum1[i], sum2[i]);
    }            
    printf("pass %f\n", sum1[0]);
    return 0;
}




