[Numpy-discussion] strange performance on mac 2.5/2.6 32/64 bit

Tue Nov 3 11:29:49 EST 2009

Hi,

I'm not sure if this is of much interest but it's been really puzzling
me so I thought I'd ask.

In an earlier post I described how I was surprised a simple f2py
wrapped fortran bincount was 4x faster than np.bincount - but that
differential only seemed to be on my mac; on moving to linux they both
took more or less the same time. I'm trying to work out if it is worth
moving some of my bottlenecks to fortran (most of which are np
builtins). So far it looks like it is - but only on my mac and only
32bit (see below).
The only explanation I could think of was that the gcc-4.0 used to build
numpy on the mac didn't optimise so well, so after upgrading to snow
leopard I've been looking at this again. I was hoping to get the same
performance on my mac as on linux, which would make the np c code a
couple of times faster.

So far, with Python 2.6.3 in 64 bit - numpy seems to be significantly
slower and my fortran code _much_ slower - even from the same
compiler. Can anyone help me understand what is going on?

I have only been able to build 32 bit numpy against 2.5.4 with apple
gcc-4.0 and 64 bit numpy against 2.6.3 universal with gcc-4.2. I
haven't been able to get a numpy I can import on 2.6.3 in 32 bit mode
( http://projects.scipy.org/numpy/ticket/1221 ).

Here are the results for python.org 32 bit 2.5.4, numpy compiled with
apple gcc 4.0, f2py using att gfortran 4.2:
In [2]: timeit x = np.random.random_integers(0,1023,100000000).astype(int)
1 loops, best of 3: 2.86 s per loop
In [3]: x = np.random.random_integers(0,1023,100000000).astype(int)
In [4]: timeit np.bincount(x)
1 loops, best of 3: 435 ms per loop
In [6]: timeit gf42.bincount(x,1024)
10 loops, best of 3: 129 ms per loop
In [7]: np.__version__
Out[7]: '1.4.0.dev7618'

And for self-built (apple gcc 4.2) 64 bit 2.6.3, numpy compiled with
apple gcc 4.2, f2py using the same att gfortran 4.2:
In [3]: timeit x = np.random.random_integers(0,1023,100000000).astype(int)
1 loops, best of 3: 3.91 s per loop  # 37% slower than 32bit
In [4]: x = np.random.random_integers(0,1023,100000000).astype(int)
In [5]: timeit np.bincount(x)
1 loops, best of 3: 582 ms per loop # 34 % slower than 32 bit
In [8]: timeit gf42_64.bincount(x,1024)
1 loops, best of 3: 803 ms per loop # 522% slower than 32 bit
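
The percentages in the comments above can be re-derived from the raw timings. A minimal sketch follows; the helper name `percent_slower` and the dictionary labels are made up for illustration, and only the six timings come from the sessions above.

```python
def percent_slower(baseline, measured):
    """Return how much slower `measured` is than `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Timings copied from the two sessions above: (32 bit, 64 bit).
timings = {
    "random_integers (s)": (2.86, 3.91),
    "np.bincount (ms)": (435.0, 582.0),
    "f2py bincount (ms)": (129.0, 803.0),
}

slowdowns = {
    name: percent_slower(t32, t64) for name, (t32, t64) in timings.items()
}

for name, pct in slowdowns.items():
    print(f"{name}: {pct:.0f}% slower in 64 bit")
```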

So why is there such a big difference in performance? I'd really like to
know why the fortran compiled with the same compiler is so much slower
in 64 bit mode. As far as I can tell the flags used are the same. Also,
why is numpy slower? I was surprised that I was able to import the 64
bit universal module built with f2py from 2.6 inside 32 bit 2.5, and
there it was quick again - so it seems the x86_64 code generated by
the fortran compiler is much slower (rather than any wrappers or
such).
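
One thing that may be worth ruling out before blaming the generated x86_64 code: `.astype(int)` gives a platform-dependent integer width (typically 4 bytes in a 32 bit Python and 8 bytes in a 64 bit Python on mac and linux), while a default fortran `integer` is normally 4 bytes. If the widths differ, f2py has to cast the input into a temporary copy on every call, and that would show up in the timing. This is an assumption to test, not a confirmed cause. The sketch below reports the relevant properties; the helper names are hypothetical, and `np.random.randint` is used only as a stand-in for the `random_integers` call in the sessions above.

```python
import struct

import numpy as np


def interpreter_bits():
    """Pointer width of the running Python interpreter, in bits."""
    return struct.calcsize("P") * 8


def describe_int_array(x):
    """Summarise the properties of `x` that matter when passing it to fortran."""
    return {
        "dtype": str(x.dtype),
        "itemsize": x.dtype.itemsize,
        # Assumption: the fortran `integer` arguments are 4 bytes wide.
        "matches_int32": bool(x.dtype == np.int32),
        "c_contiguous": bool(x.flags["C_CONTIGUOUS"]),
    }


# Small array with values in 0..1023, built the same way as in the timings.
x = np.random.randint(0, 1024, size=1000).astype(int)
print(interpreter_bits(), describe_int_array(x))
```

If `matches_int32` is False in the slow configuration, repeating the timing with `x.astype(np.int32)` would show whether the cast accounts for the difference.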

I tried using some more recent gfortrans from macports - but could
only use them to build modules against the 64 bit python/numpy since I
couldn't find a way to get f2py to force 32 bit output. But the
performance was more or less the same (always several times slower than
the 32 bit att gfortran build).

Any advice appreciated.

Cheers

Robin

--------
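subroutine bincount (x,c,n,m)
    ! Count how often each value 0..m-1 occurs in x(0:n-1).
    ! Values in x are assumed to lie in 0..m-1; no bounds check is done.
    implicit none
    integer, intent(in) :: n,m
    integer, dimension(0:n-1), intent(in) :: x
    integer, dimension(0:m-1), intent(out) :: c
    integer :: i

    c = 0
    do i = 0, n-1
        c(x(i)) = c(x(i)) + 1
    end do
end subroutine bincount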
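
For checking results from any of the builds on a small input, the subroutine above can be mirrored in plain Python. This is a slow reference sketch only: the name `bincount_ref` is hypothetical, and the range check is an addition that the fortran version does not perform.

```python
def bincount_ref(x, m):
    """Count occurrences of each value 0..m-1 in x, mirroring the subroutine."""
    c = [0] * m
    for v in x:
        # Added safety check; the fortran loop indexes c(x(i)) unchecked.
        if not 0 <= v < m:
            raise ValueError(f"value {v} outside 0..{m - 1}")
        c[v] += 1
    return c


print(bincount_ref([0, 1, 1, 3], 4))
```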