[Numpy-discussion] When memory access is a bottleneck

Keith Goodman kwgoodman at gmail.com
Fri Feb 25 12:30:05 EST 2011


A topic that often comes up on the list is that arr.sum(axis=-1) is
faster than arr.sum(axis=0). For C-ordered arrays, moving along the
last axis takes the smallest steps in memory, and small steps keep the
data in cache longer. Can I use that fact to speed up calculations
along axis=0?

As a test I used the nanmean function from the Bottleneck package. I
created a second version that calculates the nanmean of two columns at
a time (code below), figuring that would make more efficient use of
the cache. It is faster for many array sizes, but not for all. Is
there anything I can do about that?
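Here is the same two-columns-at-a-time idea sketched in plain Python/NumPy rather than Cython, for readability (the function name is mine, and it assumes an even number of columns and no all-NaN column):

```python
import numpy as np

def nanmean_axis0_double(a):
    """Column-wise nanmean, accumulating two adjacent columns per
    pass over the rows -- a pure-Python sketch of the unrolled loop."""
    n0, n1 = a.shape
    y = np.empty(n1)
    for i1 in range(0, n1, 2):
        asum = count = 0.0
        asum2 = count2 = 0.0
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:            # NaN is the only value != itself
                asum += ai
                count += 1
            ai = a[i0, i1 + 1]      # second column, same cache line pass
            if ai == ai:
                asum2 += ai
                count2 += 1
        y[i1] = asum / count if count else np.nan
        y[i1 + 1] = asum2 / count2 if count2 else np.nan
    return y

a = np.random.rand(6, 4)
assert np.allclose(nanmean_axis0_double(a), np.nanmean(a, axis=0))
```

Each pass over the rows now reads two adjacent float64 values per row, so more of each fetched cache line is actually used.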

Is the speedup really due to cache? The smallest arrays I tried
should fit entirely in cache, so perhaps unrolling the loop is helping
the compiler be more efficient? Yeah, so obviously, I don't understand
what is going on.

Faster:

>> a = np.random.rand(10,10)
>> timeit nanmean_2d_float64_axis0(a)
1000000 loops, best of 3: 1.24 us per loop
>> timeit nanmean_2d_float64_axis0_double(a)
1000000 loops, best of 3: 1.17 us per loop

>> a = np.random.rand(100,100)
>> timeit nanmean_2d_float64_axis0(a)
100000 loops, best of 3: 16 us per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100000 loops, best of 3: 15.7 us per loop

>> a = np.random.rand(1000,1000)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 8.57 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 5.02 ms per loop

>> a = np.random.rand(10000,100)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 3.52 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 3.35 ms per loop

Slower:

>> a = np.random.rand(100,10000)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 2.22 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 2.57 ms per loop

Code (Bottleneck simplified BSD license):

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):
    "Mean of 2d array with dtype=float64 along axis=0 ignoring NaNs."
    cdef int count = 0
    cdef np.float64_t asum = 0, ai
    cdef Py_ssize_t i0, i1
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
                                              NPY_float64, 0)
    for i1 in range(n1):
        asum = 0
        count = 0
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:  # NaN is the only value not equal to itself
                asum += ai
                count += 1
        if count > 0:
            y[i1] = asum / count
        else:
            y[i1] = NAN
    return y

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0_double(np.ndarray[np.float64_t, ndim=2] a):
    "Mean of 2d array with dtype=float64 along axis=0 ignoring NaNs."
    cdef int count = 0, count2 = 0
    cdef np.float64_t asum = 0, asum2 = 0, ai
    cdef Py_ssize_t i0, i1, i11
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
                                              NPY_float64, 0)
    for i1 in range(0, n1 - n1 % 2, 2):
        asum = 0
        count = 0
        asum2 = 0
        count2 = 0
        i11 = i1 + 1
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:
                asum += ai
                count += 1
            ai = a[i0, i11]
            if ai == ai:
                asum2 += ai
                count2 += 1
        if count > 0:
            y[i1] = asum / count
        else:
            y[i1] = NAN
        if count2 > 0:
            y[i11] = asum2 / count2
        else:
            y[i11] = NAN
    if n1 % 2:
        # pick up the last column when n1 is odd
        i1 = n1 - 1
        asum = 0
        count = 0
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:
                asum += ai
                count += 1
        if count > 0:
            y[i1] = asum / count
        else:
            y[i1] = NAN
    return y
