When memory access is a bottleneck
A topic that often comes up on the list is that arr.sum(axis=-1) is faster than arr.sum(axis=0). For C-ordered arrays, moving along the last axis moves the smallest amount in memory, and moving small amounts in memory keeps the data in cache longer.

Can I use that fact to speed up calculations along axis=0? As a test I used the nanmean function from the Bottleneck package. I created a second version of the function that calculates the nanmean of two columns at a time (code below), figuring that would be a more efficient use of cache.

It is faster for many array sizes but not for all. Is there anything I can do about that? Is the speedup really due to cache? The smallest arrays I tried should fit entirely in cache, so perhaps unrolling the loop is helping the compiler be more efficient? Yeah, so obviously, I don't understand what is going on.

Faster:
>> a = np.random.rand(10,10)
>> timeit nanmean_2d_float64_axis0(a)
1000000 loops, best of 3: 1.24 us per loop
>> timeit nanmean_2d_float64_axis0_double(a)
1000000 loops, best of 3: 1.17 us per loop

>> a = np.random.rand(100,100)
>> timeit nanmean_2d_float64_axis0(a)
100000 loops, best of 3: 16 us per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100000 loops, best of 3: 15.7 us per loop

>> a = np.random.rand(1000,1000)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 8.57 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 5.02 ms per loop

>> a = np.random.rand(10000,100)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 3.52 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 3.35 ms per loop
Slower:
>> a = np.random.rand(100,10000)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 2.22 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 2.57 ms per loop
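As an aside, the memory-layout claim in the opening paragraph is easy to check directly: for a C-ordered float64 array, stepping along the last axis moves one itemsize (8 bytes), while stepping along axis=0 jumps a whole row. A quick sketch (the shape is arbitrary):

import numpy as np

a = np.zeros((1000, 1000))  # C-ordered float64
# strides = bytes moved in memory per unit step along each axis
print(a.strides)  # (8000, 8): axis=0 steps a whole 8000-byte row,
                  # axis=-1 steps a single 8-byte element

So summing along axis=-1 walks memory contiguously, while a naive sum along axis=0 touches one element per row per pass.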
Code (Bottleneck simplified BSD license):

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):
    "Mean of 2d array with dtype=float64 along axis=0 ignoring NaNs."
    cdef int count = 0
    cdef np.float64_t asum = 0, ai
    cdef Py_ssize_t i0, i1
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
                                                            NPY_float64, 0)
    for i1 in range(n1):
        asum = 0
        count = 0
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:  # false only for NaN
                asum += ai
                count += 1
        if count > 0:
            y[i1] = asum / count
        else:
            y[i1] = NAN
    return y

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0_double(np.ndarray[np.float64_t, ndim=2] a):
    "Mean of 2d array with dtype=float64 along axis=0 ignoring NaNs."
    cdef int count = 0, count2 = 0
    cdef np.float64_t asum = 0, asum2 = 0, ai
    cdef Py_ssize_t i0, i1, i11
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
                                                            NPY_float64, 0)
    # Note: assumes n1 is even; an odd trailing column would need a
    # separate cleanup pass.
    for i1 in range(0, n1, 2):
        asum = 0
        count = 0
        asum2 = 0
        count2 = 0
        i11 = i1 + 1
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:
                asum += ai
                count += 1
            ai = a[i0, i11]
            if ai == ai:
                asum2 += ai
                count2 += 1
        if count > 0:
            y[i1] = asum / count
        else:
            y[i1] = NAN
        if count2 > 0:
            y[i11] = asum2 / count2  # originally asum / count, which used
                                     # the first column's accumulators
        else:
            y[i11] = NAN
    return y
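For anyone who wants to experiment with the unrolling idea without compiling Cython, here is a pure-Python/NumPy sketch of the same two loops (function names are mine, not Bottleneck's; the second column uses its own asum2/count2 accumulators, which the posted Cython gets wrong):

import numpy as np

def nanmean_axis0(a):
    """One column at a time, mirroring nanmean_2d_float64_axis0."""
    n0, n1 = a.shape
    y = np.empty(n1)
    for i1 in range(n1):
        asum = 0.0
        count = 0
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:  # NaN is the only value unequal to itself
                asum += ai
                count += 1
        y[i1] = asum / count if count > 0 else np.nan
    return y

def nanmean_axis0_double(a):
    """Two columns per pass over the rows; assumes n1 is even."""
    n0, n1 = a.shape
    y = np.empty(n1)
    for i1 in range(0, n1, 2):
        i11 = i1 + 1
        asum = asum2 = 0.0
        count = count2 = 0
        for i0 in range(n0):
            ai = a[i0, i1]
            if ai == ai:
                asum += ai
                count += 1
            ai = a[i0, i11]
            if ai == ai:
                asum2 += ai
                count2 += 1
        y[i1] = asum / count if count > 0 else np.nan
        y[i11] = asum2 / count2 if count2 > 0 else np.nan
    return y

Both should agree with np.nanmean(a, axis=0) on any array with an even number of columns; the Python version is of course far too slow to show the cache effect itself.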
participants (1)
-
Keith Goodman