[Numpy-discussion] fast iteration (I think I've got it)

Anne Archibald peridot.faceted at gmail.com
Tue Jan 1 15:34:06 EST 2008


On 01/01/2008, Neal Becker <ndbecker2 at gmail.com> wrote:
> This is a c-api question.
>
> I'm trying to get iterators that are both fast and reasonably general.  I
> did confirm that iterating using just the general PyArrayIterObject
> protocol is not as fast as using c-style pointers for contiguous arrays.

I'd like to point out that "contiguous" can be misleading as it is
used in numpy. An array is flagged contiguous if all the elements are
contiguous *and* they are arranged as a C-ordered multidimensional
array - i.e., the last index changes the fastest as you walk through
the array elements, and the next-to-last changes the next fastest, and
so on.
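
A transposed array, for instance, comes out flagged non-contiguous
even though its elements fill the memory block; a quick python
illustration (just looking at the flags):

import numpy as np

a = np.arange(6).reshape(2, 3)   # C-ordered: last index varies fastest
print(a.flags['C_CONTIGUOUS'])   # True

b = a.T                          # same block of memory, Fortran order
print(b.flags['C_CONTIGUOUS'])   # False - all the elements are there,
                                 # back to back, but not in C order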

> Please confirm if my understanding is correct.  There are 2 cases to
> consider.  There are plain old dense arrays, which will be contiguous, and
> there are array 'views' or 'slices'.  The views will not be contiguous.
> However, in all cases, the strides between elements in one dimension are
> constant.  This last point is key.  As long as the stride in the last
> dimension is a constant, a fast iterator is possible.
>
> I put together a small test, which seems to work for the cases I tried.
> The idea is to use the general iterator (via PyArray_IterAllButAxis), but
> iteration over the last dimension is done with a c-style pointer.

This is not an unreasonable approach, but it can suffer badly if the
last dimension is "small", for example if you have an array of shape
(1000,1000,2). Such arrays can arise for example from working with
complex numbers as pairs of real numbers.
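
(In python terms I'm thinking of something like

import numpy as np

z = np.zeros((1000, 1000), dtype=np.complex128)
pairs = z.view(np.float64).reshape(1000, 1000, 2)   # inner axis of length 2

where each pass through the length-2 inner loop buys you almost
nothing before the per-row setup cost comes due again.)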

Where is the acceleration coming from in this code? You can walk
through a multidimensional array in pure C, just incrementing pointers
as needed, without too much difficulty.

If you want to save bookkeeping overhead, you can partially flatten
the array: if your last dimension has stride 7 and length 11, and your
second-to-last dimension has stride 77, you can flatten (using
reshape, in python, to get a view) the last two dimensions of the
array and use a nice quick iterator on the combined dimension. For a
"C contiguous" array, this means you just walk through the whole array
with a single layer of looping.
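
In python that partial flattening is just a reshape, and when the
axes being merged are jointly contiguous it should come back as a
view rather than a copy:

import numpy as np

a = np.empty((100, 11, 7))        # C contiguous
b = a.reshape(100, 77)            # last two axes merged into one
print(np.shares_memory(a, b))     # True - it is a view, no copy

c = a[::2]                        # outer axis strided, inner two untouched
d = c.reshape(50, 77)             # still works, and still a view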

I suspect that the place you will see big slowdowns is where data is
accessed in an order that doesn't make sense from a cache point of
view. If four-byte chunks of data are spaced every eight bytes, then
the processor spends twice as much time waiting for cache loads as if
they were spaced every four bytes. And most numpy tasks are almost
certainly memory-bound - all the ufuncs I can think of are (computing
the arctan of a double takes less time than loading and storing that
double). If you walk through an array in
the wrong order - changing the big strides fastest and the small
strides slowest - you'll almost certainly have to shuffle the whole
array through cache eight times or more. This is even ignoring cache
misses. (To cut down on cache misses you can use the GCC prefetch
instructions, or god-knows-what equivalent in other compilers.)
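
You can get a feel for this from python alone, e.g. by timing a ufunc
over strided versus packed data (the actual numbers will depend on
your cache and memory system, of course):

import numpy as np
import timeit

a = np.random.rand(4_000_000)

strided = a[::2]          # every other double: elements 16 bytes apart
packed  = strided.copy()  # the same values, packed contiguously

# Same arithmetic per element, different memory traffic.
print(timeit.timeit(lambda: np.arctan(strided), number=50))
print(timeit.timeit(lambda: np.arctan(packed),  number=50))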

I seem to recall that numpy already reorders its walks through arrays
within ufuncs - does it do so with cache in mind? For that matter, is
information on the CPU's caches available to numpy?

Anne

P.S. Here is code to flatten an array as much as possible without
changing the order of flatiter:

def flatteniter(A):
    # Merge adjacent axes wherever the outer stride equals the inner
    # axis's length times its stride, i.e. wherever the two axes are
    # jointly contiguous, so the reshape can be done as a view.
    i = 0
    while i < len(A.shape) - 1:
        if A.shape[i+1] * A.strides[i+1] == A.strides[i]:
            newshape = A.shape[:i] + (A.shape[i]*A.shape[i+1],) + A.shape[i+2:]
            A = A.reshape(newshape)
        else:
            i += 1
    return A
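
For example, on a strided array it merges whatever trailing axes it
can (if I have the strides right, the last two here):

import numpy as np

a = np.zeros((6, 5, 4))[::2]     # shape (3, 5, 4), outer axis strided
b = flatteniter(a)
print(b.shape)                   # (3, 20)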

This is a quick hack, as you can see I was slack about docstrings and
whatnot. But routinely running it over every array before applying
flatiter wouldn't hurt anything (though for all I know flatiter may do
this already). -A


