fast iteration (I think I've got it)
This is a c-api question. I'm trying to get iterators that are both fast and reasonably general. I did confirm that iterating using just the general PyArrayIterObject protocol is not as fast as using c-style pointers for contiguous arrays. Please confirm if my understanding is correct. There are 2 cases to consider. There are plain old dense arrays, which will be contiguous, and there are array 'views' or 'slices'. The views will not be contiguous. However, in all cases, the strides between elements in one dimension are constant. This last point is key. As long as the stride in the last dimension is a constant, a fast iterator is possible. I put together a small test, which seems to work for the cases I tried. The idea is to use the general iterator (via PyArray_IterAllButAxis), but iteration over the last dimension is done with a c-style pointer. I have tested it with these cases, and the results appear correct: a = array ((xrange(4),xrange(4,8),xrange(8,12)), int) b = a[:,::2] c = a[:,1::2] d = a[:,-1::-1] sum3(a) 0 1 2 3 4 5 6 7 8 9 10 11 sum3(b) 0 2 4 6 8 10 sum3(c) 1 3 5 7 9 11 sum3 (d) 3 2 1 0 7 6 5 4 11 10 9 8 template<typename T> inline T sum3 (object const& o) { PyArrayObject* arr = (PyArrayObject*)(o.ptr()); int last_dim = arr->nd - 1; PyArrayIterObject* it = (PyArrayIterObject*)(PyArray_IterAllButAxis (o.ptr(), &last_dim)); for (int i = 0 ; i < arr->nd; ++i) std::cout << arr->dimensions[i] << ' '; std::cout << '\n'; while (it->index < it->size) { T * dptr = (T*)(it->dataptr); const int stride = arr->strides[last_dim]; for (int i = 0; i < arr->dimensions[last_dim]; ++i) { std::cout << *dptr << ' '; dptr += stride/sizeof(T); } std::cout << '\n'; PyArray_ITER_NEXT(it); } return 0; }
On 01/01/2008, Neal Becker
This is a c-api question.
I'm trying to get iterators that are both fast and reasonably general. I did confirm that iterating using just the general PyArrayIterObject protocol is not as fast as using c-style pointers for contiguous arrays.
I'd like to point out that "contiguous" can be misleading as it is used in numpy. An array is flagged contiguous if all the elements are contiguous *and* they are arranged as a C-ordered multidimensional array - i.e., the last index changes the fastest as you walk through the array elements, and the next-to-last chenges the next fastest, and so on.
Please confirm if my understanding is correct. There are 2 cases to consider. There are plain old dense arrays, which will be contiguous, and there are array 'views' or 'slices'. The views will not be contiguous. However, in all cases, the strides between elements in one dimension are constant. This last point is key. As long as the stride in the last dimension is a constant, a fast iterator is possible.
I put together a small test, which seems to work for the cases I tried. The idea is to use the general iterator (via PyArray_IterAllButAxis), but iteration over the last dimension is done with a c-style pointer.
This is not an unreasonable approach, but it can suffer badly if the
last dimension is "small", for example if you have an array of shape
(1000,1000,2). Such arrays can arise for example from working with
complex numbers as pairs of real numbers.
Where is the acceleration coming from in this code? You can walk
through a multidimensional array in pure C, just incrementing pointers
as needed, without too much difficulty.
If you want to save bookkeeping overhead, you can partially flatten
the array: if your last dimension has stride 7 and length 11, and your
second-to-last dimension has stride 77, you can flatten (using
reshape, in python, to get a view) the last two dimensions of the
array and use a nice quick iterator on the combined dimension. For a
"C contiguous" array, this means you just walk through the whole array
with a single layer of looping.
I suspect that the place you will see big slowdowns is where data is
accessed in an order that doesn't make sense from a cache point of
view. If four-byte chunks of data are spaced every eight bytes, then
the processor spends twice as much time waiting for cache loads as if
they are spaced every four bytes. And most numpy tasks are almost
certainly memory-bound - all the ufuncs I can think of almost
certainly are (arctan of a double almost certainly takes less time
than loading and storing a double). If you walk through an array in
the wrong order - changing the big strides fastest and the small
strides slowest - you'll almost certainly have to shuffle the whole
array through cache eight times or more. This is even ignoring cache
misses. (To cut down on cache misses you can use the GCC prefetch
instructions, or god-knows-what equivalent in other compilers.)
I seem to recall that numpy already reorders its walks through arrays
within ufuncs - does it do so with cache in mind? For that matter, is
information on the CPU's caches available to numpy?
Anne
P.S. Here is code to flatten an array as much as possible without
changing the order of flatiter:
def flatteniter(A):
i=0
while i
Thank you for the response. I'm afraid I haven't explained what I'm doing.
I have a lot of c++ code written in a generic c++ interface style. (The
code is for signal processing, but that is irrelevant).
The code is generic for data types as well as container types.
To accomplish this, the interface uses the boost::range interface concept.
An example of using this interface:
template<typename in_t>
void F (in_t const& in) {
typename boost::range_const_iterator
Neal Becker wrote:
This is a c-api question.
I'm trying to get iterators that are both fast and reasonably general. I did confirm that iterating using just the general PyArrayIterObject protocol is not as fast as using c-style pointers for contiguous arrays.
Please confirm if my understanding is correct. There are 2 cases to consider. There are plain old dense arrays, which will be contiguous, and there are array 'views' or 'slices'. The views will not be contiguous. However, in all cases, the strides between elements in one dimension are constant. This last point is key. As long as the stride in the last dimension is a constant, a fast iterator is possible.
Your understanding is correct. You can use the iterator over all but one dimension. This is a common paradigm in NumPy (use an iterator over all but one dimension (usually chosen to be the one with the smallest striding) and then use strided access over the final dimension (see all the ufuncs for examples). IterAllButAxis takes a pointer to an integer. If that integer is negative on input, then IterAllButAxis will choose the axis for you to be the one with the smallest stride, then it will set the variable to the axis that was chosen. This is just what ufuncs do and is a pretty good idea generally to minimize the over-head of the final inner loop. -Travis O.
participants (3)
-
Anne Archibald
-
Neal Becker
-
Travis E. Oliphant