[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

David Cournapeau david at ar.media.kyoto-u.ac.jp
Fri Nov 9 01:30:45 EST 2007

Anne Archibald wrote:
> On 08/11/2007, David Cournapeau <david at ar.media.kyoto-u.ac.jp> wrote:
>> For copy and array creation, I understand this, but for element-wise
>> operations (mean, min, and max), this is not enough to explain the
>> difference, no ? For example, I can understand a 50 % or 100 % time
>> increase for simple operations (by simple, I mean one element operation
>> taking only a few CPU cycles), because of copies, but a 5 fold time
>> increase seems too big, no (mayb a cache problem, though) ? Also, the
>> fact that mean is slower than min/max for both cases (F vs C) seems a
>> bit counterintuitive (maybe cache effects are involved somehow ?).
> I have no doubt at all that cache effects are involved: for an int
> array, each data element is four bytes, but typical CPUs need to load
> 64 bytes at a time into cache. If you read (or write) the rest of
> those bytes in the next iterations through the loop, the (large) cost
> of a memory read is amortized. If you jump to the next row of the
> array, some large number of bytes away, those 64 bytes basically need
> to be purged to make room for another 64 bytes, of which you'll use 4.
> If you're reading from a FORTRAN-order array and writing to a C-order
> one, there's no way around doing this on one end or another: you're
> effectively doing a transpose, which is pretty much always expensive.
This is obviously part of the picture, and there is indeed no way around 
the simple fact that on sequential memory access is extremely slow on 
modern hardware. The fact that the array iterator does not support (yet) 
the Fortran order would largely explain this (I wrongly assumed the 
iterator already did that).
> Is there any reason not to let ufuncs pick whichever order for
> newly-allocated arrays they want? The natural choice would be the same
> as the bigger input array, if it's a contiguous block of memory
> (whether or not the contiguous flags are set). Failing that, the same
> as the other input array (if any); failing that, C order's as good a
> default as any. How difficult would this be to implement?
I think part of the problem is that it is difficult to know which is 
faster a priori (is copying multi-segmented parts in a single buffer 
faster for processing faster than direct processing ?). As mentioned 
previously, on the same platform (same OS/compiler combination), there 
are already huge differences between two CPU (pentium4 vs pentium M, the 
former being noticeably pig-slow when SSE FPU is not used).

Does ufunc use buffer for segmented memory, or do they only use them for 
misbehaved arrays ? Also, from my very limited experience, I found array 
iterators to be significantly slower than simple array indexing on 
contiguous, single segment arrays. Do ufunc always use array iterator, 
or do they use simple array indexing when possible ? When I implemented 
the fast clip function (which does not use ufunc), I noticed that the 
following matter:
    - avoiding unnecessary copies.
    - quickly determine which cases can be optimized wrt to the 
operations you are using (memmove vs memcpy, etc...)
    - fast memory copying (numpy do not have those yet: by using 
optimized memory copying, you can often gain a 50 % speed increase on 
recent archs; also, having information about alignment could help, too, 
and we go back to aligned allocators discussion :) ).

The problem is to find a good API to put those optimizations at one 
place. I unfortunately do not know much about ufunc, maybe they already 
implement most of those.



More information about the NumPy-Discussion mailing list