
On 08/11/2007, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no? For example, I can understand a 50% or 100% time increase for simple operations (by simple, I mean one-element operations taking only a few CPU cycles) because of copies, but a five-fold time increase seems too big, no (maybe a cache problem, though)? Also, the fact that mean is slower than min/max in both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow?).
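(As a rough illustration, not part of the original message, something along these lines reproduces the kind of comparison being discussed; the array size and repeat count are arbitrary, and whether a gap shows up depends on the NumPy version:)

import numpy as np
from timeit import timeit

# Same data in C (row-major) and Fortran (column-major) order.
n = 2000
c = np.random.rand(n, n)
f = np.asfortranarray(c)

for name, func in [("mean", np.mean), ("min", np.min), ("max", np.max)]:
    tc = timeit(lambda: func(c), number=10)
    tf = timeit(lambda: func(f), number=10)
    print("%s: C order %.3f s, Fortran order %.3f s" % (name, tc, tf))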
I have no doubt at all that cache effects are involved: for an int array, each data element is four bytes, but a typical CPU loads 64 bytes at a time into cache. If you read (or write) the rest of those bytes on the next iterations through the loop, the (large) cost of a memory read is amortized. If instead you jump to the next row of the array, some large number of bytes away, those 64 bytes basically have to be purged to make room for another 64 bytes, of which you'll use four.

If you're reading from a FORTRAN-order array and writing to a C-order one, there's no way around this on one end or the other: you're effectively doing a transpose, which is pretty much always expensive.

Is there any reason not to let ufuncs pick whichever order they want for newly-allocated arrays? The natural choice would be the same order as the bigger input array, if it's a contiguous block of memory (whether or not the contiguous flags are set). Failing that, the same as the other input array (if any); failing that, C order is as good a default as any.

How difficult would this be to implement?

Anne
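(A minimal sketch, mine and not anything in NumPy's source, of the allocation rule proposed above; pick_output_order is a hypothetical helper, and only the public flags/empty/ufunc-out API is used:)

import numpy as np

# Proposed rule: take the order of the bigger input if it is a contiguous
# block, then the other input's order, else fall back to C order.
# pick_output_order is a hypothetical helper, not a NumPy function.
def pick_output_order(*inputs):
    for arr in sorted(inputs, key=lambda x: x.size, reverse=True):
        if arr.flags['F_CONTIGUOUS'] and not arr.flags['C_CONTIGUOUS']:
            return 'F'
        if arr.flags['C_CONTIGUOUS']:
            return 'C'
    return 'C'   # nothing contiguous: default to C order

a = np.asfortranarray(np.random.rand(1000, 1000))   # big Fortran-ordered input
b = np.random.rand(1000)                             # small C-ordered input
out = np.empty(a.shape, order=pick_output_order(a, b))
np.add(a, b, out=out)                # result keeps the big input's order
print(out.flags['F_CONTIGUOUS'])     # True

The sketch only shows the decision rule; whether the real ufunc machinery can allocate its output this way cheaply is the question above.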