[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

Thu Nov 8 18:16:12 EST 2007

Hans Meine wrote:
> Hi!
>
> I wonder why simple elementwise operations like "a * 2" or "a + 1" are not 
> performed in order of increasing memory addresses in order to exploit CPU 
> caches etc.
C-order is "special" in NumPy for historical reasons.  I agree that it 
doesn't need to be, and we have taken significant steps in that 
direction.  Right now, the fundamental problem is probably that the 
output arrays are created as C-order arrays even when the input is a 
Fortran-order array.  Once that is changed, we can make Fortran-order 
inputs use the simplest path through the code.
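
A quick way to check which layout an elementwise result gets is to look 
at the flags of the output array.  This is only a sketch: the array 
contents and variable names are made up for illustration, and the layout 
reported for the result depends on the NumPy version in use, so no 
particular output is claimed here.

```python
import numpy as np

# Illustrative Fortran-ordered input (values are arbitrary).
a = np.asfortranarray(np.arange(27, dtype=float).reshape(3, 3, 3))

# Elementwise operation; the memory layout of the result is what we inspect.
b = a * 2

# Which layout the result has depends on the NumPy version in use.
print("input : F =", a.flags['F_CONTIGUOUS'], " C =", a.flags['C_CONTIGUOUS'])
print("output: F =", b.flags['F_CONTIGUOUS'], " C =", b.flags['C_CONTIGUOUS'])
```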

Fortran-order arrays can be preserved, but it takes a little extra work 
because backward-compatibility expectations have to be met.  See, for 
example, the order argument to the copy method of arrays.

>  - as it is now, their speed drops by a factor of around 3 simply 
> by transpose()ing.  Similarly (but even less logical), copy() and even the 
> constructor are affected (yes, I understand that copy() creates contiguous 
> arrays, but shouldn't it respect/retain the order nevertheless?):
>   
As mentioned, copy() can preserve the order via its 'order' argument:

a.copy('A')
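
To make the difference concrete, here is a small sketch comparing the 
default copy with the order-preserving one.  The variable names are 
only illustrative; the flags are printed so you can check the layout 
on your own installation.

```python
import numpy as np

a = np.zeros((3, 3, 3), order='F')   # Fortran-ordered input

default_copy = a.copy()      # default order is 'C', so the layout changes
kept_copy = a.copy('A')      # 'A': Fortran order if the input is Fortran-contiguous

print(default_copy.flags['C_CONTIGUOUS'], default_copy.flags['F_CONTIGUOUS'])
print(kept_copy.flags['C_CONTIGUOUS'], kept_copy.flags['F_CONTIGUOUS'])
```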

> ### constructor ###
> In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
> 1000000 loops, best of 10: 1.19 µs per loop
>
> In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
> 1000000 loops, best of 10: 2.19 µs per loop
>
>   

I bet what you are seeing here is simply the overhead of processing the 
order argument. 

Try the first one with order='c'  to see what I mean.
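
Outside of IPython, the same comparison can be sketched with the 
standard timeit module.  The helper name time_ctor and the repeat 
counts below are arbitrary choices for illustration, and the numbers 
you get will depend on your machine and NumPy version.

```python
import timeit
import numpy as np

def time_ctor(**kwargs):
    """Best-of-5 time per call (seconds) for constructing a 3x3x3 ndarray."""
    n = 20000
    runs = timeit.repeat(lambda: np.ndarray((3, 3, 3), **kwargs),
                         number=n, repeat=5)
    return min(runs) / n

# If the guess above is right, an explicit order='C' should cost about as
# much as order='F', and both somewhat more than passing no order at all.
results = {
    "no order argument": time_ctor(),
    "order='C'": time_ctor(order='C'),
    "order='F'": time_ctor(order='F'),
}
for label, seconds in results.items():
    print(f"{label:>18}: {seconds * 1e6:.3f} us per call")
```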

> ### copy 3x3x3 array ###
> In [85]: a = numpy.ndarray((3,3,3))
>
> In [86]: %timeit -r 10 a.copy()
> 1000000 loops, best of 10: 1.14 µs per loop
>
> In [87]: a = numpy.ndarray((3,3,3), order="f")
>
> In [88]: %timeit -r 10 -n 1000000 a.copy()
> 1000000 loops, best of 10: 3.39 µs per loop
>   

Pass 'A' as the order argument, i.e. a.copy('A'), to allow copying in 
Fortran order.
> ### copy 256x256x256 array ###
> In [74]: a = numpy.ndarray((256,256,256))
>
> In [75]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 119 ms per loop
>
> In [76]: a = numpy.ndarray((256,256,256), order="f")
>
> In [77]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 274 ms per loop
>   
Same thing here.  Nobody is opposed to faster code, as long as we don't 
break existing code bases in the process.  There is also the matter of 
implementation....

-Travis O.