[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Travis E. Oliphant
oliphant at enthought.com
Thu Nov 8 18:16:12 EST 2007
Hans Meine wrote:
> Hi!
>
> I wonder why simple elementwise operations like "a * 2" or "a + 1" are not
> performed in order of increasing memory addresses in order to exploit CPU
> caches etc.
C-order is "special" in NumPy for historical reasons. I agree that it
doesn't need to be, and we have taken significant steps in that
direction. Right now, the fundamental problem is probably that the
output arrays are created as C-order arrays even when the input is a
Fortran-order array. Once that is changed, we can make Fortran-order
inputs use the simplest path through the code.
Fortran-order arrays can be preserved, but it takes a little extra work
because backward-compatibility expectations had to be met. See, for
example, the order argument to the copy method of arrays.
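A quick way to check which layout an elementwise result gets on a given NumPy build is to inspect the flags of the output. This is only a sketch: the helper name layout is made up for illustration, and the printed result depends on the NumPy version (the behaviour described above is the one from the time of this message).

```python
import numpy as np

def layout(arr):
    """Return a short label for the memory layout of arr (hypothetical helper)."""
    c, f = arr.flags.c_contiguous, arr.flags.f_contiguous
    if c and f:
        return "both"
    if c:
        return "C"
    if f:
        return "F"
    return "neither"

a = np.zeros((3, 4, 5), order="F")   # Fortran-order input
b = a * 2                            # elementwise operation

print("input :", layout(a))
print("output:", layout(b))          # version-dependent, so no expected value given
```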
> - as it is now, their speed drops by a factor of around 3 simply
> by transpose()ing. Similarly (but even less logical), copy() and even the
> constructor are affected (yes, I understand that copy() creates contiguous
> arrays, but shouldn't it respect/retain the order nevertheless?):
>
As mentioned, copy() can preserve the order if you pass the 'order' argument:
a.copy('A')
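As a small illustration of the difference (the array shape and variable names are arbitrary choices for this sketch), the default copy() produces a C-order array, while copy('A') keeps Fortran order when the source is Fortran-contiguous:

```python
import numpy as np

f = np.zeros((3, 4, 5), order="F")   # Fortran-contiguous source array

default_copy = f.copy()              # order defaults to 'C'
preserving_copy = f.copy("A")        # 'A': Fortran order if the source is Fortran-contiguous

print(default_copy.flags.c_contiguous, default_copy.flags.f_contiguous)
print(preserving_copy.flags.c_contiguous, preserving_copy.flags.f_contiguous)
```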
> ### constructor ###
> In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
> 1000000 loops, best of 10: 1.19 µs per loop
>
> In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
> 1000000 loops, best of 10: 2.19 µs per loop
>
>
I bet what you are seeing here is simply the overhead of processing the
order argument.
Try the first one with order='c' to see what I mean.
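The suggested comparison can be sketched outside IPython with the standard timeit module. The helper best_time and the loop counts are assumptions made for this example; absolute numbers will vary by machine and NumPy version, so none are shown.

```python
import timeit
import numpy as np

def best_time(stmt, number=100_000, repeat=5):
    """Best per-call time in seconds over `repeat` runs (hypothetical helper)."""
    runs = timeit.repeat(stmt, number=number, repeat=repeat, globals={"np": np})
    return min(runs) / number

statements = {
    "no order argument": "np.ndarray((3, 3, 3))",
    "order='C'": "np.ndarray((3, 3, 3), order='C')",
    "order='F'": "np.ndarray((3, 3, 3), order='F')",
}

timings = {label: best_time(stmt) for label, stmt in statements.items()}

for label, seconds in timings.items():
    print(f"{label:>18}: {seconds * 1e6:.3f} us per call")
```

If the order='C' case comes out close to the order='F' case rather than to the call without the argument, the gap is mostly argument-processing overhead rather than a layout effect.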
> ### copy 3x3x3 array ###
> In [85]: a = numpy.ndarray((3,3,3))
>
> In [86]: %timeit -r 10 a.copy()
> 1000000 loops, best of 10: 1.14 µs per loop
>
> In [87]: a = numpy.ndarray((3,3,3), order="f")
>
> In [88]: %timeit -r 10 -n 1000000 a.copy()
> 1000000 loops, best of 10: 3.39 µs per loop
>
Use order='A', i.e. a.copy('A'), to allow copying in Fortran order.
> ### copy 256x256x256 array ###
> In [74]: a = numpy.ndarray((256,256,256))
>
> In [75]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 119 ms per loop
>
> In [76]: a = numpy.ndarray((256,256,256), order="f")
>
> In [77]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 274 ms per loop
>
Same thing here. Nobody is opposed to having faster code as long as we
don't break existing code bases in the process. There is also the matter of
implementation....
-Travis O.