Hi!
I wonder why simple elementwise operations like "a * 2" or "a + 1" are not
performed in order of increasing memory addresses in order to exploit CPU
caches etc. As it is now, their speed drops by a factor of around 3 simply
by transpose()ing. Similarly (but even less logically), copy() and even the
constructor are affected (yes, I understand that copy() creates contiguous
arrays, but shouldn't it respect/retain the order nevertheless?):
### constructor ###
In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
1000000 loops, best of 10: 1.19 µs per loop
In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
1000000 loops, best of 10: 2.19 µs per loop
### copy 3x3x3 array ###
In [85]: a = numpy.ndarray((3,3,3))
In [86]: %timeit -r 10 a.copy()
1000000 loops, best of 10: 1.14 µs per loop
In [87]: a = numpy.ndarray((3,3,3), order="f")
In [88]: %timeit -r 10 -n 1000000 a.copy()
1000000 loops, best of 10: 3.39 µs per loop
### copy 256x256x256 array ###
In [74]: a = numpy.ndarray((256,256,256))
In [75]: %timeit -r 10 a.copy()
10 loops, best of 10: 119 ms per loop
In [76]: a = numpy.ndarray((256,256,256), order="f")
In [77]: %timeit -r 10 a.copy()
10 loops, best of 10: 274 ms per loop
### fill ###
In [79]: a = numpy.ndarray((256,256,256))
In [80]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop
In [81]: a = numpy.ndarray((256,256,256), order="f")
In [82]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop
### power ###
In [151]: a = numpy.ndarray((256,256,256))
In [152]: %timeit -r 10 a ** 2
10 loops, best of 10: 124 ms per loop
In [153]: a = numpy.asfortranarray(a)
In [154]: %timeit -r 10 a ** 2
10 loops, best of 10: 458 ms per loop
### addition ###
In [160]: a = numpy.ndarray((256,256,256))
In [161]: %timeit -r 10 a + 1
10 loops, best of 10: 139 ms per loop
In [162]: a = numpy.asfortranarray(a)
In [163]: %timeit -r 10 a + 1
10 loops, best of 10: 465 ms per loop
### fft ###
In [146]: %timeit -r 10 numpy.fft.fft(vol, axis=0)
10 loops, best of 10: 1.16 s per loop
In [148]: %timeit -r 10 numpy.fft.fft(vol0, axis=2)
10 loops, best of 10: 1.16 s per loop
In [149]: vol.flags
Out[149]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [150]: vol0.flags
Out[150]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [9]: %timeit -r 10 numpy.fft.fft(vol0, axis=0)
10 loops, best of 10: 939 ms per loop
### mean ###
In [173]: %timeit -r 10 vol.mean()
10 loops, best of 10: 272 ms per loop
In [174]: %timeit -r 10 vol0.mean()
10 loops, best of 10: 683 ms per loop
### max ###
In [175]: %timeit -r 10 vol.max()
10 loops, best of 10: 63.8 ms per loop
In [176]: %timeit -r 10 vol0.max()
10 loops, best of 10: 475 ms per loop
### min ###
In [177]: %timeit -r 10 vol.min()
10 loops, best of 10: 63.8 ms per loop
In [178]: %timeit -r 10 vol0.min()
10 loops, best of 10: 476 ms per loop
### rot90 ###
In [10]: %timeit -r 10 numpy.rot90(vol)
100000 loops, best of 10: 6.97 µs per loop
In [12]: %timeit -r 10 numpy.rot90(vol0)
100000 loops, best of 10: 6.92 µs per loop
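
A minimal sketch of the "increasing memory addresses" idea for the elementwise case, assuming the input is exactly Fortran-contiguous. The helper name elementwise_in_memory_order is hypothetical and not part of NumPy; it only relies on the fact that the transpose of an F-contiguous array is a C-contiguous view of the same memory. The copy(order="K") call assumes a NumPy version whose copy() accepts that value. Whether a given NumPy release already iterates in memory order internally is not checked here.

```python
import numpy as np


def elementwise_in_memory_order(func, a):
    """Apply an elementwise function to `a`, walking memory in address order.

    Hypothetical helper (not a NumPy API). For an F-contiguous array the
    transposed view is C-contiguous, so `func` sees a plain C-ordered array;
    transposing the result restores the original axis order and layout.
    """
    if a.flags.f_contiguous and not a.flags.c_contiguous:
        return func(a.T).T
    return func(a)


# Small F-ordered test array (the timings above use 256x256x256).
a = np.asfortranarray(np.arange(24, dtype=float).reshape(2, 3, 4))

# Same values as the direct expression, and the result is F-contiguous.
b = elementwise_in_memory_order(lambda x: x + 1, a)
same_values = np.array_equal(b, a + 1)
keeps_layout = b.flags.f_contiguous

# A copy that retains the memory layout instead of forcing C order.
c = a.copy(order="K")
copy_keeps_layout = c.flags.f_contiguous
```

The same trick applies to whole-array reductions such as max() or min(), since the result does not depend on axis order: a.T.max() equals a.max(). Reductions along a specific axis would need the axis index remapped as well.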
--
Ciao, / /
/--/
/ / ANS