On Wednesday 27 May 2009 17:31:20, Nicolas Rougier wrote:
Hi,
I've written a very simple benchmark on recarrays:
import numpy, time
Z = numpy.zeros((100,100), dtype=numpy.float64)
Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64), ('y',numpy.int32)])
Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64), ('y',numpy.bool)])

t = time.clock()
for i in range(10000):
    Z*Z
print time.clock()-t

t = time.clock()
for i in range(10000):
    Z_fast['x']*Z_fast['x']
print time.clock()-t

t = time.clock()
for i in range(10000):
    Z_slow['x']*Z_slow['x']
print time.clock()-t
And got the following results:
0.23
0.37
3.96
Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool, or is there some machinery behind the scenes that makes things slow in this case? Should this be mentioned somewhere in the recarray documentation?
Yes, I can reproduce your results, and I must admit that a 10x slowdown is a lot. However, I think that this mostly affects small record arrays (i.e. those that fit in the CPU cache), and mainly in benchmarks (precisely because they fit well in cache). You can simulate a more real-life scenario by defining a large recarray that does not fit in the CPU's cache. For example:

In [17]: Z = np.zeros((1000,1000), dtype=np.float64)  # 8 MB object

In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64), ('y',np.int64)])  # 16 MB object

In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64), ('y',np.bool)])  # 9 MB object

In [20]: x_fast = Z_fast['x']

In [21]: timeit x_fast * x_fast
100 loops, best of 3: 5.48 ms per loop

In [22]: x_slow = Z_slow['x']

In [23]: timeit x_slow * x_slow
100 loops, best of 3: 14.4 ms per loop

So, the slowdown is less than 3x, which is a more reasonable figure. If you need optimal speed when operating on unaligned columns, you can use numexpr. Here is an example of what you can expect from it:

In [24]: import numexpr as nx

In [25]: timeit nx.evaluate('x_slow * x_slow')
100 loops, best of 3: 11.1 ms per loop

So, the slowdown is just 2x instead of 3x, which is near optimal for the unaligned case.

Numexpr also seems to help for small recarrays that fit in cache (i.e. for benchmarking purposes ;):

# Create a 160 KB object
In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])

# Create a 90 KB object
In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])

In [28]: x_fast = Z_fast['x']

In [29]: timeit x_fast * x_fast
10000 loops, best of 3: 20.7 µs per loop

In [30]: x_slow = Z_slow['x']

In [31]: timeit x_slow * x_slow
10000 loops, best of 3: 149 µs per loop

In [32]: timeit nx.evaluate('x_slow * x_slow')
10000 loops, best of 3: 45.3 µs per loop

Hope that helps,

--
Francesc Alted
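
For anyone who wants to check the alignment explanation directly, here is a minimal sketch. It is not part of the original exchange and assumes Python 3 with a recent NumPy, so it uses np.bool_ and the timeit module instead of numpy.bool and time.clock. It inspects the strides and the flags.aligned attribute of the 'x' column, and compares the packed layout from the thread with two possible workarounds: a dtype built with align=True, and a contiguous copy of the column. The variable names are illustrative only, and timings will differ between machines and NumPy versions.

```python
import timeit
import numpy as np

# Packed layout from the thread: float64 (8 bytes) + bool (1 byte) per record.
packed = np.dtype([('x', np.float64), ('y', np.bool_)])
# Same fields, but align=True asks NumPy to pad them like a C struct would be.
padded = np.dtype([('x', np.float64), ('y', np.bool_)], align=True)

Z_packed = np.zeros((100, 100), dtype=packed)
Z_padded = np.zeros((100, 100), dtype=padded)

x_packed = Z_packed['x']                 # view into the packed records
x_padded = Z_padded['x']                 # view into the padded records
x_copy = np.ascontiguousarray(x_packed)  # plain float64 copy of the column

# Show the record size, the strides of the column view, and the aligned flag.
for name, dt, col in [('packed', packed, x_packed),
                      ('padded', padded, x_padded)]:
    print(name, 'itemsize:', dt.itemsize,
          'strides:', col.strides,
          'aligned:', col.flags.aligned)

# Rough timing of the multiplication for each variant (seconds for 1000 runs).
for name, col in [('packed view', x_packed),
                  ('padded view', x_padded),
                  ('contiguous copy', x_copy)]:
    seconds = timeit.timeit(lambda c=col: c * c, number=1000)
    print(name, seconds)
```

The padded dtype trades memory for alignment, while the contiguous copy pays a one-off copying cost, so which one is worthwhile (if either) depends on how often the column is reused.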