Re: [Numpy-discussion] Benchmark on record arrays

28 May 2009

On Wednesday 27 May 2009 17:31:20 Nicolas Rougier wrote:
...
> Hi,
> I've written a very simple benchmark on recarrays:
> import numpy, time
> Z = numpy.zeros((100,100), dtype=numpy.float64)
> Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> ('y',numpy.int32)])
> Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> ('y',numpy.bool)])
> t = time.clock()
> for i in range(10000): Z*Z
> print time.clock()-t
> t = time.clock()
> for i in range(10000): Z_fast['x']*Z_fast['x']
> print time.clock()-t
> t = time.clock()
> for i in range(10000): Z_slow['x']*Z_slow['x']
> print time.clock()-t
> And got the following results:
> 0.23
> 0.37
> 3.96
> Am I right in thinking that the last case is quite slow because of some
> memory misalignment between float64 and bool or is there some machinery
> behind that makes things slow in this case? Should this be mentioned
> somewhere in the recarray documentation?

Yes, I can reproduce your results, and I must admit that a 10x slowdown is a 
lot.  However, I think that this mostly affects small record arrays (i.e. 
those that fit in the CPU cache), so it shows up mainly in benchmarks 
(precisely because they fit well in cache).  You can simulate a more 
realistic scenario by defining a large recarray that does not fit in the 
CPU's cache.  For example:

In [17]: Z = np.zeros((1000,1000), dtype=np.float64)  # 8 MB object

In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64),
('y',np.int64)])   # 16 MB object

In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64),
('y',np.bool)])  # 9 MB object

In [20]: x_fast = Z_fast['x']
In [21]: timeit x_fast * x_fast
100 loops, best of 3: 5.48 ms per loop

In [22]: x_slow = Z_slow['x']

In [23]: timeit x_slow * x_slow
100 loops, best of 3: 14.4 ms per loop

So, the slowdown is less than 3x, which is a more reasonable figure.  If you 
need optimal speed for operating on unaligned columns, you can use numexpr.  
Here is an example of what you can expect from it:

In [24]: import numexpr as nx

In [25]: timeit nx.evaluate('x_slow * x_slow')
100 loops, best of 3: 11.1 ms per loop

So, the slowdown is just 2x instead of 3x, which is near optimal for the 
unaligned case.
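
For anyone who wants to check whether a given column really is unaligned, 
here is a minimal sketch.  It rebuilds the same dtypes as in the session 
above and prints the record size, the strides of the 'x' view and NumPy's 
own alignment flag.  The third dtype, built with align=True, is my addition 
for comparison only and was not part of the timings above; it asks NumPy to 
pad the record like a C struct, trading memory for alignment:

```python
import numpy as np

# Same record layouts as in the session above.
dt_fast = np.dtype([('x', np.float64), ('y', np.int64)])
dt_slow = np.dtype([('x', np.float64), ('y', np.bool_)])
# Assumption for illustration (not in the original timings):
# align=True pads each record the way a C compiler would.
dt_padded = np.dtype([('x', np.float64), ('y', np.bool_)], align=True)

Z_fast = np.zeros((1000, 1000), dtype=dt_fast)
Z_slow = np.zeros((1000, 1000), dtype=dt_slow)
Z_padded = np.zeros((1000, 1000), dtype=dt_padded)

for name, Z in [('fast', Z_fast), ('slow', Z_slow), ('padded', Z_padded)]:
    x = Z['x']  # a strided view on the 'x' column, no copy is made
    # itemsize: bytes per record; strides: bytes between consecutive 'x' values
    print(name, Z.dtype.itemsize, x.strides, x.flags.aligned)
```

With a 9-byte record, consecutive 'x' values cannot all start on an 8-byte 
boundary, which is why the view of the slow array reports itself as 
unaligned.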

Numexpr also seems to help for small recarrays that fit in cache (i.e. for 
benchmarking purposes ;) :

# Create a 160 KB object
In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])
# Create a 90 KB object
In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])

In [28]: x_fast = Z_fast['x']

In [29]: timeit x_fast * x_fast
10000 loops, best of 3: 20.7 µs per loop

In [30]: x_slow = Z_slow['x']

In [31]: timeit x_slow * x_slow
10000 loops, best of 3: 149 µs per loop

In [32]: timeit nx.evaluate('x_slow * x_slow')
10000 loops, best of 3: 45.3 µs per loop
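
The timings above come from an interactive IPython session.  If you prefer 
a standalone script, something along these lines should give comparable 
measurements on a current Python 3 and NumPy.  This is only a sketch: the 
bench() helper and its parameters are names I made up for illustration, and 
the numexpr case is skipped when numexpr is not installed.  The operand is 
passed to numexpr through local_dict explicitly, because inside a lambda it 
would not be found in the calling frame:

```python
import timeit

import numpy as np

try:
    import numexpr as nx  # optional, only used for the third case
except ImportError:
    nx = None


def bench(shape, number=20, repeat=3):
    """Return the best time per loop, in seconds, for each case (hypothetical helper)."""
    Z_fast = np.zeros(shape, dtype=[('x', np.float64), ('y', np.int64)])
    Z_slow = np.zeros(shape, dtype=[('x', np.float64), ('y', np.bool_)])
    x_fast = Z_fast['x']
    x_slow = Z_slow['x']

    cases = {
        'aligned': lambda: x_fast * x_fast,
        'unaligned': lambda: x_slow * x_slow,
    }
    if nx is not None:
        # Pass the operand explicitly instead of relying on frame lookup.
        cases['unaligned (numexpr)'] = lambda: nx.evaluate(
            'x * x', local_dict={'x': x_slow})

    results = {}
    for label, fn in cases.items():
        # Best of `repeat` runs, each executing the statement `number` times.
        best = min(timeit.repeat(fn, number=number, repeat=repeat))
        results[label] = best / number
    return results


if __name__ == '__main__':
    for shape in [(100, 100), (1000, 1000)]:
        for label, secs in bench(shape).items():
            print(shape, label, '%.1f us per loop' % (secs * 1e6))
```

The absolute numbers will of course differ from the ones quoted in this 
thread, since they depend on the machine and on the NumPy version.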

Hope that helps,

-- 
Francesc Alted