[Numpy-discussion] Benchmark on record arrays
Nicolas Rougier
Nicolas.Rougier at loria.fr
Fri May 29 07:53:12 EDT 2009
Thanks for the clear answer; it definitely helps.
Nicolas
On Thu, 2009-05-28 at 19:25 +0200, Francesc Alted wrote:
> > On Wednesday 27 May 2009 17:31:20, Nicolas Rougier wrote:
> > Hi,
> >
> > I've written a very simple benchmark on recarrays:
> >
> > import numpy, time
> >
> > Z = numpy.zeros((100,100), dtype=numpy.float64)
> > Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> > ('y',numpy.int32)])
> > Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> > ('y',numpy.bool)])
> >
> > t = time.clock()
> > for i in range(10000): Z*Z
> > print time.clock()-t
> >
> > t = time.clock()
> > for i in range(10000): Z_fast['x']*Z_fast['x']
> > print time.clock()-t
> >
> > t = time.clock()
> > for i in range(10000): Z_slow['x']*Z_slow['x']
> > print time.clock()-t
> >
> >
> > And got the following results:
> > 0.23
> > 0.37
> > 3.96
> >
> > Am I right in thinking that the last case is quite slow because of some
> > memory misalignment between float64 and bool, or is there some machinery
> > behind the scenes that makes things slow in this case? Should this be
> > mentioned somewhere in the recarray documentation?
>
> Yes, I can reproduce your results, and I must admit that a 10x slowdown is a
> lot. However, I think this mostly affects small record arrays (i.e. those
> that fit in the CPU cache), and mainly in benchmarks (precisely because they
> fit well in cache). You can simulate a more realistic scenario by defining a
> large recarray that does not fit in the CPU's cache. For example:
>
> In [17]: Z = np.zeros((1000,1000), dtype=np.float64) # 8 MB object
>
> In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64),
> ('y',np.int64)]) # 16 MB object
>
> In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64),
> ('y',np.bool)]) # 9 MB object
>
> In [20]: x_fast = Z_fast['x']
> In [21]: timeit x_fast * x_fast
> 100 loops, best of 3: 5.48 ms per loop
>
> In [22]: x_slow = Z_slow['x']
>
> In [23]: timeit x_slow * x_slow
> 100 loops, best of 3: 14.4 ms per loop
>
> So, the slowdown is less than 3x, which is a more reasonable figure. If you
> need optimal speed for operating on unaligned columns, you can use numexpr.
> Here is an example of what you can expect from it:
>
> In [24]: import numexpr as nx
>
> In [25]: timeit nx.evaluate('x_slow * x_slow')
> 100 loops, best of 3: 11.1 ms per loop
>
> So, the slowdown is just 2x instead of 3x, which is near optimal for the
> unaligned case.
>
> Numexpr also seems to help for small recarrays that fit in cache (i.e. for
> benchmarking purposes ;) :
>
> # Create a 160 KB object
> In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])
> # Create a ~90 KB object (100*100 items of 9 bytes each)
> In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])
>
> In [28]: x_fast = Z_fast['x']
>
> In [29]: timeit x_fast * x_fast
> 10000 loops, best of 3: 20.7 µs per loop
>
> In [30]: x_slow = Z_slow['x']
>
> In [31]: timeit x_slow * x_slow
> 10000 loops, best of 3: 149 µs per loop
>
> In [32]: timeit nx.evaluate('x_slow * x_slow')
> 10000 loops, best of 3: 45.3 µs per loop
>
> Hope that helps,
>