[Numpy-discussion] Benchmark on record arrays
Nicolas Rougier
Nicolas.Rougier at loria.fr
Fri May 29 07:53:12 EDT 2009
Thanks for the clear answer; it definitely helps.
Nicolas
On Thu, 2009-05-28 at 19:25 +0200, Francesc Alted wrote:
> > On Wednesday 27 May 2009 17:31:20, Nicolas Rougier wrote:
> > Hi,
> >
> > I've written a very simple benchmark on recarrays:
> >
> > import numpy, time
> >
> > Z = numpy.zeros((100,100), dtype=numpy.float64)
> > Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> > ('y',numpy.int32)])
> > Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> > ('y',numpy.bool)])
> >
> > t = time.clock()
> > for i in range(10000): Z*Z
> > print time.clock()-t
> >
> > t = time.clock()
> > for i in range(10000): Z_fast['x']*Z_fast['x']
> > print time.clock()-t
> >
> > t = time.clock()
> > for i in range(10000): Z_slow['x']*Z_slow['x']
> > print time.clock()-t
> >
> >
> > And got the following results:
> > 0.23
> > 0.37
> > 3.96
> >
> > Am I right in thinking that the last case is quite slow because of some
> > memory misalignment between float64 and bool, or is there some machinery
> > behind the scenes that makes things slow in this case? Should this be
> > mentioned somewhere in the recarray documentation?
>
> Yes, I can reproduce your results, and I must admit that a 10x slowdown is a
> lot. However, I think this mostly affects small record arrays (i.e. those
> that fit in the CPU cache), and mainly in benchmarks (precisely because they
> fit well in cache). You can simulate a more realistic scenario by defining a
> large recarray that does not fit in the CPU's cache. For example:
>
> In [17]: Z = np.zeros((1000,1000), dtype=np.float64) # 8 MB object
>
> In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64),
> ('y',np.int64)]) # 16 MB object
>
> In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64),
> ('y',np.bool)]) # 9 MB object
>
> In [20]: x_fast = Z_fast['x']
> In [21]: timeit x_fast * x_fast
> 100 loops, best of 3: 5.48 ms per loop
>
> In [22]: x_slow = Z_slow['x']
>
> In [23]: timeit x_slow * x_slow
> 100 loops, best of 3: 14.4 ms per loop
>
> So, the slowdown is less than 3x, which is a more reasonable figure. If you
> need optimal speed for operating on unaligned columns, you can use numexpr.
> Here is an example of what you can expect from it:
>
> In [24]: import numexpr as nx
>
> In [25]: timeit nx.evaluate('x_slow * x_slow')
> 100 loops, best of 3: 11.1 ms per loop
>
> So, the slowdown is just 2x instead of 3x, which is near optimal for the
> unaligned case.
>
> Numexpr also seems to help for small recarrays that fit in cache (i.e. for
> benchmarking purposes ;) :
>
> # Create a 160 KB object
> In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])
> # Create a ~90 KB object (100*100 items of 9 bytes each)
> In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])
>
> In [28]: x_fast = Z_fast['x']
>
> In [29]: timeit x_fast * x_fast
> 10000 loops, best of 3: 20.7 µs per loop
>
> In [30]: x_slow = Z_slow['x']
>
> In [31]: timeit x_slow * x_slow
> 10000 loops, best of 3: 149 µs per loop
>
> In [32]: timeit nx.evaluate('x_slow * x_slow')
> 10000 loops, best of 3: 45.3 µs per loop
>
> Hope that helps,
>