Hi,

I've written a very simple benchmark on recarrays:

import numpy, time

Z = numpy.zeros((100,100), dtype=numpy.float64)
Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64), ('y',numpy.int32)])
Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64), ('y',numpy.bool)])

t = time.clock()
for i in range(10000): Z*Z
print time.clock()-t

t = time.clock()
for i in range(10000): Z_fast['x']*Z_fast['x']
print time.clock()-t

t = time.clock()
for i in range(10000): Z_slow['x']*Z_slow['x']
print time.clock()-t

And got the following results:

0.23
0.37
3.96

Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool or is there some machinery behind that makes things slow in this case? Should this be mentioned somewhere in the recarray documentation?

Nicolas
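The benchmark as posted is Python 2 (print statement, and time.clock, which was removed in Python 3.8). For anyone wanting to rerun it today, here is a minimal sketch of the same measurement in Python 3. It assumes a recent NumPy, uses np.bool_ where the post uses numpy.bool, and the helper name bench is only illustrative. Absolute timings will differ from the 2009 figures.

```python
import time

import numpy as np


def bench(column, n=10000):
    """Return the seconds taken to evaluate column * column n times."""
    start = time.perf_counter()
    for _ in range(n):
        column * column  # result discarded, as in the original loop
    return time.perf_counter() - start


# Same three arrays as in the original post.
Z = np.zeros((100, 100), dtype=np.float64)
Z_fast = np.zeros((100, 100), dtype=[('x', np.float64), ('y', np.int32)])
Z_slow = np.zeros((100, 100), dtype=[('x', np.float64), ('y', np.bool_)])

timings = {
    "plain float64": bench(Z),
    "Z_fast['x']": bench(Z_fast['x']),
    "Z_slow['x']": bench(Z_slow['x']),
}

for label, seconds in timings.items():
    print(f"{label:>14}: {seconds:.2f} s")
```

time.perf_counter measures wall-clock time, so the numbers are only indicative on a busy machine.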
On Wed, May 27, 2009 at 9:31 AM, Nicolas Rougier <Nicolas.Rougier@loria.fr> wrote:
Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool or is there some machinery behind that makes things slow in this case ?
Probably. Record arrays are stored like packed C structures, and the fields need to be unpacked by copying the bytes into aligned data types.
Should this be mentioned somewhere in the recarray documentation ?
A note would be appropriate, yes. You should be able to do that; do you have edit permissions for the documentation?

Chuck
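The packed layout described in this reply can be inspected directly. The sketch below is not from the thread: it assumes a recent NumPy on a typical 64-bit platform and uses np.bool_ in place of numpy.bool. It compares the default (packed) dtype with the same fields declared with align=True, which asks NumPy to pad the record the way a C compiler would.

```python
import numpy as np

# Default structured dtype: fields are packed back to back (8 + 1 bytes).
packed = np.dtype([('x', np.float64), ('y', np.bool_)])

# Same fields, padded so that every field sits at an aligned offset.
padded = np.dtype([('x', np.float64), ('y', np.bool_)], align=True)

print("packed itemsize:", packed.itemsize)
print("padded itemsize:", padded.itemsize)

Z_slow = np.zeros((100, 100), dtype=packed)
Z_padded = np.zeros((100, 100), dtype=padded)

x_slow = Z_slow['x']      # view with a 9-byte step between float64 values
x_padded = Z_padded['x']  # view with a step that is a multiple of 8 bytes

print("packed view:", x_slow.strides, "aligned =", x_slow.flags.aligned)
print("padded view:", x_padded.strides, "aligned =", x_padded.flags.aligned)
```

The padded layout trades memory for alignment: each record takes more space, but the 'x' column can be read without the byte-copying step.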
No, I don't have permission to edit.

Nicolas

On 27 May, 2009, at 18:01, Charles R Harris wrote:
A note would be appropriate, yes. You should be able to do that, do you have edit permissions for the documentation?
Chuck
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Wed, May 27, 2009 at 1:21 PM, Nicolas Rougier <Nicolas.Rougier@loria.fr> wrote:
No, I don't have permission to edit. Nicolas
You should ask for it then. Email Stéfan at <stefan@sun.ac.za>. The docs are here <http://docs.scipy.org/numpy/Front%20Page/>.

Chuck
I just created the account.

Nicolas

On Thu, 2009-05-28 at 11:21 +0200, Stéfan van der Walt wrote:
Hi Nicolas
2009/5/27 Nicolas Rougier <Nicolas.Rougier@loria.fr>:
No, I don't have permission to edit.
Thanks for helping out with the docs! Please create an account on docs.scipy.org and give me a shout when you're done.
Cheers
Stéfan
On Wednesday 27 May 2009 17:31:20 Nicolas Rougier wrote:
Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool or is there some machinery behind that makes things slow in this case ? Should this be mentioned somewhere in the recarray documentation ?
Yes, I can reproduce your results, and I must admit that a 10x slowdown is a lot. However, I think that this mostly affects small record arrays (i.e. those that fit in the CPU cache), and mainly in benchmarks (precisely because they fit well in cache). You can simulate a more real-life scenario by defining a large recarray that does not fit in the CPU's cache. For example:

In [17]: Z = np.zeros((1000,1000), dtype=np.float64) # 8 MB object

In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64), ('y',np.int64)]) # 16 MB object

In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64), ('y',np.bool)]) # 9 MB object

In [20]: x_fast = Z_fast['x']

In [21]: timeit x_fast * x_fast
100 loops, best of 3: 5.48 ms per loop

In [22]: x_slow = Z_slow['x']

In [23]: timeit x_slow * x_slow
100 loops, best of 3: 14.4 ms per loop

So, the slowdown is less than 3x, which is a more reasonable figure. If you need optimal speed for operating on unaligned columns, you can use numexpr. Here is an example of what you can expect from it:

In [24]: import numexpr as nx

In [25]: timeit nx.evaluate('x_slow * x_slow')
100 loops, best of 3: 11.1 ms per loop

So, the slowdown is just 2x instead of 3x, which is near optimal for the unaligned case.

Numexpr also seems to help for small recarrays that fit in cache (i.e. for benchmarking purposes ;) :

# Create a 160 KB object
In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])

# Create a 90 KB object
In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])

In [28]: x_fast = Z_fast['x']

In [29]: timeit x_fast * x_fast
10000 loops, best of 3: 20.7 µs per loop

In [30]: x_slow = Z_slow['x']

In [31]: timeit x_slow * x_slow
10000 loops, best of 3: 149 µs per loop

In [32]: timeit nx.evaluate('x_slow * x_slow')
10000 loops, best of 3: 45.3 µs per loop

Hope that helps,

Francesc Alted
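When numexpr is not available and the same column is reused many times, another option is to copy the unaligned column once into an ordinary contiguous array and operate on the copy. This is not something proposed in the thread. It is a sketch under the assumption that the extra memory for the copy is acceptable and that writes do not need to propagate back to the record array. The helper name best_ms is hypothetical.

```python
import timeit

import numpy as np

# Large packed record array, as in the example above (np.bool_ instead of np.bool).
Z_slow = np.zeros((1000, 1000), dtype=[('x', np.float64), ('y', np.bool_)])

x_slow = Z_slow['x']                   # strided, unaligned view into the records
x_copy = np.ascontiguousarray(x_slow)  # one-off aligned, contiguous copy


def best_ms(func, number=20, repeat=3):
    """Best-of-`repeat` time per call of func, in milliseconds."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number * 1e3


t_view = best_ms(lambda: x_slow * x_slow)
t_copy = best_ms(lambda: x_copy * x_copy)

print(f"unaligned view : {t_view:.2f} ms per loop")
print(f"contiguous copy: {t_copy:.2f} ms per loop")
```

The copy itself has a cost comparable to one pass over the unaligned data, so this only pays off when the column is used in several operations.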
Thanks for the clear answer, it definitely helps.

Nicolas

On Thu, 2009-05-28 at 19:25 +0200, Francesc Alted wrote:
Hope that helps,
participants (4)

Charles R Harris

Francesc Alted

Nicolas Rougier

Stéfan van der Walt