
Hi,
I've written a very simple benchmark on recarrays:
import numpy, time
Z = numpy.zeros((100,100), dtype=numpy.float64)
Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64), ('y',numpy.int32)])
Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64), ('y',numpy.bool)])

t = time.clock()
for i in range(10000): Z*Z
print time.clock()-t

t = time.clock()
for i in range(10000): Z_fast['x']*Z_fast['x']
print time.clock()-t

t = time.clock()
for i in range(10000): Z_slow['x']*Z_slow['x']
print time.clock()-t
And got the following results: 0.23, 0.37, 3.96.
Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool, or is there some machinery behind it that makes things slow in this case? Should this be mentioned somewhere in the recarray documentation?
Nicolas

On Wed, May 27, 2009 at 9:31 AM, Nicolas Rougier <Nicolas.Rougier@loria.fr> wrote:
Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool, or is there some machinery behind it that makes things slow in this case?
Probably. Record arrays are stored like packed C structures and need to be unpacked by copying the bytes to aligned data types.
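You can see the packed layout directly; here is a minimal sketch (using np.bool_ for the boolean field):

import numpy as np

Z_fast = np.zeros((100,100), dtype=[('x', np.float64), ('y', np.int32)])
Z_slow = np.zeros((100,100), dtype=[('x', np.float64), ('y', np.bool_)])

# Packed record sizes, as in a packed C struct: 8 + 4 = 12 bytes versus
# 8 + 1 = 9 bytes per record.
print Z_fast.dtype.itemsize, Z_slow.dtype.itemsize   # 12 9

# Extracting a field returns a strided view into the packed buffer; with
# a 9-byte stride the float64 elements can never be 8-byte aligned, so
# ufuncs have to copy the data through aligned buffers.
print Z_slow['x'].strides          # (900, 9)
print Z_slow['x'].flags.aligned    # False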
Should this be mentioned somewhere in the recarray documentation ?
A note would be appropriate, yes. You should be able to do that; do you have edit permission for the documentation?
Chuck

No, I don't have permission to edit.
Nicolas

On Wed, May 27, 2009 at 1:21 PM, Nicolas Rougier <Nicolas.Rougier@loria.fr> wrote:
No, I don't have permission to edit. Nicolas
You should ask for it then. Email Stéfan at stefan@sun.ac.za. The docs are here: http://docs.scipy.org/numpy/Front%20Page/.
Chuck

Hi Nicolas
2009/5/27 Nicolas Rougier <Nicolas.Rougier@loria.fr>:
No, I don't have permission to edit.
Thanks for helping out with the docs! Please create an account on docs.scipy.org and give me a shout when you're done.
Cheers, Stéfan

I just created the account.
Nicolas

Thanks, Nicolas. Your username has been changed to "NicolasRougier" and you can now edit the docs.
Regards Stéfan

On Wednesday, 27 May 2009 at 17:31:20, Nicolas Rougier wrote:
Am I right in thinking that the last case is quite slow because of some memory misalignment between float64 and bool, or is there some machinery behind it that makes things slow in this case? Should this be mentioned somewhere in the recarray documentation?
Yes, I can reproduce your results, and I must admit that a 10x slowdown is a lot. However, I think this mostly affects small record arrays (i.e. those that fit in the CPU cache), and mainly in benchmarks (precisely because they fit well in cache). You can simulate a more realistic scenario by defining a large recarray that does not fit in the CPU's cache. For example:
In [17]: Z = np.zeros((1000,1000), dtype=np.float64) # 8 MB object
In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64), ('y',np.int64)]) # 16 MB object
In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64), ('y',np.bool)]) # 9 MB object
In [20]: x_fast = Z_fast['x']

In [21]: timeit x_fast * x_fast
100 loops, best of 3: 5.48 ms per loop
In [22]: x_slow = Z_slow['x']
In [23]: timeit x_slow * x_slow
100 loops, best of 3: 14.4 ms per loop
So, the slowdown is less than 3x, which is a more reasonable figure. If you need optimal speed for operating on unaligned columns, you can use numexpr. Here is an example of what you can expect from it:
In [24]: import numexpr as nx

In [25]: timeit nx.evaluate('x_slow * x_slow')
100 loops, best of 3: 11.1 ms per loop
So, the slowdown is just 2x instead of 3x, which is near optimal for the unaligned case.
Numexpr also seems to help for small recarrays that fit in cache (i.e. for benchmarking purposes ;) :
# Create a 160 KB object
In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])

# Create a 90 KB object
In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])
In [28]: x_fast = Z_fast['x']
In [29]: timeit x_fast * x_fast
10000 loops, best of 3: 20.7 µs per loop
In [30]: x_slow = Z_slow['x']
In [31]: timeit x_slow * x_slow
10000 loops, best of 3: 149 µs per loop
In [32]: timeit nx.evaluate('x_slow * x_slow')
10000 loops, best of 3: 45.3 µs per loop
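For completeness, two more ways to sidestep the penalty; a minimal sketch (untimed, assuming the same Z_slow layout as above):

import numpy as np

Z_slow = np.zeros((1000,1000), dtype=[('x', np.float64), ('y', np.bool_)])

# Option 1: pay the unpacking cost once, then work on an aligned,
# contiguous copy of the column.
x = Z_slow['x'].copy()
print x.flags.aligned                # True

# Option 2: build the dtype with align=True so NumPy pads the record
# like an aligned C struct (itemsize 16 instead of 9), trading memory
# for aligned access.
dt = np.dtype([('x', np.float64), ('y', np.bool_)], align=True)
Z_aligned = np.zeros((1000,1000), dtype=dt)
print dt.itemsize                    # 16
print Z_aligned['x'].flags.aligned   # True

The copy makes sense when a column is reused many times; align=True makes sense when you control the dtype up front.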
Hope that helps,

Thanks for the clear answer, it definitely helps.
Nicolas
participants (4)
- Charles R Harris
- Francesc Alted
- Nicolas Rougier
- Stéfan van der Walt