[Numpy-discussion] question about in-place operations

Francesc Alted francesc at continuum.io
Thu May 24 04:32:43 EDT 2012


On 5/22/12 9:08 PM, Massimo Di Pierro wrote:
> This problem is linear so probably Ram IO bound. I do not think I
> would benefit much for multiple cores. But I will give it a try. In
> the short term this is good enough for me.

Yeah, this is what common sense seems to indicate: RAM I/O bound 
problems do not benefit from using multiple cores.  But reality is 
different:

 >>> import numpy as np
 >>> a = np.arange(1e8)
 >>> c = 1.0
 >>> time a*c
CPU times: user 0.22 s, sys: 0.20 s, total: 0.43 s
Wall time: 0.43 s
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
          9.99999970e+07,   9.99999980e+07,   9.99999990e+07])

Using numexpr with 1 thread:

 >>> import numexpr as ne
 >>> ne.set_num_threads(1)
8
 >>> time ne.evaluate("a*c")
CPU times: user 0.20 s, sys: 0.25 s, total: 0.45 s
Wall time: 0.45 s
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
          9.99999970e+07,   9.99999980e+07,   9.99999990e+07])

while using 8 threads (the machine has 8 physical cores):

 >>> ne.set_num_threads(8)
1
 >>> time ne.evaluate("a*c")
CPU times: user 0.39 s, sys: 0.68 s, total: 1.07 s
Wall time: 0.14 s
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
          9.99999970e+07,   9.99999980e+07,   9.99999990e+07])

which is about 3x faster than using a single thread (compare the wall time figures).
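For a sense of scale, here is a quick back-of-the-envelope calculation of the effective memory bandwidth these wall times imply (a sketch; the factor of 2 assumes one read of the 800 MB `a` array plus one write of the result, with the scalar `c` being negligible):

```python
# Bandwidth implied by the numexpr wall times above.
# Assumption: "a*c" reads the float64 input once and writes the result once.
n = 10**8                # elements in np.arange(1e8)
bytes_moved = n * 8 * 2  # float64 read + float64 write

GB = 1e9
bw_1_thread = bytes_moved / 0.45 / GB   # ~3.6 GB/s with 1 thread
bw_8_threads = bytes_moved / 0.14 / GB  # ~11.4 GB/s with 8 threads
speedup = bw_8_threads / bw_1_thread    # ~3.2x
print(round(bw_1_thread, 1), round(bw_8_threads, 1), round(speedup, 1))
```

That is, a single core is nowhere near saturating the memory bus on this machine, which is why adding threads still helps.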

To be clear, this speedup is purely due to the fact that several 
cores can move data between memory and the CPU faster than a 
single core can.  I have seen this behavior many times; for example, in 
slide 21 of this presentation:

http://pydata.org/pycon2012/numexpr-cython/slides.pdf

one can see how using several cores can speed up not only a polynomial 
computation, but also the simple expression "y = x", which is 
essentially a memory copy.

Another example where this effect can be seen is the Blosc compressor.  
For example, in:

http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks

the first point on each of the plots corresponds to Blosc at compression 
level 0, that is, it does not compress at all, and basically copies 
data from the origin buffer to the destination buffer.  Still, one can see 
that using several threads can accelerate this copy well beyond memcpy speed.
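Measuring a single-threaded memcpy-style baseline for yourself is straightforward; a minimal sketch (pure Python, standard library only; the 200 MB buffer size is an arbitrary choice):

```python
import time

size = 200 * 1024 * 1024          # ~200 MB source buffer
src = bytearray(size)

t0 = time.perf_counter()
dst = bytes(src)                  # single-threaded, memcpy-like copy
elapsed = time.perf_counter() - t0

# Copy bandwidth in GB/s: bytes read plus bytes written, over elapsed time.
gb_per_s = (2 * size) / elapsed / 1e9
print(f"copied {size / 1e6:.0f} MB in {elapsed:.3f} s -> {gb_per_s:.1f} GB/s")
```

Comparing this number against a multi-threaded copy (e.g. Blosc at level 0) shows the same effect as the plots: one thread alone does not exhaust the available memory bandwidth.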

So, definitely, several cores can make your memory-I/O-bound 
computations go faster.

-- 
Francesc Alted



