[Numpy-discussion] question about in-place operations
Francesc Alted
francesc at continuum.io
Thu May 24 04:32:43 EDT 2012
On 5/22/12 9:08 PM, Massimo Di Pierro wrote:
> This problem is linear, so it is probably RAM I/O bound. I do not
> think I would benefit much from multiple cores. But I will give it a
> try. In the short term this is good enough for me.
Yeah, that is what common sense seems to indicate: RAM I/O bound
problems should not benefit from using multiple cores. But reality is
different:
>>> import numpy as np
>>> a = np.arange(1e8)
>>> c = 1.0
>>> time a*c
CPU times: user 0.22 s, sys: 0.20 s, total: 0.43 s
Wall time: 0.43 s
array([ 0.00000000e+00, 1.00000000e+00, 2.00000000e+00, ...,
9.99999970e+07, 9.99999980e+07, 9.99999990e+07])
Using numexpr with 1 thread (note that set_num_threads() returns the
*previous* thread count, hence the 8 below):
>>> import numexpr as ne
>>> ne.set_num_threads(1)
8
>>> time ne.evaluate("a*c")
CPU times: user 0.20 s, sys: 0.25 s, total: 0.45 s
Wall time: 0.45 s
array([ 0.00000000e+00, 1.00000000e+00, 2.00000000e+00, ...,
9.99999970e+07, 9.99999980e+07, 9.99999990e+07])
while using 8 threads (the machine has 8 physical cores):
>>> ne.set_num_threads(8)
1
>>> time ne.evaluate("a*c")
CPU times: user 0.39 s, sys: 0.68 s, total: 1.07 s
Wall time: 0.14 s
array([ 0.00000000e+00, 1.00000000e+00, 2.00000000e+00, ...,
9.99999970e+07, 9.99999980e+07, 9.99999990e+07])
which is about 3x faster than using a single thread (compare the wall
time figures).
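As a quick back-of-the-envelope check on those numbers (using the wall
times from the session above; the resulting figures are of course
machine-dependent):

```python
# Rough memory-bandwidth estimate for the a*c benchmark above.
# The expression reads 1e8 float64 values and writes 1e8 results.
n = int(1e8)
bytes_moved = n * 8 * 2              # 8 bytes per float64, one read + one write
gb = bytes_moved / 1e9               # 1.6 GB of memory traffic

print(f"1 thread:  {gb / 0.45:.1f} GB/s")   # wall time with 1 numexpr thread
print(f"8 threads: {gb / 0.14:.1f} GB/s")   # wall time with 8 threads
```

So the 8-thread run sustains roughly three times the memory bandwidth
of the single-thread run.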
To be clear, this is purely because several cores together can move
data between memory and the CPU faster than a single core can. I have
seen this behavior many times; for example, in slide 21 of this
presentation:
http://pydata.org/pycon2012/numexpr-cython/slides.pdf
one can see how using several cores can speed up not only a polynomial
computation, but also the simple expression "y = x", which is
essentially a memory copy.
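For readers who want to reproduce the effect without numexpr, here is a
minimal sketch (my own illustration, not numexpr's actual
implementation) that splits a plain memory copy across threads; NumPy's
slice assignment releases the GIL for large arrays, so the chunks can
stream in parallel:

```python
# Illustrative only: a chunked, multi-threaded version of "y = x".
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(src, dst, nthreads=4):
    """Copy src into dst, one contiguous chunk per thread."""
    bounds = np.linspace(0, src.size, nthreads + 1, dtype=int)
    def copy_chunk(i):
        lo, hi = bounds[i], bounds[i + 1]
        dst[lo:hi] = src[lo:hi]      # NumPy releases the GIL during the copy
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(copy_chunk, range(nthreads)))

x = np.arange(1e6)
y = np.empty_like(x)
parallel_copy(x, y)
assert np.array_equal(x, y)
```

Whether this actually beats a single-threaded copy depends on the
machine's memory subsystem; the point of the slide is that on hardware
with multiple memory channels it often does.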
Another example where this effect can be seen is the Blosc compressor.
For example, in:
http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks
the first point on each plot corresponds to Blosc at compression level
0, that is, it does not compress at all and basically copies data from
the source buffer to the destination buffer. Still, one can see that
using several threads can accelerate this copy well beyond memcpy speed.
So, definitely, several cores can make your memory-I/O-bound
computations go faster.
--
Francesc Alted