
Francesc Alted wrote:
On Friday 22 May 2009 11:42:56 Gregor Thalhammer wrote:
dmitrey wrote:
3) Improving performance by using multiple cores is much more difficult. A significant speedup is possible only for sufficiently large (>1e5) arrays. Where a speed gain is possible, the MKL uses several cores. Some experimentation showed that by adding a few OpenMP constructs you could get a similar speedup with numpy.
4) numpy.dot uses optimized implementations.
Good points, Gregor. However, I wouldn't say that improving performance by using multiple cores is *that* difficult, but rather that multiple cores can only be used efficiently *whenever* memory bandwidth is not the limitation. An example of this is the computation of transcendental functions, where, even with vectorized implementations, the computation speed is still CPU-bound in many cases. And you have seen very good speed-ups yourself for these cases with your implementation of numexpr/MKL :)
Using multiple cores is pretty easy for element-wise ufuncs; no communication needs to occur and the work partitioning is trivial. And actually I've found with some initial testing that multiple cores do still help when you are memory bound. I don't fully understand why yet, though I have some ideas. One reason is multiple memory controllers due to multiple sockets (i.e. Opteron). Another is that each thread is pulling memory from a different bank, utilizing more bandwidth than a single sequential thread could. However, if that's the case, we could possibly come up with code for a single thread that achieves (nearly) the same additional throughput.

Andrew
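
To make the "trivial work partitioning" point concrete, here is a minimal sketch using only the Python standard library. It is not numpy's or numexpr's implementation; the helper names chunk_bounds and parallel_map_inplace are made up for illustration. Each worker gets one contiguous slice of the index range and writes only into its own part of the output, so no communication or locking is needed.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, n_workers):
    """Split range(n) into n_workers contiguous, near-equal [start, stop) slices."""
    base, extra = divmod(n, n_workers)
    bounds = []
    start = 0
    for i in range(n_workers):
        # The first `extra` workers take one additional element each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def parallel_map_inplace(func, src, dst, n_workers=4):
    """Apply func element-wise; each worker writes only its own slice of dst."""
    def work(bound):
        start, stop = bound
        for i in range(start, stop):
            dst[i] = func(src[i])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for all chunks and re-raises any worker exception.
        list(pool.map(work, chunk_bounds(len(src), n_workers)))
    return dst


# Hypothetical input: 10,000 sample points, transformed with math.sin.
src = [i * 0.001 for i in range(10_000)]
dst = [0.0] * len(src)
parallel_map_inplace(math.sin, src, dst)
```

Note that this only shows the partitioning logic. In CPython the GIL keeps a pure-Python loop like work() from running in parallel, so no speedup should be expected from this sketch as written; a real gain would need the per-chunk work done in compiled code that runs outside the GIL (for example an OpenMP loop, as mentioned earlier in the thread).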