
Gael,

On Fri, Mar 05, 2010 at 10:51:12AM +0100, Gael Varoquaux wrote:
On Fri, Mar 05, 2010 at 09:53:02AM +0100, Francesc Alted wrote:
Yeah, a 10% improvement from using multiple cores is an expected figure for memory-bound problems. This is something people must know: if their computations are memory bound (and this is much more common than one may initially think), then they should not expect significant speed-ups from their parallel codes.
Hey Francesc,
Any chance this can be different for NUMA (non-uniform memory access) architectures? AMD multicores used to be NUMA, when I was still following these problems.
As far as I can tell, NUMA architectures work better at accelerating processes that run independently of one another. In this case, the hardware is in charge of putting closely related data in memory that is 'nearer' to each processor. This scenario *could* happen in truly parallel processes too, but as I said, in general it works best for independent processes (read: multiuser machines).
FWIW, I observe very good speedups on my problems (pretty much linear in the number of CPUs), and I have data-parallel problems on fairly large data (~100 MB apiece, doesn't fit in cache), with no synchronisation at all between the workers. CPUs are Intel Xeons.
Maybe your processes are not as memory-bound as you think. Do you get a much better speed-up with NUMA than with a simple multi-core machine that has a single path to memory? I don't think so, but maybe I'm wrong here.

Francesc
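
To put rough numbers on the memory-bound argument, here is a minimal sketch of an Amdahl-style toy model. It assumes (and this is only an assumption for illustration, not a measurement) that the compute part of a job scales perfectly with the number of workers, while all memory traffic goes through one shared path and does not speed up at all. The function name `estimated_speedup` and the example time fractions are hypothetical.

```python
def estimated_speedup(n_workers, compute_time, memory_time):
    """Toy estimate of parallel speed-up for a partly memory-bound job.

    Assumptions (illustrative only):
    - compute_time divides evenly across n_workers;
    - memory_time is spent on a single shared memory path and is
      therefore the same no matter how many workers are used.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    serial = compute_time + memory_time
    parallel = compute_time / n_workers + memory_time
    return serial / parallel


if __name__ == "__main__":
    # Hypothetical job that spends 90% of its time waiting on memory.
    for n in (1, 2, 4, 8):
        print(n, round(estimated_speedup(n, 0.1, 0.9), 3))
    # Hypothetical job that spends only 10% of its time waiting on memory.
    for n in (1, 2, 4, 8):
        print(n, round(estimated_speedup(n, 0.9, 0.1), 3))
```

Under this model a job that spends about 90% of its time on memory traffic can never gain more than roughly 11%, however many cores are added, whereas a near-linear speed-up like the one Gael reports would suggest the memory share is small or that the memory path is not actually shared between workers.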