
On Fri, Mar 05, 2010 at 08:14:51AM -0500, Francesc Alted wrote:
FWIW, I observe very good speedups on my problems (pretty much linear in the number of CPUs), and I have data-parallel problems on fairly large data (~100 MB apiece, which doesn't fit in cache), with no synchronisation at all between the workers. CPUs are Intel Xeons.
Maybe your processes are not as memory-bound as you think.
That's the only explanation that I can think of. I have two types of bottlenecks. One is BLAS level 3 operations (mainly SVDs) on large matrices; the second is resampling, where I repeat the same operation many times over almost the same chunk of data. In both cases the data is fairly large, so I expected the operations to be memory-bound. However, thinking about it, I believe that when I timed these operations carefully, the processes seemed to alternate between a starving period, during which they were IO-bound, and a productive period, during which they were CPU-bound. After a few cycles, the different periods would fall into a mutually desynchronised alternation, with one process IO-bound and the others CPU-bound, and it would become fairly efficient. Of course, this is possible only because I have no cross-talk between the processes.
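For anyone who wants to check how close to linear the scaling is on their own box, here is a minimal sketch of the kind of measurement I mean. It is not my actual code: the names (bootstrap_mean, timed_run) and the sizes are hypothetical, and it uses only the standard library, with a toy bootstrap standing in for the real resampling. Each task is fully independent, so there is no synchronisation between workers.

```python
import random
import time
from multiprocessing import Pool

N_SAMPLES = 20_000   # size of the (hypothetical) data chunk
N_REPEATS = 16       # number of independent resampling tasks


def bootstrap_mean(seed):
    """One independent task: build the data, resample it, return the mean."""
    data_rng = random.Random(0)                  # same data for every task
    data = [data_rng.random() for _ in range(N_SAMPLES)]
    rng = random.Random(seed)                    # task-specific resampling
    resampled = rng.choices(data, k=N_SAMPLES)   # sampling with replacement
    return sum(resampled) / N_SAMPLES


def timed_run(n_procs):
    """Run all tasks on n_procs workers; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with Pool(n_procs) as pool:
        results = pool.map(bootstrap_mean, range(N_REPEATS))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    base, _ = timed_run(1)
    for n in (1, 2, 4):
        elapsed, _ = timed_run(n)
        print(f"{n} process(es): {elapsed:.2f} s, speed-up {base / elapsed:.2f}x")
```

If the speed-up flattens well before the number of cores, the job is probably limited by memory bandwidth or IO rather than by the CPU. A toy chunk this small fits in cache, so the sizes would need to be raised to say anything about memory-bound behaviour.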
Do you get much better speed-up by using NUMA than a simple multi-core machine with one single path to memory? I don't think so, but maybe I'm wrong here.
I don't know. All the boxes around here have Intel CPUs, and I believe that they are all SMP machines.

Gaël