
On Friday 05 March 2010 at 14:46, Gael Varoquaux wrote:
On Fri, Mar 05, 2010 at 08:14:51AM -0500, Francesc Alted wrote:
FWIW, I observe very good speedups on my problems (pretty much linear in the number of CPUs), and I have data-parallel problems on fairly large data (~100 MB apiece, doesn't fit in cache), with no synchronisation at all between the workers. CPUs are Intel Xeons.
Maybe your processes are not as memory-bound as you think.
That's the only explanation that I can think of. I have two types of bottlenecks. One is BLAS level 3 operations (mainly SVDs) on large matrices; the second is resampling, where we repeat the same operation many times over almost the same chunk of data. In both cases the data is fairly large, so I expected the operations to be memory-bound.
Not at all. BLAS 3 operations are mainly CPU-bound, because the algorithms (if correctly implemented, of course, but any decent BLAS 3 library will do) have many chances to reuse data from caches. It is BLAS 1 (and, to a large extent, BLAS 2) operations that are memory-bound.

And in your second case, you are repeating the same operation over the same chunk of data. If this chunk is small enough to fit in cache, then the bottleneck is the CPU again (and probably access to the L1/L2 caches), not access to main memory. But if, as you said, you are seeing periods that are memory-bound (i.e. the CPUs are starving), then it may well be that this chunksize does not fit in cache, and your problem really is memory access in that case. You may get better performance by reducing your chunksize so that it fits in cache (L1 or L2).

So I do not think that NUMA architectures would perform your current computations any better than your current SMP platform (and, as you know, NUMA architectures are much more complex and expensive than SMP ones). But experimenting is *always* the best answer to these hairy questions ;-)

-- Francesc Alted
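[A minimal NumPy sketch of the arithmetic-intensity argument above, not from the original thread: a matrix multiply (BLAS 3) performs ~2*n^3 flops on ~2*n^2 elements, so each element gets reused ~n times from cache, while a vector add (BLAS 1) touches each element exactly once. Timing both and reporting effective GFLOP/s should show the BLAS 3 rate far higher on most machines; the sizes and helper name `gflops` here are arbitrary choices for illustration.]

```python
import time
import numpy as np

def gflops(flops, seconds):
    """Convert an operation count and an elapsed time to GFLOP/s."""
    return flops / seconds / 1e9

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# BLAS 3: matrix-matrix multiply does ~2*n**3 flops on 2*n**2 inputs,
# so each element is reused ~n times once it sits in cache (CPU-bound).
t0 = time.perf_counter()
c = a @ b
t_blas3 = time.perf_counter() - t0

# BLAS 1: vector add does n**2 flops on 2*n**2 inputs -- each element
# is loaded once and never reused, so the memory bus is the bottleneck.
x = a.ravel()
y = b.ravel()
t0 = time.perf_counter()
z = x + y
t_blas1 = time.perf_counter() - t0

print("BLAS 3 (matmul):     %.1f GFLOP/s" % gflops(2 * n**3, t_blas3))
print("BLAS 1 (vector add): %.2f GFLOP/s" % gflops(n * n, t_blas1))
```

On a typical machine the first number is one to two orders of magnitude larger than the second, even though both loops stream comparable amounts of data, which is exactly the cache-reuse effect described above.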