
On Fri, Mar 5, 2010 at 7:29 PM, Brian Granger <ellisonbg.net@gmail.com> wrote:
Francesc,
Yeah, a 10% improvement from using multiple cores is an expected figure for memory-bound problems. This is something people must know: if their computations are memory bound (and this is much more common than one may initially think), then they should not expect significant speed-ups in their parallel codes.
+1
Thanks for emphasizing this. This is definitely a big issue with multicore.
Cheers,
Brian
Hi, here's a few notes...

A) cache
B) multiple cores/cpus multiply other optimisations.

A) Understanding cache is also very useful. Cache exists at two levels:

1. disk cache.
2. cpu/core cache.

1. Mmap'd files are useful since you can reuse the disk cache as program memory, so large files don't waste RAM on a duplicate disk cache. For example, processing a 1GB file can use 1GB of memory with mmap, but 2GB without it. mmap behaviour is very different on windows/linux/mac osx; the best mmap implementation is on linux. Note that on some OSes the disk cache is a separate reserved area of memory which processes can not use, so mmap is the easiest way to get at it. Mmaping on SSDs is also quite fast :) ps, learn about madvise for extra goodness :) (see the first sketch below)

2. cpu cache is what can give you a speedup when you use extra cpus/cores. There are a number of different cpu architectures these days, but generally you will get a speedup if your cpus access different areas of memory. So don't have cpu1 process one part of the data and then have cpu2 process that same part, otherwise the caches get invalidated - especially when there is an 8MB cache per cpu :) This is why Xeons and other high-end cpus will give you numpy speedups more easily. Also consider processing in chunks smaller than your cache, especially for multi-pass algorithms (see the second sketch below). There's a lot to caching, but I think the above gives enough useful hints :)

B) Multiple processes can also multiply the effects of your other optimisations. A 2x speedup via SSE or other SIMD gets multiplied over each cpu/core: if your code gets 8x faster with multiple processes, then the 2x optimisation likely becomes a 16x speedup (see the third sketch below).
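First sketch - reusing the disk cache via mmap. The file name 'big.dat' and the dtype are made up for illustration, and calling madvise from python needs a recent python (3.8+) on a POSIX system:

    import mmap
    import numpy as np

    # np.memmap reuses the OS disk cache as the array's memory, so a 1GB
    # file costs roughly 1GB in total instead of cache + a separate copy.
    data = np.memmap('big.dat', dtype=np.float64, mode='r')
    print(data.sum())  # pages are faulted in from the disk cache on demand

    # With the raw mmap module you can also hint the kernel:
    with open('big.dat', 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        m.madvise(mmap.MADV_SEQUENTIAL)  # we intend to read front to back
        m.close()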
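Second sketch - chunked processing, doing all the passes over one block while it is still hot in cache. The 256KB block size is a guess; tune it to your cpu's cache:

    import numpy as np

    a = np.random.rand(10000000)
    out = np.empty_like(a)
    block = 256 * 1024 // a.itemsize  # elements per cache-sized block

    for i in range(0, a.size, block):
        chunk = a[i:i + block]
        # two "passes" over the chunk while it is still in cache
        out[i:i + block] = np.sqrt(chunk) * chunk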
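Third sketch - splitting work across processes so each core stays in its own region of memory. It reuses the made-up 'big.dat' from above; each worker re-opens the mmap read-only, so the disk cache is shared and nothing is copied between processes:

    import numpy as np
    from multiprocessing import Pool

    def process_chunk(bounds):
        start, stop = bounds
        data = np.memmap('big.dat', dtype=np.float64, mode='r')
        return float(data[start:stop].sum())

    if __name__ == '__main__':
        n = np.memmap('big.dat', dtype=np.float64, mode='r').shape[0]
        edges = np.linspace(0, n, 9).astype(int)  # 8 chunks for 8 cores
        jobs = list(zip(edges[:-1], edges[1:]))
        pool = Pool(8)
        print(sum(pool.map(process_chunk, jobs)))
        pool.close()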
The following is a common optimisation pattern with python code: from python to numpy you can get a 20x speedup; from numpy to C/C++ roughly another 2-5x (call it 50x over python); then an asm optimisation is 2-4x faster again.
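Those numbers will vary a lot with the operation and array size, so here's a rough way to measure the python -> numpy step on your own machine:

    import timeit
    import numpy as np

    xs = list(range(1000000))
    a = np.arange(1000000, dtype=np.float64)

    t_py = timeit.timeit(lambda: sum(x * x for x in xs), number=10)
    t_np = timeit.timeit(lambda: (a * a).sum(), number=10)
    print('python/numpy speedup: %.1fx' % (t_py / t_np))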
So that's up to 200x faster compared to pure python... then multiply that by 8x for the cores, and you have up to 1600x faster code :) Small optimisations add up too: an extra 0.2x speedup can easily turn into a 1.6x speedup once it is multiplied over 8 cpus. So as you can see, multiple cores make it EASIER to optimise programs, since your optimisations are often multiplied.

cu,