[Numpy-discussion] multiprocessing shared arrays and numpy

Sun Mar 7 14:00:03 EST 2010

On Fri, Mar 5, 2010 at 7:29 PM, Brian Granger <ellisonbg.net at gmail.com> wrote:
> Francesc,
>
>> Yeah, 10% of improvement by using multi-cores is an expected figure for
>> memory bound problems.  This is something people must know: if their
>> computations are memory bound (and this is much more common than one may
>> initially think), then they should not expect significant speed-ups on
>> their parallel codes.
>>
>
> +1
>
> Thanks for emphasizing this.  This is definitely a big issue with multicore.
>
> Cheers,
>
> Brian
>

Hi,

here are a few notes...

A) cache
B) multiple cores/cpus multiply other optimisations.

A) Understanding cache is also very useful.

Cache at two levels:
    1. disk cache.
    2. cpu/core cache.

1. Mmap'd files are useful since you can reuse the disk cache as
program memory, so large files don't waste RAM by sitting in both the
disk cache and your process.  For example, processing a 1GB file can
use 1GB of memory with mmap, but 2GB without (one copy in the disk
cache, one in your own buffer).  PS: learn about madvise for extra
goodness :)  mmap behaviour is very different on Windows/Linux/Mac OS X.
The best mmap implementation is on Linux.  Note that on some OSes the
disk cache has separate reserved areas of memory which processes
cannot use... so mmap is the easiest way to access it.  mmapping on
SSDs is also quite fast :)
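
A minimal sketch of the mmap idea using only Python's standard mmap
module.  The helper name count_byte_mmap is made up for illustration,
and the madvise call is treated as an optional hint because it only
exists on some platforms and on Python 3.8 or later:

```python
import mmap


def count_byte_mmap(path, byte_value):
    """Count occurrences of byte_value (a bytes object) in a file.

    The file is scanned through a read-only mapping, so the pages come
    straight from the OS disk cache instead of being copied into a
    separate Python buffer.  Note: mapping an empty file raises ValueError.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Optional hint that we will read front to back.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            count = 0
            pos = mm.find(byte_value)
            while pos != -1:
                count += 1
                pos = mm.find(byte_value, pos + 1)
            return count
```

For array data, numpy.memmap gives a similar file-backed view that
behaves like an ndarray.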

2. The cpu cache is what can give you a speedup when you use extra
cpus/cores.  There are a number of different cpu architectures these
days... but generally you will get a speed up if your cpus access
different areas of memory.  So don't get cpu1 to process one part of
the data and then have cpu2 process the same part - otherwise the
cache can get invalidated.  Especially if you have an 8MB cache per
cpu :)  This is why the Xeons and other high end cpus will give you
numpy speedups more easily.  Also consider processing in chunks
smaller than the size of your cache (especially for multi-pass
algorithms).
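
The two hints above (give each cpu its own area of memory, and work in
cache-sized chunks) can be sketched as plain index arithmetic.  The
function names and the sizes in the usage note are assumptions chosen
for illustration; real cache sizes vary by cpu:

```python
def worker_blocks(n_items, n_workers):
    """Split range(n_items) into contiguous, non-overlapping (start, stop)
    blocks, one per worker, so no two workers touch the same memory."""
    base, extra = divmod(n_items, n_workers)
    blocks = []
    start = 0
    for w in range(n_workers):
        # The first `extra` workers take one additional item each.
        stop = start + base + (1 if w < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


def cache_chunks(start, stop, item_bytes, cache_bytes):
    """Yield (lo, hi) sub-ranges of [start, stop) whose data is at most
    cache_bytes in size (or a single item if one item is larger)."""
    per_chunk = max(1, cache_bytes // item_bytes)
    for lo in range(start, stop, per_chunk):
        yield lo, min(lo + per_chunk, stop)
```

For example, with 8-byte floats and a budget of half of an assumed 8MB
cache, each worker would loop over cache_chunks(start, stop, 8,
4 * 1024 * 1024) and run every pass of a multi-pass algorithm on
a[lo:hi] before moving on to the next chunk.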

There's a lot to caching, but I think the above gives enough useful hints :)

B) Also, multiple processes can multiply the effects of your other
optimisations.

A 2x speed up via SSE or other SIMD can be multiplied over each
cpu/core.  So if your code gets 8x faster with multiple processes, then
the 2x optimisation is likely a 16x speed up overall.
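
As a small worked example of that arithmetic (a sketch only: it
assumes the factors are independent and that scaling across cores is
ideal, which memory bound code will not reach, as noted earlier in the
thread).  The name combined_speedup is made up for illustration:

```python
import math


def combined_speedup(*factors):
    """Multiply independent speedup factors into one overall factor."""
    return math.prod(factors)


# 2x from SIMD on each core, 8x from spreading the work over 8 cores.
print(combined_speedup(2, 8))
```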

The following is a common optimisation pattern with python code.
From python to numpy you can get a 20x speedup.  From numpy to C/C++
you can get up to a 5x speedup (or 100x over python).  Then an asm
optimisation is 2-4x faster again.

So up to 400x faster compared to pure python... then multiply that by
8x, and you have up to 3200x faster code :)  Also small optimisations
add up... a small 1.2x speedup per core is worth an extra 0.2 x 8 = 1.6
cores of throughput when you have 8 cores.

So as you can see... multiple cores make it EASIER to optimise
programs, since your optimisations are often multiplied.

cu,