[CentralOH] embarrassingly parallel loops question

Neil Ludban nludban at columbus.rr.com
Sun Jul 31 10:03:23 EDT 2016


Linear or better is possible,

https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup

but unlikely with a high-level interpreted language in only 30ms.  The
communication and synchronization between processes is the first thing
to look at, especially if it's on a laptop with a power-conserving
scheduler policy.  Try implementing what TBB calls grain size, i.e.
break the list of tasks up into exactly numCores sublists to minimize
the number of communications between processes.
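
A rough sketch of that idea (crunch() is a stand-in for whatever the
per-item work really is): each worker receives one contiguous sublist,
so there is one round trip per core instead of one per item.

    import multiprocessing

    def crunch(i):
        # Hypothetical per-item work function.
        return i * i

    def crunchChunk(chunk):
        # One task per worker: crunch a whole sublist locally.
        return [crunch(i) for i in chunk]

    if __name__ == '__main__':
        numCores = multiprocessing.cpu_count()
        items = list(range(100000))
        # Split the work into exactly numCores contiguous sublists.
        size = (len(items) + numCores - 1) // numCores
        chunks = [items[i:i + size]
                  for i in range(0, len(items), size)]
        pool = multiprocessing.Pool(numCores)
        nested = pool.map(crunchChunk, chunks)
        results = [r for sub in nested for r in sub]

(Pool.map() also accepts an optional chunksize argument that does much
the same batching automatically.)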



On Sat, 30 Jul 2016 08:38:58 -0400
Brian <bnmille at gmail.com> wrote:
> You will never get a perfectly linear speed improvement with multiple
> cores.  There is a lot of kernel overhead needed to manage them, and
> the more cores, the more overhead.  Also, are you running your tests
> on virtual machines?  Due to the way cores are allocated to a virtual
> machine, assigning more cores than needed can actually slow things
> down.
> 
> On Jul 29, 2016 2:38 PM, "Joe Shaw" <joe at joeshaw.org> wrote:
> >
> > Hi,
> >
> > If you're on Linux, one thing you might want to try is the perf
> > tool.  That might give you a sense of whether the overhead is in the
> > Python runtime, or if page faults or syscalls are the bottleneck.  If
> > you think it might be in the Python code itself, running with the
> > built-in profiler might be helpful.
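> >
> > As a sketch, the built-in profiler can be invoked directly from the
> > code (main() here stands for a hypothetical entry point that does
> > the crunching):
> >
> >     import cProfile
> >
> >     # Profile one full run, printing the most expensive calls first.
> >     cProfile.run('main()', sort='cumulative')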
> >
> > I have no idea how either of these interacts with multiprocessing,
> > however.
> >
> > Lastly, cache line locality is a big deal.  I don't know to what
> > extent you can optimize memory layout in Python programs (maybe with
> > numpy?), but if you can get data into contiguous memory you will
> > greatly improve your L1/L2 cache hit rate and the CPU won't have to
> > go to (comparatively much slower) RAM.
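> >
> > For example (a minimal sketch; the array size and the squaring are
> > just placeholders for the real crunching, and using numpy at all is
> > an assumption):
> >
> >     import numpy as np
> >
> >     # A numpy array keeps its elements in one contiguous block of
> >     # memory, unlike a list of Python objects scattered on the heap.
> >     data = np.ascontiguousarray(np.random.rand(1000000))
> >     total = (data * data).sum()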
> >
> > Joe
> >
> > On Fri, Jul 29, 2016 at 1:44 PM, Samuel <sriveravi at gmail.com> wrote:
> >>
> >> Hello Group,
> >>
> >> So I have this embarrassingly parallel number crunching I'm trying
> >> to do in a for loop.  In each iteration there is some crunching that
> >> is independent of all other iterations, so I was able to set this up
> >> pretty easily using a multiprocessing pool.  (Side detail: each
> >> iteration depends on some common data structures that I make global,
> >> which gives me the fastest crunch time versus passing them to each
> >> worker explicitly.)  It takes about 30ms to run:
> >>
> >>
> >> import multiprocessing
> >> pool = multiprocessing.Pool(numCores)
> >> results = pool.map(crunchFunctionIter, xrange(len(setN)))
> >>
> >>
> >> Running on 1 core, tiny slowdown (~5ms overhead, ~35ms to run).
> >> Running on 2 cores I get about a 2x speedup, which is great and
> >> expected (~18ms to run).  But the speedup saturates there and I
> >> can't get more juice even when upping to 4 or 6 cores.
> >>
> >> The thing is, all iterations are pretty much independent, so I don't
> >> see why in theory I don't get close to a linear speedup, or at least
> >> an (N-1) speedup.  My guess is there is something weird with the
> >> memory sharing that is causing unnecessary overhead.  Another
> >> colleague doing a similar embarrassingly parallel problem saw the
> >> same saturation at about 2 cores.
> >>
> >> Any thoughts on what is going on, or what I need to do to make this
> >> embarrassingly parallel thing speed up linearly?  Should I just use
> >> a different library and set up my data structures a different way?
> >>
> >> Thanks,
> >> Sam

