[CentralOH] embarrassingly parallel loops question

Brian bnmille at gmail.com
Sat Jul 30 08:38:58 EDT 2016


You will never get a linear speed improvement just by going multi-core.
There is a lot of kernel overhead needed to manage the extra cores, and the
more cores, the more overhead.  Also, are you running your tests on virtual
machines?  Because of the way cores are allocated to a virtual machine,
assigning more cores than needed can actually slow things down.
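
For what it's worth, an easy way to see where the speedup flattens out on a
given machine is to time the same pool.map call at several pool sizes.  A
minimal sketch (the crunch function and problem size here are made up for
illustration):

import multiprocessing
import time

def crunch(i):
    # Stand-in CPU-bound work; substitute the real per-iteration function.
    total = 0
    for j in range(50000):
        total += (i * j) % 7
    return total

if __name__ == "__main__":
    work = range(200)
    for cores in (1, 2, 4, 6):
        pool = multiprocessing.Pool(cores)
        start = time.time()
        pool.map(crunch, work)
        pool.close()
        pool.join()
        print("%d core(s): %.1f ms" % (cores, (time.time() - start) * 1000.0))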

On Jul 29, 2016 2:38 PM, "Joe Shaw" <joe at joeshaw.org> wrote:
>
> Hi,
>
> If you're on Linux, one thing you might want to try is the perf tool.
> That might give you a sense of whether the overhead is in the Python
> runtime, or whether page faults or syscalls are the bottleneck.  If you
> think it might be in the Python code itself, running with the built-in
> profiler might be helpful.
>
> I have no idea how either of these interact with multiprocessing, however.
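>
> A rough sketch of one way to get per-worker profiles (the wrapper and
> file names here are made up): run cProfile inside the worker function and
> dump one stats file per process.  perf can also be attached to an
> already-running worker with "perf record -p <pid>".
>
> import cProfile
> import os
>
> def crunch(i):
>     # Stand-in for the real per-iteration function.
>     return sum(j % 7 for j in range(i * 1000))
>
> def crunch_profiled(i):
>     # Profile one task inside the worker process and write a stats file
>     # named after the process id; each worker overwrites its file with
>     # its most recent task, so you get a per-process sample to inspect.
>     profiler = cProfile.Profile()
>     profiler.enable()
>     result = crunch(i)
>     profiler.disable()
>     profiler.dump_stats("crunch-%d.prof" % os.getpid())
>     return result
>
> # Map crunch_profiled instead of crunch, then inspect the .prof files
> # with: python -m pstats crunch-<pid>.prof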
>
> Lastly, cache line locality is a big deal.  I don't know to what extent
> you can optimize memory layout in Python programs (maybe with numpy?),
> but if you can get data in contiguous memory you will greatly improve
> your L1/L2 cache hit rate and the CPU won't have to go to (comparatively
> much slower) RAM.
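>
> As a small, hypothetical numpy illustration of that: a single ndarray is
> one contiguous block of machine floats rather than a list of boxed Python
> objects, so a pass over it walks memory sequentially.
>
> import numpy as np
>
> # One contiguous block of float64 values.
> data = np.zeros((1000, 64), dtype=np.float64)
> print(data.flags['C_CONTIGUOUS'])  # True: each row's values are adjacent
>
> # Reductions like this stream through memory, which is cache-friendly.
> row_sums = data.sum(axis=1)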
>
> Joe
>
> On Fri, Jul 29, 2016 at 1:44 PM, Samuel <sriveravi at gmail.com> wrote:
>>
>> Hello Group,
>>
>> So I have this embarrassingly parallel number crunching I'm trying to
>> do in a for loop.  In each iteration there is some crunching that is
>> independent of all other iterations, so I was able to set this up pretty
>> easily using a multiprocessing pool.  (Side detail: each iteration
>> depends on some common data structures that I make global, which gives
>> me the fastest crunch time versus passing them to each worker
>> explicitly.)  Takes about 30 ms to run:
>>
>>
>> import multiprocessing
>> pool = multiprocessing.Pool(numCores)
>> results = pool.map(crunchFunctionIter, xrange(len(setN)))
>>
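>> For reference, multiprocessing.Pool also takes an initializer argument,
>> which is one way to hand the common read-only structures to each worker
>> process exactly once; a minimal sketch with made-up names:
>>
>> import multiprocessing
>>
>> _shared = None  # set once in each worker process
>>
>> def init_worker(common_data):
>>     # Runs once per worker; keeps the common structures where the crunch
>>     # function can reach them without re-pickling the data per task.
>>     global _shared
>>     _shared = common_data
>>
>> def crunch_iter(i):
>>     # Hypothetical per-iteration function; reads _shared, never mutates it.
>>     return sum(_shared) * i
>>
>> if __name__ == "__main__":
>>     common_data = list(range(1000))
>>     pool = multiprocessing.Pool(4, initializer=init_worker,
>>                                 initargs=(common_data,))
>>     results = pool.map(crunch_iter, range(100))
>>     pool.close()
>>     pool.join()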
>>
>> Running on 1 core: tiny slowdown (~5 ms overhead, ~35 ms to run).
>> Running on 2 cores: about a 2x speedup, which is great and expected
>> (~18 ms to run).
>> But the speedup saturates there, and I can't get more juice even when
>> upping to 4 or 6 cores.
>>
>> The thing is, all iterations are pretty much independent, so I don't
>> see why in theory I don't get close to a linear speedup, or at least an
>> (N-1)x speedup.  My guess is there is something weird with the memory
>> sharing that is causing unnecessary overhead.  Another colleague doing a
>> similar embarrassingly parallel problem saw the same saturation at about
>> 2 cores.
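>>
>> (For what it's worth, pool.map also takes a chunksize argument that
>> batches iterations into fewer, larger tasks, which is one way to check
>> whether per-task dispatch overhead is part of it; the value below is
>> made up and reuses the names from the snippet above.)
>>
>> # Larger chunks mean fewer IPC round trips, at the cost of coarser
>> # load balancing across the workers.
>> results = pool.map(crunchFunctionIter, xrange(len(setN)), chunksize=64)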
>>
>> Any thoughts on what is going on, or what I need to do to make this
>> embarrassingly parallel thing speed up linearly?  Should I just use a
>> different library and set up my data structures in a different way?
>>
>> Thanks,
>> Sam
>>