[Python-ideas] solving multi-core Python

Gregory P. Smith greg at krypto.org
Wed Jun 24 21:31:56 CEST 2015


On Wed, Jun 24, 2015 at 8:27 AM Sturla Molden <sturla.molden at gmail.com>
wrote:

> On 24/06/15 07:01, Eric Snow wrote:
>
> > In return, my question is, what is the level of effort to get fork+IPC
> > to do what we want vs. subinterpreters?  Note that we need to
> > accommodate Windows as more than an afterthought.
>
> Windows is really the problem. The absence of fork() is especially
> hurtful for an interpreted language like Python, in my opinion.
>

You cannot assume that fork() is safe on any OS as a general solution for
anything.  This isn't a Windows-specific problem: fork() simply cannot be
relied upon in a general-purpose library at all, because it is incompatible
with threads.

fork() can only be used safely as a top-level application decision: there
must be a guarantee that no threads are running before all forking is done.
That is why a generic library cannot rely on it to do anything useful - as a
library, you don't know what the whole application is doing or at what point
you were called as part of it.

A concurrency model that assumes it is fine to fork() and let child
processes continue to execute is not usable by everyone (i.e.
multiprocessing until http://bugs.python.org/issue8713 was implemented).
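
An illustrative sketch of what issue8713 enables (assuming Python 3.4+; the
work() function here is hypothetical): the "spawn" and "forkserver" start
methods let multiprocessing create workers in a fresh interpreter instead of
forking a possibly multi-threaded parent.

    import multiprocessing as mp

    def work(x):
        # hypothetical task; runs in a freshly started interpreter,
        # not in a fork of the (possibly threaded) parent
        return x * x

    if __name__ == '__main__':
        # 'forkserver' is another thread-safe option on POSIX
        ctx = mp.get_context('spawn')
        with ctx.Pool(4) as pool:
            print(pool.map(work, range(10)))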

-gps


>
> >> If you think avoiding IPC is clever, you are wrong. IPC is very fast; in
> >> fact, programs written to use MPI tend to perform and scale better than
> >> programs written to use OpenMP in parallel computing.
> >
> > I'd love to learn more about that.  I'm sure there are some great
> > lessons on efficiently and safely sharing data between isolated
> > execution environments.  That said, how does IPC compare to passing
> > objects around within the same process?
>
> There are two major competing standards for parallel computing in
> science and engineering: OpenMP and MPI. OpenMP is based on a shared
> memory model. MPI is based on a distributed memory model and uses message
> passing (hence its name).
>
> The common implementations of OpenMP (GNU, Intel, Microsoft) are all
> implemented with threads. There are also OpenMP implementations for
> clusters (e.g. Intel), but from the programmer's perspective OpenMP is a
> shared memory model.
>
> The common implementations of MPI (MPICH, OpenMPI, Microsoft MPI) use
> processes instead of threads. Processes can run on the same computer or
> on different computers (aka "clusters"). On localhost, shared memory is
> commonly used for message passing; on clusters, MPI implementations will
> use networking protocols.
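>
> A minimal sketch of this message-passing model, using the third-party
> mpi4py package (an assumed dependency here; any Python MPI binding would
> do), launched with e.g. "mpiexec -n 2 python demo.py":
>
>     from mpi4py import MPI
>
>     comm = MPI.COMM_WORLD
>     rank = comm.Get_rank()          # each process has its own rank
>
>     if rank == 0:
>         # rank 0 pickles the object and sends it as a message
>         comm.send({'payload': list(range(5))}, dest=1, tag=0)
>     elif rank == 1:
>         # rank 1 blocks until the message arrives, then unpickles it
>         data = comm.recv(source=0, tag=0)
>         print('rank 1 received', data)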
>
> The take-home message is that OpenMP is conceptually easier to use, but
> programs written to use MPI tend to be faster and scale better. This is
> even true when using a single computer, e.g. a laptop with one multicore
> CPU.
>
>
> Here is the explanation behind the tl;dr:
>
> As for ease of programming, it is easier to create a deadlock or
> livelock with MPI than OpenMP, even though programs written to use MPI
> tend to need fewer synchronization points. There is also less
> boilerplate code to type when using OpenMP, because we do not have to
> code object serialization, message passing, and object deserialization.
>
> As for performance, programs written to use MPI might seem to have a
> larger overhead because they require object serialization and message
> passing, whereas OpenMP threads can simply share the same objects. The
> reality is actually the opposite, due to the internals of modern CPUs,
> particularly hierarchical memory, branch prediction and long pipelines.
>
> Because of hierarchical memory, the caches used by CPUs and CPU cores
> must be kept in sync. Thus when using OpenMP (threads) there will be a
> lot of synchronization going on that the programmer does not see, but
> which the hardware will do behind the scenes. There will also be a lot
> of data passing between various cache levels on the CPU and RAM. If a
> core writes to a piece of memory it keeps in a cache line, a cascade of
> data traffic and synchronization can be triggered across all CPUs and
> cores. Not only will this stop the CPUs and prompt them to synchronize
> cache with RAM, it also invalidates their branch prediction and they
> must flush their pipelines and throw away work they have already done.
> The end result is a program that does not scale or perform very well,
> even though it does not seem to have any explicit synchronization points
> that could explain this. The term "false sharing" is often used to
> describe this problem.
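>
> A rough sketch of the cache-line layout issue using multiprocessing shared
> ctypes (an illustrative example; in pure Python the interpreter overhead
> may mask much of the timing difference, but the layout point stands):
> slots 0 and 1 below sit on the same 64-byte cache line, while slots 0 and
> 16 are far enough apart to avoid false sharing.
>
>     import ctypes
>     import time
>     import multiprocessing as mp
>     from multiprocessing.sharedctypes import RawArray
>
>     def bump(counters, index, n):
>         # each process hammers its own slot, but the hardware still has
>         # to keep the containing cache line coherent across cores
>         for _ in range(n):
>             counters[index] += 1
>
>     if __name__ == '__main__':
>         counters = RawArray(ctypes.c_longlong, 32)
>         for slots in ((0, 1), (0, 16)):   # same cache line vs. padded apart
>             procs = [mp.Process(target=bump, args=(counters, i, 1000000))
>                      for i in slots]
>             t = time.time()
>             for p in procs:
>                 p.start()
>             for p in procs:
>                 p.join()
>             print(slots, round(time.time() - t, 3))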
>
> Programs written to use MPI are the opposite: there, every instance of
> synchronization and message passing is explicit. When a CPU core writes
> to memory kept in a cache line, it will never trigger synchronization
> and data traffic across all the CPUs. The scalability is as the program
> predicts. And even though memory and objects are not shared, there is
> actually much less data traffic going on.
>
> Which to use? Most people find it easier to use OpenMP, and it does not
> require a big runtime environment to be installed. But programs using
> MPI tend to be faster and more scalable. If you need to ensure
> scalability on multicores, multiple processes are better than multiple
> threads. The scalability of MPI also applies to Python's
> multiprocessing. It is the isolated virtual memory of each process that
> allows the cores to run at full speed.
>
> Another thing to note is that Windows is not a second-class citizen when
> using MPI. The MPI runtime (usually an executable called mpirun or
> mpiexec) starts and manages a group of processes. It does not matter if
> they are started by fork() or CreateProcess().
>
>
>
> > Solving reference counts in this situation is a separate issue that
> > will likely need to be resolved, regardless of which machinery we use
> > to isolate task execution.
>
> As long as we have a GIL, and we need the GIL to update a reference
> count, it does not hurt as much as it otherwise would. The GIL hides
> most of the scalability impact by serializing the flow of execution.
>
>
>
> > IPC sounds great, but how well does it interact with Python's memory
> > management/allocator?  I haven't looked closely but I expect that
> > multiprocessing does not use IPC anywhere.
>
> multiprocessing does use IPC. Otherwise the processes could not
> communicate. One example is multiprocessing.Queue, which uses a pipe and
> a semaphore.
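>
> A minimal sketch of that (illustrative only): the Queue below is backed by
> a pipe and a semaphore, so the dictionary crosses the process boundary via
> IPC rather than via shared objects.
>
>     import multiprocessing as mp
>
>     def worker(q):
>         # the object is pickled, written to the pipe, and read back
>         # by the parent process
>         q.put({'msg': 'hello from the child'})
>
>     if __name__ == '__main__':
>         q = mp.Queue()
>         p = mp.Process(target=worker, args=(q,))
>         p.start()
>         print(q.get())
>         p.join()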
>
>
>
> Sturla
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>