[Python-ideas] solving multi-core Python
Sturla Molden
sturla.molden at gmail.com
Wed Jun 24 17:26:59 CEST 2015
On 24/06/15 07:01, Eric Snow wrote:
> In return, my question is, what is the level of effort to get fork+IPC
> to do what we want vs. subinterpreters? Note that we need to
> accommodate Windows as more than an afterthought
Windows is really the problem. The absence of fork() is especially
hurtful for an interpreted language like Python, in my opinion.
>> If you think avoiding IPC is clever, you are wrong. IPC is very fast, in
>> fact programs written to use MPI tends to perform and scale better than
>> programs written to use OpenMP in parallel computing.
>
> I'd love to learn more about that. I'm sure there are some great
> lessons on efficiently and safely sharing data between isolated
> execution environments. That said, how does IPC compare to passing
> objects around within the same process?
There are two major competing standards for parallel computing in
science and engineering: OpenMP and MPI. OpenMP is based on a shared
memory model. MPI is based on a distributed memory model and uses
message passing (hence its name).
The common implementations of OpenMP (GNU, Intel, Microsoft) are all
implemented with threads. There are also OpenMP implementations for
clusters (e.g. Intel), but from the programmer's perspective OpenMP is a
shared memory model.
The common implementations of MPI (MPICH, OpenMPI, Microsoft MPI) use
processes instead of threads. Processes can run on the same computer or
on different computers (aka "clusters"). On localhost shared memory is
commonly used for message passing, on clusters MPI implementations will
use networking protocols.
The take-home message is that OpenMP is conceptually easier to use, but
programs written to use MPI tend to be faster and scale better. This is
even true when using a single computer, e.g. a laptop with one multicore
CPU.
Here is the longer explanation:
As for ease of programming, it is easier to create a deadlock or
livelock with MPI than OpenMP, even though programs written to use MPI
tend to need fewer synchronization points. There is also less
boilerplate code to type when using OpenMP, because we do not have to
code object serialization, message passing, and object deserialization.
As for performance, programs written to use MPI seem to have a larger
overhead because they require object serialization and message passing,
whereas OpenMP threads can just share the same objects. The reality is
actually the opposite, due to the internals of modern CPUs,
particularly hierarchical memory, branch prediction and long pipelines.
Because of hierarchical memory, the caches used by CPUs and CPU cores
must be kept in sync. Thus when using OpenMP (threads) there will be a
lot of synchronization going on that the programmer does not see, but
which the hardware does behind the scenes. There will also be a lot
of data passing between the various cache levels on the CPU and RAM. If
a core writes to a piece of memory it keeps in a cache line, a cascade
of data traffic and synchronization can be triggered across all CPUs and
cores. Not only will this stall the CPUs and prompt them to synchronize
cache with RAM, it can also invalidate their branch prediction, forcing
them to flush their pipelines and throw away work they have already done.
The end result is a program that does not scale or perform very well,
even though it does not seem to have any explicit synchronization points
that could explain this. The term "false sharing" is often used to
describe this problem.
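The layout issue behind false sharing can be sketched in Python using multiprocessing shared memory. In the unpadded layout the two counters sit in adjacent bytes of the same cache line, so the hardware must bounce that line between cores; spacing them a full line apart avoids the contention. The 64-byte line size is an assumption, and from pure Python the interpreter overhead usually dwarfs the cache effect, so treat this as a structural illustration rather than a benchmark.

```python
from multiprocessing import Process, RawArray

LINE = 64            # assumed cache line size in bytes
STEPS = 100_000

def bump(arr, index, steps):
    # Each worker writes only to its own slot; no locks are needed
    # because the slots never overlap.
    for _ in range(steps):
        arr[index] += 1

def run(stride):
    # stride=1 puts both 8-byte counters in one cache line (false
    # sharing); stride=LINE//8 puts them a full line apart.
    counters = RawArray('q', stride + 1)
    workers = [Process(target=bump, args=(counters, i * stride, STEPS))
               for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counters[0], counters[stride]

if __name__ == "__main__":
    print(run(1))          # adjacent counters: correct, but line bouncing
    print(run(LINE // 8))  # padded counters: same result, less traffic
```

Both layouts compute the same answer; only the invisible hardware traffic differs, which is exactly why false sharing is so hard to spot in the source code.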
Programs written to use MPI are the opposite. There, every instance of
synchronization and message passing is visible. When a CPU core writes
to memory kept in a cache line, it will never trigger synchronization
and data traffic across all the CPUs. Scalability is what the program
predicts, and even though memory and objects are not shared, there is
actually much less data traffic going on.
Which to use? Most people find it easier to use OpenMP, and it does not
require a big runtime environment to be installed. But programs using
MPI tend to be faster and more scalable. If you need to ensure
scalability on multicores, multiple processes are better than multiple
threads. The scalability of MPI also applies to Python's
multiprocessing: it is the isolated virtual memory of each process that
allows the cores to run at full speed.
Another thing to note is that Windows is not a second-class citizen when
using MPI. The MPI runtime (usually an executable called mpirun or
mpiexec) starts and manages a group of processes. It does not matter if
they are started by fork() or CreateProcess().
> Solving reference counts in this situation is a separate issue that
> will likely need to be resolved, regardless of which machinery we use
> to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference
count, it does not hurt as much as it otherwise would. The GIL hides
most of the scalability impact by serializing the flow of execution.
> IPC sounds great, but how well does it interact with Python's memory
> management/allocator? I haven't looked closely but I expect that
> multiprocessing does not use IPC anywhere.
multiprocessing does use IPC. Otherwise the processes could not
communicate. One example is multiprocessing.Queue, which uses a pipe and
a semaphore.
Sturla