
On 24/06/15 07:01, Eric Snow wrote:
> In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought.
Windows is really the problem. The absence of fork() is especially hurtful for an interpreted language like Python, in my opinion.
If you think avoiding IPC is clever, you are wrong. IPC is very fast; in fact, programs written to use MPI tend to perform and scale better than programs written to use OpenMP in parallel computing.
> I'd love to learn more about that. I'm sure there are some great lessons on efficiently and safely sharing data between isolated execution environments. That said, how does IPC compare to passing objects around within the same process?
There are two major competing standards for parallel computing in science and engineering: OpenMP and MPI. OpenMP is based on a shared-memory model. MPI is based on a distributed-memory model and uses message passing (hence its name).

The common implementations of OpenMP (GNU, Intel, Microsoft) are all implemented with threads. There are also OpenMP implementations for clusters (e.g. Intel), but from the programmer's perspective OpenMP is a shared-memory model. The common implementations of MPI (MPICH, OpenMPI, Microsoft MPI) use processes instead of threads. Processes can run on the same computer or on different computers (aka "clusters"). On localhost, shared memory is commonly used for message passing; on clusters, MPI implementations will use networking protocols.

The take-home message is that OpenMP is conceptually easier to use, but programs written to use MPI tend to be faster and scale better. This is even true when using a single computer, e.g. a laptop with one multicore CPU. Here is the tl;dr explanation:

As for ease of programming, it is easier to create a deadlock or livelock with MPI than with OpenMP, even though programs written to use MPI tend to need fewer synchronization points. There is also less boilerplate code to type when using OpenMP, because we do not have to code object serialization, message passing, and object deserialization.

As for performance, programs written to use MPI seem to have a larger overhead because they require object serialization and message passing, whereas OpenMP threads can just share the same objects. The reality is actually the opposite, and it is due to the internals of modern CPUs, particularly hierarchical memory, branch prediction and long pipelines.

Because of hierarchical memory, the caches used by CPUs and CPU cores must be kept in sync. Thus when using OpenMP (threads) there will be a lot of synchronization going on that the programmer does not see, but which the hardware will do behind the scenes. There will also be a lot of data passing between the various cache levels on the CPU and RAM. If a core writes to a piece of memory it keeps in a cache line, a cascade of data traffic and synchronization can be triggered across all CPUs and cores. Not only will this stall the CPUs and prompt them to synchronize cache with RAM, it also invalidates their branch prediction, and they must flush their pipelines and throw away work they have already done. The end result is a program that does not scale or perform very well, even though it does not seem to have any explicit synchronization points that could explain this. The term "false sharing" is often used to describe this problem.

Programs written to use MPI are the opposite. There, every instance of synchronization and message passing is visible. When a CPU core writes to memory kept in a cache line, it will never trigger synchronization and data traffic across all the CPUs. The scalability is as the program predicts. And even though memory and objects are not shared, there is actually much less data traffic going on.

Which to use? Most people find it easier to use OpenMP, and it does not require a big runtime environment to be installed. But programs written to use MPI tend to be faster and more scalable. If you need to ensure scalability on multicores, multiple processes are better than multiple threads.

The scalability of MPI also applies to Python's multiprocessing. It is the isolated virtual memory of each process that allows the cores to run at full speed.
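To make the message-passing model concrete, here is a minimal sketch using mpi4py (assuming the third-party mpi4py package and an MPI runtime are installed; the file name and payload are just placeholders). Every transfer is an explicit send/recv pair, so all the synchronization is visible in the code:

    # mpi_demo.py -- run under an MPI runtime, e.g.: mpiexec -n 2 python mpi_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD    # communicator spanning all started processes
    rank = comm.Get_rank()   # this process' id within the communicator

    if rank == 0:
        # rank 0 pickles a Python object and sends it to rank 1
        comm.send({'answer': 42}, dest=1, tag=0)
    elif rank == 1:
        # rank 1 blocks until the message from rank 0 arrives
        data = comm.recv(source=0, tag=0)
        print('rank 1 received', data)

There is no hidden traffic between the two processes beyond that one explicit message, which is the point made above.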
Another thing to note is that Windows is not a second-class citizen when using MPI. The MPI runtime (usually an executable called mpirun or mpiexec) starts and manages a group of processes. It does not matter if they are started by fork() or CreateProcess().
> Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference count, it does not hurt as much as it otherwise would. The GIL hides most of the scalability impact by serializing the flow of execution.
> IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
multiprocessing does use IPC. Otherwise the processes could not communicate. One example is multiprocessing.Queue, which uses a pipe and a semaphore.
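For illustration, here is a small sketch of that mechanism using only the standard library; the queue pickles the object, ships it to the parent over a pipe, and coordinates access with a semaphore:

    from multiprocessing import Process, Queue

    def worker(q):
        # runs in a separate process with its own virtual memory
        q.put('hello from the child process')

    if __name__ == '__main__':
        q = Queue()                            # backed by a pipe and a semaphore
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())                         # blocks until the child has sent the message
        p.join()

Sturla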