
On 24/06/15 07:01, Eric Snow wrote:
> In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought.
Windows is really the problem. The absence of fork() is especially hurtful for an interpreted language like Python, in my opinion.
If you think avoiding IPC is clever, you are wrong. IPC is very fast; in fact, programs written to use MPI tend to perform and scale better than programs written to use OpenMP in parallel computing.
> I'd love to learn more about that. I'm sure there are some great lessons on efficiently and safely sharing data between isolated execution environments. That said, how does IPC compare to passing objects around within the same process?
There are two major competing standards for parallel computing in science and engineering: OpenMP and MPI. OpenMP is based on a shared-memory model. MPI is based on a distributed-memory model and uses message passing (hence its name).

The common implementations of OpenMP (GNU, Intel, Microsoft) are all implemented with threads. There are also OpenMP implementations for clusters (e.g. Intel), but from the programmer's perspective OpenMP is a shared-memory model. The common implementations of MPI (MPICH, OpenMPI, Microsoft MPI) use processes instead of threads. Processes can run on the same computer or on different computers (aka "clusters"). On localhost, shared memory is commonly used for message passing; on clusters, MPI implementations will use networking protocols.

The take-home message is that OpenMP is conceptually easier to use, but programs written to use MPI tend to be faster and scale better. This is even true when using a single computer, e.g. a laptop with one multicore CPU. Here is the tl;dr explanation:

As for ease of programming, it is easier to create a deadlock or livelock with MPI than with OpenMP, even though programs written to use MPI tend to need fewer synchronization points. There is also less boilerplate code to type when using OpenMP, because we do not have to code object serialization, message passing, and object deserialization.

As for performance, programs written to use MPI seem to have a larger overhead because they require object serialization and message passing, whereas OpenMP threads can just share the same objects. The reality is actually the opposite, and it is due to the internals of modern CPUs, particularly hierarchical memory, branch prediction and long pipelines.

Because of hierarchical memory, the caches used by CPUs and CPU cores must be kept in sync. Thus when using OpenMP (threads) there will be a lot of synchronization going on that the programmer does not see, but which the hardware will do behind the scenes. There will also be a lot of data passing between the various cache levels on the CPU and RAM. If a core writes to a piece of memory it keeps in a cache line, a cascade of data traffic and synchronization can be triggered across all CPUs and cores. Not only will this stall the CPUs and prompt them to synchronize cache with RAM, it also invalidates their branch prediction, and they must flush their pipelines and throw away work they have already done. The end result is a program that does not scale or perform very well, even though it does not seem to have any explicit synchronization points that could explain this. The term "false sharing" is often used to describe this problem.

Programs written to use MPI are the opposite. There, every instance of synchronization and message passing is visible. When a CPU core writes to memory kept in a cache line, it will never trigger synchronization and data traffic across all the CPUs. The scalability is as the program predicts. And even though memory and objects are not shared, there is actually much less data traffic going on.

Which to use? Most people find it easier to use OpenMP, and it does not require a big runtime environment to be installed. But programs written to use MPI tend to be faster and more scalable. If you need to ensure scalability on multicores, multiple processes are better than multiple threads.

The scalability of MPI also applies to Python's multiprocessing. It is the isolated virtual memory of each process that allows the cores to run at full speed.
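To make the message-passing model concrete, here is a minimal sketch using mpi4py (assuming the third-party mpi4py package and an MPI runtime are installed; the file name and payload are just placeholders). Every transfer is an explicit send/recv pair, so all the synchronization is visible in the code:

    # mpi_demo.py -- run under an MPI runtime, e.g.: mpiexec -n 2 python mpi_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD    # communicator spanning all started processes
    rank = comm.Get_rank()   # this process' id within the communicator

    if rank == 0:
        # rank 0 pickles a Python object and sends it to rank 1
        comm.send({'answer': 42}, dest=1, tag=0)
    elif rank == 1:
        # rank 1 blocks until the message from rank 0 arrives
        data = comm.recv(source=0, tag=0)
        print('rank 1 received', data)

There is no hidden traffic between the two processes beyond that one explicit message, which is the point made above.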
Another thing to note is that Windows is not a second-class citizen when using MPI. The MPI runtime (usually an executable called mpirun or mpiexec) starts and manages a group of processes. It does not matter if they are started by fork() or CreateProcess().
> Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference count, it does not hurt as much as it otherwise would. The GIL hides most of the scalability impact by serializing the flow of execution.
> IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
multiprocessing does use IPC. Otherwise the processes could not communicate. One example is multiprocessing.Queue, which uses a pipe and a semaphore.
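For illustration, here is a small sketch of that mechanism using only the standard library; the queue pickles the object, ships it to the parent over a pipe, and coordinates access with a semaphore:

    from multiprocessing import Process, Queue

    def worker(q):
        # runs in a separate process with its own virtual memory
        q.put('hello from the child process')

    if __name__ == '__main__':
        q = Queue()                            # backed by a pipe and a semaphore
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())                         # blocks until the child has sent the message
        p.join()

Sturla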