[Python-ideas] solving multi-core Python
ericsnowcurrently at gmail.com
Wed Jun 24 07:01:24 CEST 2015
On Sun, Jun 21, 2015 at 5:41 AM, Sturla Molden <sturla.molden at gmail.com> wrote:
> From the perspective of software design, it would be good if the CPython
> interpreter provided an environment instead of using global objects. It
> would mean that all functions in the C API would need to take the
> environment pointer as their first variable, which will be a major rewrite.
> It would also allow the "one interpreter per thread" design similar to Tcl
> and .NET application domains.
While perhaps a worthy goal, I don't know that it fits in well with my
goals. I'm aiming for an improved multi-core story with a minimum of
change in the interpreter.
> However, from the perspective of multi-core parallel computing, I am not
> sure what this offers over using multiple processes.
> Yes, you avoid the process startup time, but on POSIX systems a fork is very
> fast. And certainly, forking is much more efficient than serializing Python
> objects.
You still need the mechanism to safely and efficiently share (at least
some) objects between interpreters after forking. I expect this will
be simpler within the same process.
> It then boils down to a workaround for the fact that Windows cannot
> fork, which makes it particularly bad for running CPython.
We cannot leave Windows out in the cold.
> You also have to
> start up a subinterpreter and a thread, which is not instantaneous. So I am
> not sure there is a lot to gain here over calling os.fork.
One key difference is that with a subinterpreter you are basically
starting with a clean slate. The isolation between interpreters
extends to the initial state. That level of isolation is a desirable
feature because you can more clearly reason about the state of the
subinterpreter.
> A non-valid argument for this kind of design is that only code which uses
> threads for parallel computing is "real" multi-core code. So Python does not
> support multi-cores because multiprocessing or os.fork is just faking it.
> This is an argument that belongs in the intellectual junk yard. It stems
> from the abuse of threads among Windows and Java developers, and is rooted
> in the absence of fork on Windows and the formerly slow fork on Solaris. And
> thus they are only able to think in terms of threads. If threading.Thread
> does not scale the way they want, they think multicores are out of reach.
Well, perception is 9/10ths of the law. :) If the multi-core problem
is already solved in Python, then why does it fail in the court of
public opinion? The perception that Python lacks a good multi-core
story is real, leads organizations away from Python, and will not
improve without concrete changes. Contrast that with Go or Rust or
many other languages that make it simple to leverage multiple cores
(even if most people never need to).
> So the question is, how do you want to share objects between
> subinterpreters? And why is it better than IPC, when your idea is to isolate
> subinterpreters like application domains?
In return, my question is, what is the level of effort to get fork+IPC
to do what we want vs. subinterpreters? Note that we need to
accommodate Windows as more than an afterthought (or second-class
citizen), as well as other execution environments (e.g. embedded)
where we may not be able to fork.
> If you think avoiding IPC is clever, you are wrong. IPC is very fast, in
> fact programs written to use MPI tend to perform and scale better than
> programs written to use OpenMP in parallel computing.
I'd love to learn more about that. I'm sure there are some great
lessons on efficiently and safely sharing data between isolated
execution environments. That said, how does IPC compare to passing
objects around within the same process?
> Not only is IPC fast,
> but you also avoid an issue called "false sharing", which can be even more
> detrimental than the GIL: You have parallel code, but it seems to run in
> serial, even though there is no explicit serialization anywhere. And
> since Murphy's law is working against us, Python reference counts will be
> false shared unless we use multiple processes.
Solving reference counts in this situation is a separate issue that
will likely need to be resolved, regardless of which machinery we use
to isolate task execution.
> The reason IPC in multiprocessing is slow is due to calling pickle, it is
> not the IPC in itself. A pipe or a Unix socket (named pipe on Windows) has
> the overhead of a memcpy in the kernel, which is equal to a memcpy plus some
> tiny constant overhead. And if you need two processes to share memory, there
> is something called shared memory. Thus, we can send data between processes
> just as fast as between subinterpreters.
IPC sounds great, but how well does it interact with Python's memory
management/allocator? I haven't looked closely but I expect that
multiprocessing does not use IPC anywhere.
> All in all, I think we are better off finding a better way to share Python
> objects between processes.
I expect that whatever solution we would find for subinterpreters
would have a lot in common with the same thing for processes.
> P.S. Another thing to note is that with sub-interpreters, you can forget
> about using ctypes or anything else that uses the simplified GIL API (e.g.
> certain Cython generated extensions).
On the one hand there are some rough edges with subinterpreters that
need to be fixed. On the other hand, we will have to restrict the
subinterpreter model (at least initially) in ways that would likely
preclude operation of existing extension modules.