
On Sun, Jun 21, 2015 at 5:41 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
> From the perspective of software design, it would be good if the CPython interpreter provided an environment instead of using global objects. It would mean that all functions in the C API would need to take the environment pointer as their first argument, which would be a major rewrite. It would also allow a "one interpreter per thread" design similar to tcl and .NET application domains.
While perhaps a worthy goal, I don't know that it fits in well with my goals. I'm aiming for an improved multi-core story with a minimum of change in the interpreter.
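For context, the shape of the idea Sturla describes, shrunk down to a toy Python sketch rather than the real C API (all names here are made up for illustration):

    class InterpEnv:
        """Per-interpreter state passed around explicitly instead of living in globals."""
        def __init__(self):
            self.modules = {}
            self.recursion_limit = 1000

    def import_module(env, name):
        # Every operation takes the environment as its first argument.
        mod = env.modules.get(name)
        if mod is None:
            mod = object()          # stand-in for the real import machinery
            env.modules[name] = mod
        return mod

    env_a, env_b = InterpEnv(), InterpEnv()
    import_module(env_a, "json")
    print("json" in env_a.modules, "json" in env_b.modules)   # True False

Two such environments coexist in one process, which is what makes one-interpreter-per-thread designs possible.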
> However, from the perspective of multi-core parallel computing, I am not sure what this offers over using multiple processes.
> Yes, you avoid the process startup time, but on POSIX systems a fork is very fast. And certainly, forking is much more efficient than serializing Python objects.
You still need the mechanism to safely and efficiently share (at least some) objects between interpreters after forking. I expect this will be simpler within the same process.
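To make that concrete, here is a minimal POSIX-only sketch (illustrative names) of what fork alone gives you: the child sees the parent's objects with no serialization, but copy-on-write means nothing the child does is visible back in the parent, so some additional sharing mechanism is still needed:

    import os

    big_table = {i: str(i) for i in range(100000)}   # built once in the parent

    pid = os.fork()
    if pid == 0:
        # Child: reads the parent's dict directly; nothing was pickled.
        print("child sees", len(big_table), "entries")
        big_table[0] = "changed in child"            # copy-on-write: the parent never sees this
        os._exit(0)
    else:
        os.waitpid(pid, 0)
        print("parent still has", big_table[0])      # prints "0"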
> It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython.
We cannot leave Windows out in the cold.
> You also have to start up a subinterpreter and a thread, which is not instantaneous. So I am not sure there is a lot to gain here over calling os.fork.
One key difference is that with a subinterpreter you are basically starting with a clean slate. The isolation between interpreters extends to the initial state. That level of isolation is a desirable feature because you can more clearly reason about the state of the running tasks.
> An invalid argument for this kind of design is that only code which uses threads for parallel computing is "real" multi-core code. So Python does not support multiple cores because multiprocessing or os.fork is just faking it. This is an argument that belongs in the intellectual junk yard. It stems from the abuse of threads among Windows and Java developers, and is rooted in the absence of fork on Windows and the formerly slow fork on Solaris. As a result, they are only able to think in terms of threads. If threading.Thread does not scale the way they want, they think multiple cores are out of reach.
Well, perception is 9/10ths of the law. :) If the multi-core problem is already solved in Python, then why does it fail in the court of public opinion? The perception that Python lacks a good multi-core story is real, leads organizations away from Python, and will not improve without concrete changes. Contrast that with Go or Rust or many other languages that make it simple to leverage multiple cores (even if most people never need to).
> So the question is, how do you want to share objects between subinterpreters? And why is it better than IPC, when your idea is to isolate subinterpreters like application domains?
In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought (or second-class citizen), as well as other execution environments (e.g. embedded) where we may not be able to fork.
> If you think avoiding IPC is clever, you are wrong. IPC is very fast; in fact, programs written to use MPI tend to perform and scale better in parallel computing than programs written to use OpenMP.
I'd love to learn more about that. I'm sure there are some great lessons on efficiently and safely sharing data between isolated execution environments. That said, how does IPC compare to passing objects around within the same process?
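For anyone who wants to experiment, the MPI comparison is easy to try from Python. A small point-to-point sketch with mpi4py (assuming it is installed; run with something like "mpiexec -n 2 python demo.py"; the payload is made up):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"payload": list(range(10))}
        comm.send(data, dest=1, tag=11)    # lowercase send pickles arbitrary objects
    elif rank == 1:
        data = comm.recv(source=0, tag=11)
        print("rank 1 received", data)

The lowercase send/recv pickle arbitrary Python objects; the uppercase Send/Recv variants move raw buffers (e.g. NumPy arrays) without pickling, which is where most of MPI's speed comes from.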
> Not only is IPC fast, but you also avoid an issue called "false sharing", which can be even more detrimental than the GIL: you have parallel code, but it seems to run serially, even though there is no explicit serialization anywhere. And since Murphy's law is working against us, Python reference counts will be falsely shared unless we use multiple processes.
Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
> The reason IPC in multiprocessing is slow is the calls to pickle, not the IPC itself. A pipe or a Unix domain socket (a named pipe on Windows) has the overhead of a memcpy in the kernel, plus some tiny constant overhead. And if you need two processes to share memory, there is something called shared memory. Thus, we can send data between processes just as fast as between subinterpreters.
IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
> All in all, I think we are better off finding a better way to share Python objects between processes.
I expect that whatever solution we would find for subinterpreters would have a lot in common with the same thing for processes.
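As a point of reference for Sturla's shared memory comment above, multiprocessing already exposes shared ctypes buffers that two processes read and write in place, with no pickling involved. A minimal sketch (names are illustrative):

    import multiprocessing as mp

    def fill_squares(shared):
        # Runs in the child process; writes land directly in the shared buffer.
        for i in range(len(shared)):
            shared[i] = i * i

    if __name__ == "__main__":        # guard needed for Windows (spawn)
        shared = mp.Array("d", 5)     # five C doubles in shared memory
        p = mp.Process(target=fill_squares, args=(shared,))
        p.start()
        p.join()
        print(shared[:])              # [0.0, 1.0, 4.0, 9.0, 16.0]

Objects that go over a Pipe or Queue still get pickled, so the interesting question for both processes and subinterpreters is how to get actual Python objects, not just flat buffers, across the boundary cheaply.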
> P.S. Another thing to note is that with subinterpreters, you can forget about using ctypes or anything else that uses the simplified GIL API (e.g. certain Cython-generated extensions).
On the one hand there are some rough edges with subinterpreters that need to be fixed. On the other hand, we will have to restrict the subinterpreter model (at least initially) in ways that would likely preclude operation of existing extension modules.

-eric