
On Thu, Sep 7, 2017 at 4:23 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
> On 7 September 2017 at 15:48, Nathaniel Smith <njs@pobox.com> wrote:
>> I've actually argued with the PyPy devs to try to convince them to add subinterpreter support as part of their experiments with GIL-removal, because I think the semantics would genuinely be nicer to work with than raw threads, but they're convinced that it's impossible to make this work. Or more precisely, they think you could make it work in theory, but that it would be impossible to make it meaningfully more efficient than using multiple processes. I want them to be wrong, but I have to admit I can't see a way to make it work either...
>
> The gist of the idea is that with subinterpreters, your starting point is multiprocessing-style isolation (i.e. you have to use pickle to transfer data between subinterpreters), but you're actually running in a shared-memory threading context from the operating system's perspective, so you don't need to rely on mmap to share memory over a non-streaming interface.
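
For concreteness, the multiprocessing-style baseline being compared against looks roughly like this minimal sketch -- everything crossing the Pipe gets pickled by send() and unpickled by recv() on the other side:

# Minimal multiprocessing baseline: every object crossing the Pipe is
# serialized with pickle in one process and deserialized in the other.
import multiprocessing

def worker(conn):
    obj = conn.recv()                  # unpickles whatever the parent sent
    conn.send(sum(obj["payload"]))     # the result is pickled on the way back

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=worker, args=(child_conn,))
    proc.start()
    parent_conn.send({"payload": list(range(10000))})   # pickled here
    print(parent_conn.recv())                           # 49995000
    proc.join()
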
The challenge is that streaming bytes between processes is actually really fast -- you don't really need mmap for that. (Maybe this was important for X11 back in the 1980s, but a lot has changed since then :-).)

And if you want to use pickle and multiprocessing to send, say, a single big numpy array between processes, that's also really fast, because it's basically just a few memcpy's. The slow case is passing complicated objects between processes, and it's slow because pickle has to walk the object graph to serialize it, and walking the object graph is slow. Copying object graphs between subinterpreters has the same problem.

So the only case I can see where I'd expect subinterpreters to make communication dramatically more efficient is if you have a "deeply immutable" type: one where not only are its instances immutable, but all objects reachable from those instances are also guaranteed to be immutable. So like, a tuple except that when you instantiate it, it validates that all of its elements are also marked as deeply immutable, and errors out if not. (There's a rough sketch of what I mean at the end of this message.) Then when you go to send this between subinterpreters, you can tell by checking the type of the root object that the whole graph is immutable, so you don't need to walk it yourself.

However, it seems impossible to support user-defined deeply-immutable types in Python: types and functions are themselves mutable and hold tons of references to other potentially mutable objects via __mro__, __globals__, __weakref__, etc. etc., so even if a user-defined instance can be made logically immutable, it's still going to hold references to mutable things.

So the one case where subinterpreters win is if you have a really big and complicated set of nested pseudo-tuples of ints and strings and you're bottlenecked on passing it between interpreters. Maybe frozendicts too. Is that enough to justify the whole endeavor? It seems dubious to me.

I guess the other case where subprocesses lose to "real" threads is startup time on Windows. But starting a subinterpreter is also much more expensive than starting a thread, once you take into account the cost of loading the application's modules into the new interpreter. In both cases you end up needing some kind of process/subinterpreter pool or cache to amortize that cost.

Obviously I'm committing the cardinal sin of trying to guess about performance based on theory instead of measurement, so maybe I'm wrong. Or maybe there's some deviously clever trick I'm missing. I hope so -- a really useful subinterpreter multi-core story would be awesome.

-n

--
Nathaniel J. Smith -- https://vorpus.org
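
P.S. Here's the rough sketch of the "deeply immutable" idea mentioned above. It's purely illustrative -- FrozenTuple and its whitelist of element types are invented for this example, and nothing like them exists in CPython today:

# Hypothetical "deeply immutable" container: construction fails unless
# every element is itself known to be deeply immutable, so a receiving
# interpreter could trust the whole object graph after checking only
# the type of the root object.

DEEPLY_IMMUTABLE = (int, float, complex, str, bytes, bool, type(None))

class FrozenTuple(tuple):
    def __new__(cls, items):
        items = tuple(items)
        for item in items:
            if not isinstance(item, DEEPLY_IMMUTABLE + (FrozenTuple,)):
                raise TypeError("%s is not deeply immutable"
                                % type(item).__name__)
        return super().__new__(cls, items)

# Fine: everything reachable from the root is immutable.
ok = FrozenTuple([1, "two", FrozenTuple([3.0, b"four"])])

# Rejected up front: lists are mutable, so the graph couldn't be handed
# to another interpreter without walking/copying it.
try:
    FrozenTuple([1, [2, 3]])
except TypeError as exc:
    print(exc)   # list is not deeply immutable
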