
I'm going to break mail client threading and also answer some of your other emails here.

On Tue, Jun 23, 2015 at 10:26 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
It sounded like you were suggesting that we factor out a common code base that could be used by multiprocessing and the other machinery and that only multiprocessing would keep the pickle-related code.
Yes, I like that idea a lot.
Compare with forking, where the initialization is all done and then you fork, and you are immediately ready to serve, using the data structures shared with all the other workers, which are only copied when they are written to. So forking starts up faster and uses less memory (due to shared memory).
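To make the fork picture concrete, here's a minimal sketch (load_model/MODEL/worker are made-up names): all the initialization happens once in the parent, and each forked worker reads the result through copy-on-write pages, so nothing is rebuilt or pickled per worker.

import multiprocessing as mp

# Hypothetical expensive setup: build a large read-only structure once.
def load_model():
    return {i: i * i for i in range(1000000)}

MODEL = None  # filled in by the parent before the pool forks

def worker(key):
    # Each forked child sees MODEL through copy-on-write pages; the
    # structure itself is never pickled or rebuilt in the child.
    return MODEL[key]

if __name__ == "__main__":
    MODEL = load_model()              # initialization happens once, pre-fork
    ctx = mp.get_context("fork")      # POSIX-only start method
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, [1, 2, 3, 4]))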
But we are aiming for a share-nothing model with an efficient object-passing mechanism. Furthermore, subinterpreters do not have to be single-use. My proposal includes running tasks in an existing subinterpreter (e.g. executor pool), so that start-up cost is mitigated in cases where it matters.
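For comparison, here's a rough analogy using today's multiprocessing rather than subinterpreters (worker/inbox/outbox are invented names): long-lived workers that share nothing and receive tasks as messages, so start-up cost is paid once and each task only pays for object passing.

import multiprocessing as mp

def worker(inbox, outbox):
    # Started once, then reused for many tasks, so per-task cost
    # excludes process/interpreter start-up.
    for task in iter(inbox.get, None):
        outbox.put(task * task)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")     # share-nothing from the start
    inbox, outbox = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=worker, args=(inbox, outbox))
               for _ in range(2)]
    for w in workers:
        w.start()
    for n in range(10):               # objects are passed, never shared
        inbox.put(n)
    print(sorted(outbox.get() for _ in range(10)))
    for _ in workers:
        inbox.put(None)               # sentinel: tell each worker to exit
    for w in workers:
        w.join()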
Note that ultimately my goal is to make it obvious and undeniable that Python (3.6+) has a good multi-core story. In my proposal, subinterpreters are a means to an end. If there's a better solution then great! As long as the real goal is met I'll be satisfied. :) For now I'm still confident that the subinterpreter approach is the best option for meeting the goal.
Ahead of time: the following is my opinion. My opinions are my own and bizarre, unlike the opinions of my employer and coworkers. (Who may also be reading this.)

There are two reasons I can think of to use threads for CPU parallelism:

- My thing does a lot of parallel work, so I want to save memory by sharing an address space. This only becomes an especially pressing concern if you start running tens of thousands of workers or more. Fork also allows this.

- My thing does a lot of communication, so I want fast communication through a shared address space. This can become a pressing concern immediately, and so is a more visible issue. However, it's also a non-problem for many kinds of tasks that just take requests in and put output back out, without talking to other members of the pool (e.g. an RPC server or HTTP server). I would also speculate that once you're running on many machines, unless you're very careful with your design, RPC costs dominate IPC costs to the point where optimizing IPC doesn't buy you much. And on Unix, IPC can be free or cheap anyway, thanks to shared memory.

Threads really aren't all that important, and if we need them, we have them. When people tell me in #python that multi-core in Python is bad because of the GIL, I point them at fork and at C extensions, but also at PyPy-STM and Jython. Everything has problems, but then so does this proposal, right?
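Here is a minimal sketch of the "point them at fork" answer (fib is a stand-in for any CPU-bound pure-Python function): each pool worker is a separate process with its own GIL, so the calls genuinely run on multiple cores.

import multiprocessing as mp

def fib(n):
    # Stand-in for arbitrary CPU-bound pure-Python work.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    ctx = mp.get_context("fork")      # use "spawn" where fork isn't available
    with ctx.Pool() as pool:
        # Each task runs in its own process with its own GIL, so the
        # eight computations can occupy up to eight cores at once.
        print(pool.map(fib, [30] * 8))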
And this is faster than passing objects around within the same process? Does it play well with Python's memory model?
As far as whether it plays with the memory model, multiprocessing.Value() just works, today. To make it even lower overhead (not construct an int PyObject* on the fly), you need to change things, e.g. the way refcounts work. I think it's possibly feasible. If not, at least the overhead would be negligible. Same applies to strings and other non-compound datatypes. Compound datatypes are hard even for the subinterpreter case, just because the objects you're referring to are not likely to exist on the other end, so you need a real copy. I'm sure you've thought about this. multiprocessing.Array has a solution for this, which is to unbox the contained values. It won't work with tuples.
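A small sketch of that point (the counter/samples names and values are just for illustration): the int and the doubles live in shared memory as raw, unboxed C values, a Python object is only constructed when you read them, and nothing like a tuple could be stored this way.

import multiprocessing as mp

def work(index, counter, samples):
    with counter.get_lock():
        counter.value += 1            # an int PyObject* is built only on access
    samples[index] = index * 0.5      # stored as a raw C double, not a PyObject*

if __name__ == "__main__":
    counter = mp.Value("i", 0)        # one unboxed C int in shared memory
    samples = mp.Array("d", 5)        # five unboxed C doubles in shared memory
    procs = [mp.Process(target=work, args=(i, counter, samples))
             for i in range(5)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, samples[:])  # -> 5 [0.0, 0.5, 1.0, 1.5, 2.0]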
I'd be interested in more info on both the refcount freezing and the separate refcounts patches.
I can describe the patches:

- Separate refcounts replaces the refcount field with a pointer to the refcount, and changes incref/decref accordingly.

- Refcount freezing lets you walk all objects and set the reference count to a magic value. Incref/decref check whether the refcount is frozen before doing any work.

With freezing, unlike this approach to separate refcounts, anyone who touches the refcount manually will just dirty the page and unfreeze the refcount, rather than crashing the process. Both of them decrease performance for non-forking Python code, but for forking code that can be made up for, e.g. by increased worker lifetime and a decreased rate of page copying, plus the whole CPU vs. memory tradeoff. I legitimately don't remember the difference in performance, which is good, because I'm probably not allowed to say what it was, as it was tested on our actual app rather than on microbenchmarks. ;)
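Purely to illustrate the two schemes (the real patches change Py_INCREF/Py_DECREF in C; the Python below is a hypothetical rendering, not the actual code):

FROZEN = 0xFFFFFFFF  # hypothetical magic value marking a frozen refcount

def incref_frozen(obj):
    # Refcount freezing: if the count was frozen by the pre-fork walk,
    # skip the write, so incref never dirties the page holding the object.
    if obj["refcount"] == FROZEN:
        return
    obj["refcount"] += 1

def incref_separate(obj, refcount_arena):
    # Separate refcounts: the object holds only an index into a dedicated
    # refcount area, so refcount churn dirties those pages instead of the
    # copy-on-write pages holding the shared objects themselves.
    refcount_arena[obj["refcount_index"]] += 1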
And remember that we *do* have many examples of people using parallelized Python code in production. Are you sure you're satisfying their concerns, or whose concerns are you trying to satisfy?
Another good point. What would you suggest is the best way to find out?
I don't necessarily mean that. I mean that this thread feels like you posed an answer and I'm not sure what the question is. Is it about solving a real technical problem? What is that, and who does it affect? A new question I didn't ask before: is the problem with Python as a whole, or just CPython?

-- Devin