solving multi-core Python
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.

This proposal is meant to be a shot over the bow, so to speak. I plan on putting together a more complete PEP some time in the future, with more refined content and references to the appropriate online resources. Feedback appreciated! Offers to help even more so! :)

-eric

--------

Python's multi-core story is murky at best. Not only can we be more clear on the matter, we can improve Python's support. The result of any effort must make multi-core (i.e. parallelism) support in Python obvious, unmistakable, and undeniable (and keep it Pythonic).

Currently we have several concurrency models represented via threading, multiprocessing, asyncio, and concurrent.futures (plus others in the cheeseshop). However, in CPython the GIL means that we don't have parallelism, except through multiprocessing, which requires trade-offs. (See Dave Beazley's talk at PyCon US 2015.)

This is a situation I'd like us to solve once and for all, for a couple of reasons. Firstly, it is a technical roadblock for some Python developers, though I don't see that as a huge factor. Secondly, and regardless of the first point, it is a turnoff to folks looking into Python and ultimately a PR issue. The solution boils down to natively supporting multiple cores in Python code.

This is not a new topic. For a long time many have clamored for death to the GIL. Several attempts have been made over the years, but all have failed to remove it without sacrificing single-threaded performance. Furthermore, removing the GIL is perhaps an obvious solution, but it is not the only one. Others include Trent Nelson's PyParallel, STM, and other Python implementations.

Proposal
=======

In some personal correspondence, Nick Coghlan summarized my preferred approach as "the data storage separation of multiprocessing, with the low message passing overhead of threading".

For Python 3.6:

* expose subinterpreters to Python in a new stdlib module: "subinterpreters"
* add a new SubinterpreterExecutor to concurrent.futures (a rough usage sketch appears below, after the Influences notes)
* add a queue.Queue-like type that will be used to explicitly share objects between subinterpreters

This is less simple than it might sound, but it presents what I consider the best option for getting a meaningful improvement into Python 3.6. Also, I'm not convinced that the word "subinterpreter" properly conveys the intent, since subinterpreters are only part of the picture. So I'm open to a better name.

Influences
========

Note that I'm drawing quite a bit of inspiration from elsewhere. The idea of using subinterpreters to get this (more) efficient isolated execution is not my own (I heard it from Nick). I have also spent quite a bit of time and effort researching this proposal. As part of that, a number of people have provided invaluable insight and encouragement as I've prepared, including Guido, Nick, Brett Cannon, Barry Warsaw, and Larry Hastings.

Additionally, Hoare's "Communicating Sequential Processes" (CSP) has been a big influence on this proposal. FYI, CSP is also the inspiration for Go's concurrency model (e.g. goroutines, channels, select). Dr. Sarah Mount, who has expertise in this area, has been kind enough to agree to collaborate and even co-author the PEP that I hope comes out of this proposal.

My interest in this improvement has been building for several years. Recent events, including this year's language summit, have driven me to push for something concrete in Python 3.6.
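Here is the rough usage sketch mentioned in the Proposal bullets above. It is purely hypothetical: it assumes SubinterpreterExecutor would mirror the existing ThreadPoolExecutor/ProcessPoolExecutor interface, and none of the names or signatures are settled.

    from concurrent.futures import as_completed
    # Hypothetical: this executor does not exist yet; the name and
    # constructor signature are placeholders from the proposal.
    from concurrent.futures import SubinterpreterExecutor

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    # Each submitted task would run in its own subinterpreter (each on its
    # own thread), so these calls could use separate cores in one process.
    with SubinterpreterExecutor(max_workers=4) as pool:
        futures = [pool.submit(fib, 30) for _ in range(4)]
        for fut in as_completed(futures):
            print(fut.result())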
The subinterpreters Module
=====================

The subinterpreters module would look something like this (a la threading/multiprocessing):

  settrace()
  setprofile()
  stack_size()
  active_count()
  enumerate()
  get_ident()
  current_subinterpreter()

  Subinterpreter(...)
      id
      is_alive()
      running() -> Task or None
      run(...) -> Task  # wrapper around PyRun_*, auto-calls Task.start()
      destroy()

  Task(...)  # analogous to a CSP process
      id
      exception()
      # other stuff?

      # for compatibility with threading.Thread:
      name
      ident
      is_alive()
      start()
      run()
      join()

  Channel(...)  # shared by passing as an arg to the subinterpreter-running func
                # this API is a bit uncooked still...
      pop()
      push()
      poison()  # maybe
      select()  # maybe

Note that Channel objects will necessarily be shared in common between subinterpreters (where bound). This sharing will happen when one or more of the parameters to the function passed to Task() is a Channel. Thus the channel would be open to the (sub)interpreter calling Task() (or Subinterpreter.run()) and to the new subinterpreter. Also, other channels could be fed into such a shared channel, whereby those channels would then likewise be shared between the interpreters. (A rough usage sketch appears after the list of related ideas below.)

I don't know yet if this module should include *all* the essential pieces to implement a complete CSP library. Given the inspiration that CSP is providing, it may make sense to support it fully. It would be interesting then if the implementation here allowed the (complete?) formalisms provided by CSP (thus, e.g. rigorous proofs of concurrent system models).

I expect there will also be a _subinterpreters module with low-level implementation-specific details.

Related Ideas and Details Under Consideration
====================================

Some of these are details that need to be sorted out. Some are secondary ideas that may be appropriate to address in this proposal or may need to be tabled. I have some others, but these should be sufficient to demonstrate the range of points to consider.

* further coalesce the (concurrency/parallelism) abstractions between threading, multiprocessing, asyncio, and this proposal
* only allow one running Task at a time per subinterpreter
* disallow threading within subinterpreters (with legacy support in C)
  + ignore/remove the GIL within subinterpreters (since they would be single-threaded)
* use the GIL only in the main interpreter and for interaction between subinterpreters (and a "Local Interpreter Lock" for within a subinterpreter)
* disallow forking within subinterpreters
* only allow passing plain functions to Task() and Subinterpreter.run() (exclude closures, other callables)
* object ownership model
  + read-only in all but 1 subinterpreter
  + RW in all subinterpreters
  + only allow 1 subinterpreter to have any refcounts to an object (except for channels)
* only allow immutable objects to be shared between subinterpreters
* for better immutability, move object ref counts into a separate table
* freeze (new machinery or memcopy or something) objects to make them (at least temporarily) immutable
* expose a more complete CSP implementation in the stdlib (or make the subinterpreters module more compliant)
* treat the main interpreter differently than subinterpreters (or treat it exactly the same)
* add subinterpreter support to asyncio (the interplay between them could be interesting)
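Here is the rough usage sketch promised under the module description above. Everything in it is hypothetical: it assumes run() accepts the target function plus its arguments, and uses a plain None sentinel in place of the still-uncooked channel poisoning.

    # Hypothetical: the module and its API are only proposed, not implemented.
    import subinterpreters

    def worker(channel):
        # Runs inside the subinterpreter; the channel is shared because it
        # was passed in as an argument.
        while True:
            item = channel.pop()
            if item is None:          # stand-in for channel.poison()
                break
            print('handled', item)

    channel = subinterpreters.Channel()
    interp = subinterpreters.Subinterpreter()
    task = interp.run(worker, channel)   # returns a Task, auto-started
    for i in range(10):
        channel.push(i)
    channel.push(None)
    task.join()
    interp.destroy()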
Key Dependencies
================

There are a few related tasks/projects that will likely need to be resolved before subinterpreters in CPython can be used in the proposed manner. The proposal could be implemented either way, but it will help the multi-core effort if these are addressed first.

* fixes to subinterpreter support (there are a couple of individuals who should be able to provide the necessary insight)
* PEP 432 (will simplify several key implementation details)
* improvements to isolation between subinterpreters (file descriptors, env vars, others)

Beyond those, the scale and technical scope of this project means that I am unlikely to be able to do all the work myself to land this in Python 3.6 (though I'd still give it my best shot). That will require the involvement of various experts. I expect that the project is divisible into multiple mostly independent pieces, so that will help.

Python Implementations
===================

They can correct me if I'm wrong, but from what I understand, both Jython and IronPython already have subinterpreter support. I'll be soliciting feedback from the different Python implementors about subinterpreter support.

C Extension Modules
=================

Subinterpreters already isolate extension modules (and built-in modules, including sys). PEP 384 provides some help too. However, global state in C can easily leak data between subinterpreters, breaking the desired data isolation. This is something that will need to be addressed as part of the effort.
Eric, On 2015-06-20 5:42 PM, Eric Snow wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This is really great. Big +1 from me, and I'd be glad to help with the PEP/implementation. [...]
* only allow immutable objects to be shared between subinterpreters
Even if this is the only thing we have -- an efficient way for sharing immutable objects (such as bytes, strings, ints, and, stretching the definition of immutable, FDs) that will allow us to do a lot. Yury
It's worthwhile to consider fork as an alternative. IMO we'd get a lot out of making forking safer, easier, and more efficient. (e.g. respectively: adding an atfork registration mechanism; separating out the bits of multiprocessing that use pickle from those that don't; moving the refcount to a separate page, or allowing it to be frozen prior to a fork.) It sounds to me like this approach would use more memory than either regular threaded code or forking, so its main advantages are being cross-platform and less bug-prone. Is that right? Note: I don't count the IPC cost of forking, because at least on linux, any way to efficiently share objects between independent interpreters in separate threads can also be ported to independent interpreters in forked subprocesses, and *should* be. See also: multiprocessing.Value/Array. This is probably a good opportunity for that unification you mentioned. :) On Sat, Jun 20, 2015 at 3:04 PM, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
On 2015-06-20 5:42 PM, Eric Snow wrote:
* only allow immutable objects to be shared between subinterpreters
Even if this is the only thing we have -- an efficient way for sharing immutable objects (such as bytes, strings, ints, and, stretching the definition of immutable, FDs) that will allow us to do a lot.
+1, this has a lot of utility, and can be extended naturally to other types and circumstances. -- Devin
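For comparison, the multiprocessing.Value/Array primitives Devin mentions already provide serialization-free sharing of simple data today; the values live in shared memory that worker processes inherit. A minimal example:

    from multiprocessing import Process, Value, Array

    def work(counter, samples):
        with counter.get_lock():
            counter.value += 1    # shared C int; no pickling of the data
        samples[0] = 3.14         # shared array of C doubles

    if __name__ == '__main__':
        counter = Value('i', 0)
        samples = Array('d', 8)
        procs = [Process(target=work, args=(counter, samples)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(counter.value, samples[0])   # 4 3.14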
On Jun 20, 2015 4:55 PM, "Devin Jeanpierre" <jeanpierreda@gmail.com> wrote:
It's worthwhile to consider fork as an alternative. IMO we'd get a lot out of making forking safer, easier, and more efficient. (e.g. respectively: adding an atfork registration mechanism; separating out the bits of multiprocessing that use pickle from those that don't; moving the refcount to a separate page, or allowing it to be frozen prior to a fork.)
So leverage a common base of code with the multiprocessing module?
It sounds to me like this approach would use more memory than either regular threaded code or forking, so its main advantages are being cross-platform and less bug-prone. Is that right?
I would expect subinterpreters to use less memory. Furthermore creating them would be significantly faster. Passing objects between them would be much more efficient. And, yes, cross-platform.
Note: I don't count the IPC cost of forking, because at least on linux, any way to efficiently share objects between independent interpreters in separate threads can also be ported to independent interpreters in forked subprocesses,
How so? Subinterpreters are in the same process. For this proposal each would be on its own thread. Sharing objects between them through channels would be more efficient than IPC. Perhaps I've missed something?
and *should* be.
See also: multiprocessing.Value/Array. This is probably a good opportunity for that unification you mentioned. :)
I'll look.
On Sat, Jun 20, 2015 at 3:04 PM, Yury Selivanov <yselivanov.ml@gmail.com>
wrote:
On 2015-06-20 5:42 PM, Eric Snow wrote:
* only allow immutable objects to be shared between subinterpreters
Even if this is the only thing we have -- an efficient way for sharing immutable objects (such as bytes, strings, ints, and, stretching the definition of immutable, FDs) that will allow us to do a lot.
+1, this has a lot of utility, and can be extended naturally to other types and circumstances.
Agreed. -eric
On Sat, Jun 20, 2015 at 4:16 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Jun 20, 2015 4:55 PM, "Devin Jeanpierre" <jeanpierreda@gmail.com> wrote:
It's worthwhile to consider fork as an alternative. IMO we'd get a lot out of making forking safer, easier, and more efficient. (e.g. respectively: adding an atfork registration mechanism; separating out the bits of multiprocessing that use pickle from those that don't; moving the refcount to a separate page, or allowing it to be frozen prior to a fork.)
So leverage a common base of code with the multiprocessing module?
What is this question in response to? I don't understand.
I would expect subinterpreters to use less memory. Furthermore creating them would be significantly faster. Passing objects between them would be much more efficient. And, yes, cross-platform.
Maybe I don't understand how subinterpreters work. AIUI, the whole point of independent subinterpreters is that they share no state. So if I have a web server, each independent serving thread has to do all of the initialization (import HTTP libraries, etc.), right? Compare with forking, where the initialization is all done and then you fork, and you are immediately ready to serve, using the data structures shared with all the other workers, which is only copied when it is written to. So forking starts up faster and uses less memory (due to shared memory.) Re passing objects, see below. I do agree it's cross-platform, but right now that's the only thing I agree with.
Note: I don't count the IPC cost of forking, because at least on linux, any way to efficiently share objects between independent interpreters in separate threads can also be ported to independent interpreters in forked subprocesses,
How so? Subinterpreters are in the same process. For this proposal each would be on its own thread. Sharing objects between them through channels would be more efficient than IPC. Perhaps I've missed something?
You might be missing that memory can be shared between processes, not just threads, but I don't know.

The reason passing objects between processes is so slow is currently *nearly entirely* the cost of serialization. That is, it's the fact that you are passing an object to an entirely separate interpreter, and need to serialize the whole object graph and so on. If you can make that fast without serialization, for shared memory threads, then all the serialization becomes unnecessary, and you can either write to a pipe (fast, if it's a non-container), or use shared memory from the beginning (instantaneous). This is possible on any POSIX OS. Linux lets you go even further.

-- Devin
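As a minimal sketch of the "shared memory from the beginning" point (POSIX-only, since it relies on fork), an anonymous mmap is visible to both parent and child with no serialization at all:

    import mmap, os, struct

    # Anonymous shared mapping; after fork() both processes see the same page.
    buf = mmap.mmap(-1, mmap.PAGESIZE)

    pid = os.fork()
    if pid == 0:
        # Child: write raw bytes directly into the shared page.
        struct.pack_into('Q', buf, 0, 123456789)
        os._exit(0)

    os.waitpid(pid, 0)
    # Parent: the child's write is visible immediately; nothing was pickled.
    print(struct.unpack_from('Q', buf, 0)[0])   # 123456789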
On Sun, Jun 21, 2015 at 4:56 AM Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Sat, Jun 20, 2015 at 4:16 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Jun 20, 2015 4:55 PM, "Devin Jeanpierre" <jeanpierreda@gmail.com>
wrote:
It's worthwhile to consider fork as an alternative. IMO we'd get a lot out of making forking safer, easier, and more efficient. (e.g. respectively: adding an atfork registration mechanism; separating out the bits of multiprocessing that use pickle from those that don't; moving the refcount to a separate page, or allowing it to be frozen prior to a fork.)
So leverage a common base of code with the multiprocessing module?
What is this question in response to? I don't understand.
I would expect subinterpreters to use less memory. Furthermore creating them would be significantly faster. Passing objects between them would be much more efficient. And, yes, cross-platform.
Maybe I don't understand how subinterpreters work. AIUI, the whole point of independent subinterpreters is that they share no state. So if I have a web server, each independent serving thread has to do all of the initialization (import HTTP libraries, etc.), right? Compare with forking, where the initialization is all done and then you fork, and you are immediately ready to serve, using the data structures shared with all the other workers, which is only copied when it is written to.
Unfortunately CPython subinterpreters do share some state, though it is not visible to the running code in many cases. Thus the other mentions of "wouldn't it be nice if CPython didn't assume a single global state per process" (100% agreed, but tangential to this discussion)... https://docs.python.org/3/c-api/init.html#sub-interpreter-support

You are correct that some things that could make sense to share, such as imported modules, would not be shared as they are in a forked environment. This is an important oddity of subinterpreters: They have to re-import everything other than extension modules. When you've got a big process with a ton of modules (like, say, 100s of protocol buffers...), that's going to be a non-starter (pun intended) for the use of threads+subinterpreters as a fast form of concurrency if they need to import most of those from each subinterpreter. Startup latency and CPU usage += lots. (It possibly uses more memory as well, but given our existing refcount implementation forcing needless PyObject page writes during a read, causing fork to copy-on-write... impossible to guess.)

What this means for subinterpreters in this case is not much different from starting up multiple worker processes: You need to start them up and wait for them to be ready to serve, then reuse them as long as feasible before recycling them to start up a new one. The startup cost is high.

I'm not entirely sold on this overall proposal, but I think a result of it *could* be to make our subinterpreter support better, which would be a good thing.

We have had to turn people away from subinterpreters in the past for use as part of their multithreaded C++ server where they wanted to occasionally run some Python code in embedded interpreters as part of serving some requests. Doing that would suddenly single-thread their application (GIIIIIIL!) for all requests currently executing Python code, despite multiple subinterpreters.

The general advice for that: Run multiple Python processes and make RPCs to those from the C++ code. It allows for parallelism and ultimately scales better, if ever needed, as it can be easily spread across machines. Which one is more complex to maintain? Good question.

-gps
Re passing objects, see below.
I do agree it's cross-platform, but right now that's the only thing I agree with.
Note: I don't count the IPC cost of forking, because at least on linux, any way to efficiently share objects between independent interpreters in separate threads can also be ported to independent interpreters in forked subprocesses,
How so? Subinterpreters are in the same process. For this proposal each would be on its own thread. Sharing objects between them through channels would be more efficient than IPC. Perhaps I've missed something?
You might be missing that memory can be shared between processes, not just threads, but I don't know.
The reason passing objects between processes is so slow is currently *nearly entirely* the cost of serialization. That is, it's the fact that you are passing an object to an entirely separate interpreter, and need to serialize the whole object graph and so on. If you can make that fast without serialization, for shared memory threads, then all the serialization becomes unnecessary, and you can either write to a pipe (fast, if it's a non-container), or use shared memory from the beginning (instantaneous). This is possible on any POSIX OS. Linux lets you go even further.
-- Devin
On 23 Jun 2015 03:37, "Gregory P. Smith" <greg@krypto.org> wrote:
On Sun, Jun 21, 2015 at 4:56 AM Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Sat, Jun 20, 2015 at 4:16 PM, Eric Snow <ericsnowcurrently@gmail.com>
wrote:
On Jun 20, 2015 4:55 PM, "Devin Jeanpierre" <jeanpierreda@gmail.com>
wrote:
It's worthwhile to consider fork as an alternative. IMO we'd get a lot out of making forking safer, easier, and more efficient. (e.g. respectively: adding an atfork registration mechanism; separating out the bits of multiprocessing that use pickle from those that don't; moving the refcount to a separate page, or allowing it to be frozen prior to a fork.)
So leverage a common base of code with the multiprocessing module?
What is this question in response to? I don't understand.
I would expect subinterpreters to use less memory. Furthermore creating them would be significantly faster. Passing objects between them would be much more efficient. And, yes, cross-platform.
Maybe I don't understand how subinterpreters work. AIUI, the whole point of independent subinterpreters is that they share no state. So if I have a web server, each independent serving thread has to do all of the initialization (import HTTP libraries, etc.), right? Compare with forking, where the initialization is all done and then you fork, and you are immediately ready to serve, using the data structures shared with all the other workers, which is only copied when it is written to.
Unfortunately CPython subinterpreters do share some state, though it is not visible to the running code in many cases. Thus the other mentions of "wouldn't it be nice if CPython didn't assume a single global state per
process" (100% agreed, but tangential to this discussion)...
https://docs.python.org/3/c-api/init.html#sub-interpreter-support
You are correct that some things that could make sense to share, such as
imported modules, would not be shared as they are in a forked environment.
This is an important oddity of subinterpreters: They have to re-import
everything other than extension modules. When you've got a big process with a ton of modules (like, say, 100s of protocol buffers...), that's going to be a non-starter (pun intended) for the use of threads+subinterpreters as a fast form of concurrency if they need to import most of those from each subinterpreter. startup latency and cpu usage += lots. (possibly uses more memory as well but given our existing refcount implementation forcing needless PyObject page writes during a read causing fork to copy-on-write... impossible to guess)
What this means for subinterpreters in this case is not much different
from starting up multiple worker processes: You need to start them up and wait for them to be ready to serve, then reuse them as long as feasible before recycling them to start up a new one. The startup cost is high.

While I don't believe it's clear from the current text in the PEP (mostly because I only figured it out while hacking on the prototype implementation), PEP 432 should actually give us much better control over how subinterpreters are configured, as many more interpreter settings move out of global variables and into the interpreter state: https://www.python.org/dev/peps/pep-0432/ (the global variables will still exist, but primarily as an input to the initial configuration of the main interpreter)

The current state of that work can be seen at https://bitbucket.org/ncoghlan/cpython_sandbox/compare/pep432_modular_bootst... While a lot of things are broken there, it's at least to the point where it can start running the regression test suite under the new 2-phase initialisation model.

Cheers, Nick.
"Gregory P. Smith" <greg@krypto.org> wrote:
What this means for subinterpreters in this case is not much different from starting up multiple worker processes: You need to start them up and wait for them to be ready to serve, then reuse them as long as feasible before recycling them to start up a new one. The startup cost is high.
The startup cost for worker processes is high on Windows. It is very small on nearly any other OS. Sturla
On Mon, Jun 22, 2015 at 3:51 PM Sturla Molden <sturla.molden@gmail.com> wrote:
"Gregory P. Smith" <greg@krypto.org> wrote:
What this means for subinterpreters in this case is not much different from starting up multiple worker processes: You need to start them up and wait for them to be ready to serve, then reuse them as long as feasible before recycling them to start up a new one. The startup cost is high.
The startup cost for worker processes is high on Windows. It is very small on nearly any other OS.
While I understand that Windows adds some overhead there, startup time for Python worker processes is high on all OSes. Python startup is slow in general. It slows down further based on the modules you must import before you can begin work. -gps
On Mon, Jun 22, 2015 at 4:29 PM, Gregory P. Smith <greg@krypto.org> wrote:
On Mon, Jun 22, 2015 at 3:51 PM Sturla Molden <sturla.molden@gmail.com> wrote:
"Gregory P. Smith" <greg@krypto.org> wrote:
What this means for subinterpreters in this case is not much different from starting up multiple worker processes: You need to start them up and wait for them to be ready to serve, then reuse them as long as feasible before recycling them to start up a new one. The startup cost is high.
The startup cost for worker processes is high on Windows. It is very small on nearly any other OS.
While I understand that Windows adds some overhead there, startup time for Python worker processes is high on all OSes.
Python startup is slow in general. It slows down further based on the modules you must import before you can begin work.
Python does *very* little work on fork, which is what Sturla is alluding to. (Fork doesn't exist on Windows.) The only part I've found forking to be slow with is if you need to delay initialization of a thread pool and everything that depends on a thread pool until after the fork. This could hypothetically be made faster with subinterpreters if the thread pool was shared among all subinterpreters (e.g. if it was written in C.), but I would *expect* fork to be faster overall. That said, worker startup time is not actually very interesting anyway, since workers should restart rarely. I think its biggest impact is probably the time it takes to start your entire task from scratch. -- Devin
On 23/06/15 01:29, Gregory P. Smith wrote:
While I understand that Windows adds some overhead there, startup time for Python worker processes is high on all OSes.
No it is not.

A fork() will clone the process. You don't need to run any initialization code after that. You don't need to start a new Python interpreter -- you already have one. You don't need to run module imports -- they are already imported. You don't need to pickle and build Python objects -- they are already there. Everything you had in the parent process is ready to use in the child process. This magic happens so fast it is comparable to the time it takes Windows to start a thread.

On Windows, CreateProcess starts an "almost empty" process. You therefore have a lot of setup code to run. This is what makes starting Python processes with multiprocessing so much slower on Windows. It is not that Windows processes are more heavy-weight than threads (they are); the real issue is all the setup code you need to run. On Linux and Mac, you don't need to run any setup code after a fork().

Sturla
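A bare-bones, POSIX-only sketch of that point: the parent pays the import and setup cost once, and every forked child starts with those modules and objects already in place.

    import json
    import os

    # Imports and setup done once, in the parent, before any fork().
    CONFIG = {'workers': 4, 'payload': json.dumps({'hello': 'world'})}

    children = []
    for _ in range(CONFIG['workers']):
        pid = os.fork()
        if pid == 0:
            # Child: no re-import, no unpickling; CONFIG and the json module
            # are already here via the copy-on-write pages from the parent.
            print(os.getpid(), json.loads(CONFIG['payload']))
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)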
On 23.06.2015 13:57, Sturla Molden wrote:
On 23/06/15 01:29, Gregory P. Smith wrote:
While I understand that Windows adds some overhead there, startup time for Python worker processes is high on all OSes.
No it is not.
A fork() will clone the process. You don't need to run any initialization code after that. You don't need to start a new Python interpreter -- you already have one. You don't need to run module imports -- they are already imported. You don't need to pickle and build Python objects -- they are already there. Everything you had in the parent process is ready to use in the child process. This magic happens so fast it is comparable to the time it takes Windows to start a thread.
To be fair, you will nevertheless get a slowdown when copy-on-write kicks in while first using whatever was cloned from the parent. This is nothing which blocks execution, but slows down execution. That is no time which can directly be measured during the fork() call, but I would still count it into start up cost. regards, jwi
On 23/06/15 15:14, Jonas Wielicki wrote:
To be fair, you will nevertheless get a slowdown when copy-on-write kicks in while first using whatever was cloned from the parent. This is nothing which blocks execution, but slows down execution.
Yes, particularly because of reference counts. Unfortunately Python stores refcounts within the PyObject struct. And when a refcount is updated a copy of the entire 4 KB page is triggered. There would be far less of this if refcounts were kept in dedicated pages. Sturla
On Tue, Jun 23, 2015 at 7:55 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 23/06/15 15:14, Jonas Wielicki wrote:
To be fair, you will nevertheless get a slowdown when copy-on-write kicks in while first using whatever was cloned from the parent. This is nothing which blocks execution, but slows down execution.
Yes, particularly because of reference counts. Unfortunately Python stores refcounts within the PyObject struct. And when a refcount is updated a copy of the entire 4 KB page is triggered. There would be far less of this if refcounts were kept in dedicated pages.
A coworker of mine wrote a patch to Python that allows you to freeze refcounts for all existing objects before forking, if the correct compile options are set. This adds overhead to incref/decref, but dramatically changes the python+fork memory usage story. (I haven't personally played with it much, but it sounds decent.) If there's any interest I can try to upstream this change, guarded behind a compiler flag. We've also tried moving refcounts to their own pages, like you and Nick suggest, but it breaks a *lot* of third-party code. I can try to upstream it. If it's guarded by a compiler flag it is probably still useful, just any users would have to grep through their dependencies to make sure nothing directly accesses the refcount. (The stdlib can be made to work.) It sounds like it would also be useful for the main project in the topic of this thread, so I imagine there's more momentum behind it. -- Devin
On Tue, Jun 23, 2015 at 5:32 PM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
A coworker of mine wrote a patch to Python that allows you to freeze refcounts for all existing objects before forking, if the correct compile options are set. This adds overhead to incref/decref, but dramatically changes the python+fork memory usage story. (I haven't personally played with it much, but it sounds decent.) If there's any interest I can try to upstream this change, guarded behind a compiler flag.
We've also tried moving refcounts to their own pages, like you and Nick suggest, but it breaks a *lot* of third-party code. I can try to upstream it. If it's guarded by a compiler flag it is probably still useful, just any users would have to grep through their dependencies to make sure nothing directly accesses the refcount. (The stdlib can be made to work.) It sounds like it would also be useful for the main project in the topic of this thread, so I imagine there's more momentum behind it.
I'd be interested in more info on both the refcount freezing and the separate refcount pages. -eric
On Tue, Jun 23, 2015 at 5:32 PM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
We've also tried moving refcounts to their own pages, like you and Nick suggest, but it breaks a *lot* of third-party code. I can try to upstream it. If it's guarded by a compiler flag it is probably still useful, just any users would have to grep through their dependencies to make sure nothing directly accesses the refcount. (The stdlib can be made to work.) It sounds like it would also be useful for the main project in the topic of this thread, so I imagine there's more momentum behind it.
Any indication of the performance impact? -eric
On Mon, Jun 22, 2015 at 10:37 AM, Gregory P. Smith <greg@krypto.org> wrote:
This is an important oddity of subinterpreters: They have to re-import everything other than extension modules. When you've got a big process with a ton of modules (like, say, 100s of protocol buffers...), that's going to be a non-starter (pun intended) for the use of threads+subinterpreters as a fast form of concurrency if they need to import most of those from each subinterpreter. startup latency and cpu usage += lots. (possibly uses more memory as well but given our existing refcount implementation forcing needless PyObject page writes during a read causing fork to copy-on-write... impossible to guess)
What this means for subinterpreters in this case is not much different from starting up multiple worker processes: You need to start them up and wait for them to be ready to serve, then reuse them as long as feasible before recycling them to start up a new one. The startup cost is high.
One possibility would be for subinterpreters to copy modules from the main interpreter -- I guess your average module is mostly dicts, strings, type objects, and functions; strings and functions are already immutable and could be shared without copying, and I guess copying the dicts and type objects into the subinterpreter is much cheaper than hitting the disk etc. to do a real import. (Though certainly not free.) This would have interesting semantic implications -- it would give similar effects to fork(), with subinterpreters starting from a snapshot of the main interpreter's global state.
I'm not entirely sold on this overall proposal, but I think a result of it could be to make our subinterpreter support better which would be a good thing.
We have had to turn people away from subinterpreters in the past for use as part of their multithreaded C++ server where they wanted to occasionally run some Python code in embedded interpreters as part of serving some requests. Doing that would suddenly single thread their application (GIIIIIIL!) for all requests currently executing Python code despite multiple subinterpreters.
I've also talked to HPC users who discovered this problem the hard way (e.g. http://www-atlas.lbl.gov/, folks working on the Large Hadron Collider) -- they've been using Python as an extension language in some large physics codes but are now porting those bits to C++ because of the GIL issues. (In this context startup overhead should be easily amortized, but switching to an RPC model is not going to happen.) -n -- Nathaniel J. Smith -- http://vorpus.org
On Tue, Jun 23, 2015 at 9:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
One possibility would be for subinterpreters to copy modules from the main interpreter -- I guess your average module is mostly dicts, strings, type objects, and functions; strings and functions are already immutable and could be shared without copying, and I guess copying the dicts and type objects into the subinterpreter is much cheaper than hitting the disk etc. to do a real import. (Though certainly not free.)
FWIW, functions aren't immutable, but code objects are. ChrisA
On 23 June 2015 at 10:03, Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Jun 23, 2015 at 9:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
One possibility would be for subinterpreters to copy modules from the main interpreter -- I guess your average module is mostly dicts, strings, type objects, and functions; strings and functions are already immutable and could be shared without copying, and I guess copying the dicts and type objects into the subinterpreter is much cheaper than hitting the disk etc. to do a real import. (Though certainly not free.)
FWIW, functions aren't immutable, but code objects are.
Anything we come up with for optimised data sharing via channels could be applied to passing a prebuilt sys.modules dictionary through to subinterpreters.

The key for me is to start from a well-defined "shared nothing" semantic model, but then look for ways to exploit the fact that we actually *are* running in the same address space to avoid copying objects.

The current reference-counts-embedded-in-the-object-structs memory layout also plays havoc with the all-or-nothing page level copy-on-write semantics used by the fork() syscall at the operating system layer, so some of the ideas we've been considering (specifically, those related to moving the reference counter bookkeeping out of the object structs themselves) would potentially help with that as well (but would also have other hard to predict performance consequences).

There's a reason Eric announced this as the *start* of a research project, rather than as a finished proposal - while it seems conceptually sound overall, there are a vast number of details to be considered that will no doubt hold a great many devils :)

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Jun 23, 2015, at 01:52 PM, Nick Coghlan wrote:
The current reference-counts-embedded-in-the-object-structs memory layout also plays havoc with the all-or-nothing page level copy-on-write semantics used by the fork() syscall at the operating system layer, so some of the ideas we've been considering (specifically, those related to moving the reference counter bookkeeping out of the object structs themselves) would potentially help with that as well (but would also have other hard to predict performance consequences).
A crazy offshoot idea would be something like Emacs' unexec, where during the build process you could preload a bunch of always-used immutable modules, then freeze the state in such a way that starting up again later would be much faster, because the imports (and probably more importantly, the searching) could be avoided. Cheers, -Barry
On Tue, Jun 23, 2015 at 9:01 AM Barry Warsaw <barry@python.org> wrote:
On Jun 23, 2015, at 01:52 PM, Nick Coghlan wrote:
The current reference-counts-embedded-in-the-object-structs memory layout also plays havoc with the all-or-nothing page level copy-on-write semantics used by the fork() syscall at the operating system layer, so some of the ideas we've been considering (specifically, those related to moving the reference counter bookkeeping out of the object structs themselves) would potentially help with that as well (but would also have other hard to predict performance consequences).
A crazy offshoot idea would be something like Emacs' unexec, where during the build process you could preload a bunch of always-used immutable modules, then freeze the state in such a way that starting up again later would be much faster, because the imports (and probably more importantly, the searching) could be avoided.
I actually would like something like this for Python, but I want it to work with hash randomization rather than freezing a single fixed hash seed. That means you'd need to record the location of all hash tables and cached hashes and fix them up after loading such a binary image at process start time, much like processing relocations when loading a binary executable. Non trivial. -gps
On Tue, Jun 23, 2015 at 10:01 AM, Barry Warsaw <barry@python.org> wrote:
A crazy offshoot idea would be something like Emacs' unexec, where during the build process you could preload a bunch of always-used immutable modules, then freeze the state in such a way that starting up again later would be much faster, because the imports (and probably more importantly, the searching) could be avoided.
+1 -eric
Barry Warsaw writes:
A crazy offshoot idea would be something like Emacs' unexec, where during the build process you could preload a bunch of always-used immutable modules,
XEmacs doesn't do this any more if it can avoid it; we now have a portable dumper that we use on almost all platforms. And everybody at GNU Emacs who works with the unexec code wants to get rid of it.

XEmacs's legacy unexec requires defeating address space randomization as well as certain optimizations that combine segments. I believe Emacs's does too. From a security standpoint, Emacsen are a child's garden of diseases and it will take decades, maybe centuries, to fix that, so those aren't huge problems for us. But I suppose Python needs to be able to work and play nicely with high-security environments, and would like to take advantage of security-oriented OS facilities like base address randomization. That kind of thing hugely complicates unexec -- last I heard it wasn't just "way too much work to be worth it", the wonks who created the portable dumper didn't know how to do it and weren't sure it could be done.

XEmacs's default "portable dumper" is a poor man's relocating loader. I don't know exactly how it works, can't give details. Unlike the unexecs of some Lisps, however, this is a "do it once per build process" design. There's no explicit provision for keeping multiple dumpfiles around, although I believe it can be done "by hand" by someone with a little bit of knowledge. The reason for this is that the dumpfile is actually added to the executable.

Regarding performance, the dumper itself is fast enough to be imperceptible to humans at load time, and doesn't take very long to build the dump file containing the "frozen" objects when building. I suspect Python has applications where it would like to be faster than that, but I don't have benchmarks so don't know if this approach would be fast enough.

This approach has the feature (disadvantage?) that some objects can't be dumped, including editing buffers, network connections, and processes. I suppose those restrictions are very similar to the restrictions imposed by pickle.

If somebody wants to know more about the portable dumper, I can probably connect them with the authors of that feature.
On Mon, Jun 22, 2015 at 9:52 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 23 June 2015 at 10:03, Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Jun 23, 2015 at 9:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
One possibility would be for subinterpreters to copy modules from the main interpreter -- I guess your average module is mostly dicts, strings, type objects, and functions; strings and functions are already immutable and could be shared without copying, and I guess copying the dicts and type objects into the subinterpreter is much cheaper than hitting the disk etc. to do a real import. (Though certainly not free.)
FWIW, functions aren't immutable, but code objects are.
Anything we come up with for optimised data sharing via channels could be applied to passing a prebuilt sys.modules dictionary through to subinterpreters.
The key for me is to start from a well-defined "shared nothing" semantic model, but then look for ways to exploit the fact that we actually *are* running in the same address space to avoid copying objects.
Exactly.
The current reference-counts-embedded-in-the-object-structs memory layout also plays havoc with the all-or-nothing page level copy-on-write semantics used by the fork() syscall at the operating system layer, so some of the ideas we've been considering (specifically, those related to moving the reference counter bookkeeping out of the object structs themselves) would potentially help with that as well (but would also have other hard to predict performance consequences).
There's a reason Eric announced this as the *start* of a research project, rather than as a finished proposal - while it seems conceptually sound overall, there are a vast number of details to be considered that will no doubt hold a great many devils :)
And they keep multiplying! :) -eric
On Mon, Jun 22, 2015 at 5:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jun 22, 2015 at 10:37 AM, Gregory P. Smith <greg@krypto.org> wrote:
...
One possibility would be for subinterpreters to copy modules from the main interpreter -- I guess your average module is mostly dicts, strings, type objects, and functions; strings and functions are already immutable and could be shared without copying, and I guess copying the dicts and type objects into the subinterpreter is much cheaper than hitting the disk etc. to do a real import. (Though certainly not free.)
Yeah, I think there are a number of mechanisms we can explore to improve the efficiency of subinterpreter startup (and sharing).
This would have interesting semantic implications -- it would give similar effects to fork(), with subinterpreters starting from a snapshot of the main interpreter's global state.
I'm not entirely sold on this overall proposal, but I think a result of it could be to make our subinterpreter support better which would be a good thing.
We have had to turn people away from subinterpreters in the past for use as part of their multithreaded C++ server where they wanted to occasionally run some Python code in embedded interpreters as part of serving some requests. Doing that would suddenly single thread their application (GIIIIIIL!) for all requests currently executing Python code despite multiple subinterpreters.
I've also talked to HPC users who discovered this problem the hard way (e.g. http://www-atlas.lbl.gov/, folks working on the Large Hadron Collider) -- they've been using Python as an extension language in some large physics codes but are now porting those bits to C++ because of the GIL issues. (In this context startup overhead should be easily amortized, but switching to an RPC model is not going to happen.)
Would this proposal make a difference for them? -eric
On Tue, Jun 23, 2015 at 11:11 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Mon, Jun 22, 2015 at 5:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jun 22, 2015 at 10:37 AM, Gregory P. Smith <greg@krypto.org> wrote:
We have had to turn people away from subinterpreters in the past for use as part of their multithreaded C++ server where they wanted to occasionally run some Python code in embedded interpreters as part of serving some requests. Doing that would suddenly single thread their application (GIIIIIIL!) for all requests currently executing Python code despite multiple subinterpreters.
I've also talked to HPC users who discovered this problem the hard way (e.g. http://www-atlas.lbl.gov/, folks working on the Large Hadron Collider) -- they've been using Python as an extension language in some large physics codes but are now porting those bits to C++ because of the GIL issues. (In this context startup overhead should be easily amortized, but switching to an RPC model is not going to happen.)
Would this proposal make a difference for them?
I'm not sure -- it was just a conversation, so I've never seen their actual code. I'm pretty sure they're still on py2, for one thing :-). But putting that aside, I *think* it potentially could help -- my guess is that at a high level they have an API where they basically want to register a callback once, and then call it in parallel from multiple threads. This kind of usage would require some extra machinery, I guess, to spawn a subinterpreter for each thread and import the relevant libraries so the callback could run, but I can't see any reason one couldn't build that on top of the mechanisms you're talking about. -n -- Nathaniel J. Smith -- http://vorpus.org
On Sun, Jun 21, 2015 at 5:55 AM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Sat, Jun 20, 2015 at 4:16 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Jun 20, 2015 4:55 PM, "Devin Jeanpierre" <jeanpierreda@gmail.com> wrote:
It's worthwhile to consider fork as an alternative. IMO we'd get a lot out of making forking safer, easier, and more efficient. (e.g. respectively: adding an atfork registration mechanism; separating out the bits of multiprocessing that use pickle from those that don't; moving the refcount to a separate page, or allowing it to be frozen prior to a fork.)
So leverage a common base of code with the multiprocessing module?
What is this question in response to? I don't understand.
It sounded like you were suggesting that we factor out a common code base that could be used by multiprocessing and the other machinery and that only multiprocessing would keep the pickle-related code.
I would expect subinterpreters to use less memory. Furthermore creating them would be significantly faster. Passing objects between them would be much more efficient. And, yes, cross-platform.
Maybe I don't understand how subinterpreters work. AIUI, the whole point of independent subinterpreters is that they share no state. So if I have a web server, each independent serving thread has to do all of the initialization (import HTTP libraries, etc.), right?
Yes. However, I expect that we could mitigate that cost to some extent.
Compare with forking, where the initialization is all done and then you fork, and you are immediately ready to serve, using the data structures shared with all the other workers, which is only copied when it is written to. So forking starts up faster and uses less memory (due to shared memory.)
But we are aiming for a share-nothing model with an efficient object-passing mechanism. Furthermore, subinterpreters do not have to be single-use. My proposal includes running tasks in an existing subinterpreter (e.g. executor pool), so that start-up cost is mitigated in cases where it matters. Note that ultimately my goal is to make it obvious and undeniable that Python (3.6+) has a good multi-core story. In my proposal, subinterpreters are a means to an end. If there's a better solution then great! As long as the real goal is met I'll be satisfied. :) For now I'm still confident that the subinterpreter approach is the best option for meeting the goal.
Re passing objects, see below.
I do agree it's cross-platform, but right now that's the only thing I agree with.
Note: I don't count the IPC cost of forking, because at least on linux, any way to efficiently share objects between independent interpreters in separate threads can also be ported to independent interpreters in forked subprocesses,
How so? Subinterpreters are in the same process. For this proposal each would be on its own thread. Sharing objects between them through channels would be more efficient than IPC. Perhaps I've missed something?
You might be missing that memory can be shared between processes, not just threads, but I don't know.
The reason passing objects between processes is so slow is currently *nearly entirely* the cost of serialization. That is, it's the fact that you are passing an object to an entirely separate interpreter, and need to serialize the whole object graph and so on. If you can make that fast without serialization,
That is a worthy goal!
for shared memory threads, then all the serialization becomes unnecessary, and you can either write to a pipe (fast, if it's a non-container), or use shared memory from the beginning (instantaneous). This is possible on any POSIX OS. Linux lets you go even further.
And this is faster than passing objects around within the same process? Does it play well with Python's memory model? -eric
I'm going to break mail client threading and also answer some of your other emails here. On Tue, Jun 23, 2015 at 10:26 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
It sounded like you were suggesting that we factor out a common code base that could be used by multiprocessing and the other machinery and that only multiprocessing would keep the pickle-related code.
Yes, I like that idea a lot.
Compare with forking, where the initialization is all done and then you fork, and you are immediately ready to serve, using the data structures shared with all the other workers, which is only copied when it is written to. So forking starts up faster and uses less memory (due to shared memory.)
But we are aiming for a share-nothing model with an efficient object-passing mechanism. Furthermore, subinterpreters do not have to be single-use. My proposal includes running tasks in an existing subinterpreter (e.g. executor pool), so that start-up cost is mitigated in cases where it matters.
Note that ultimately my goal is to make it obvious and undeniable that Python (3.6+) has a good multi-core story. In my proposal, subinterpreters are a means to an end. If there's a better solution then great! As long as the real goal is met I'll be satisfied. :) For now I'm still confident that the subinterpreter approach is the best option for meeting the goal.
Ahead of time: the following is my opinion. My opinions are my own, and bizarre, unlike the opinions of my employer and coworkers. (Who are also reading this maybe.)

So there's two reasons I can think of to use threads for CPU parallelism:

- My thing does a lot of parallel work, and so I want to save on memory by sharing an address space.

  This only becomes an especially pressing concern if you start running tens of thousands or more of workers. Fork also allows this.

- My thing does a lot of communication, and so I want fast communication through a shared address space.

  This can become a pressing concern immediately, and so is a more visible issue. However, it's also a non-problem for many kinds of tasks which just take requests in and put output back out, without talking with other members of the pool (e.g. writing an RPC server or HTTP server.) I would also speculate that once you're on many machines, unless you're very specific with your design, RPC costs dominate IPC costs to the point where optimizing IPC doesn't do a lot for you.

On Unix, IPC can be free or cheap due to shared memory.

Threads really aren't all that important, and if we need them, we have them. When people tell me in #python that multicore in Python is bad because of the GIL, I point them at fork and at C extensions, but also at PyPy-STM and Jython. Everything has problems, but then so does this proposal, right?
And this is faster than passing objects around within the same process? Does it play well with Python's memory model?
As far as whether it plays with the memory model, multiprocessing.Value() just works, today. To make it even lower overhead (not construct an int PyObject* on the fly), you need to change things, e.g. the way refcounts work. I think it's possibly feasible. If not, at least the overhead would be negligible. Same applies to strings and other non-compound datatypes. Compound datatypes are hard even for the subinterpreter case, just because the objects you're referring to are not likely to exist on the other end, so you need a real copy. I'm sure you've thought about this. multiprocessing.Array has a solution for this, which is to unbox the contained values. It won't work with tuples.
I'd be interested in more info on both the refcount freezing and the separate refcount pages.
I can describe the patches:

- separate refcounts replaces refcount with a pointer to refcount, and changes incref/decref.

- refcount freezing lets you walk all objects and set the reference count to a magic value. incref/decref check if the refcount is frozen before working.

With freezing, unlike this approach to separate refcounts, anyone that touches the refcount manually will just dirty the page and unfreeze the refcount, rather than crashing the process.

Both of them will decrease performance for non-forking python code, but for forking code it can be made up for e.g. by increased worker lifetime and decreased rate of page copying, plus the whole CPU vs memory tradeoff. I legitimately don't remember the difference in performance, which is good because I'm probably not allowed to say what it was, as it was tested on our actual app and not microbenchmarks. ;)
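A toy, Python-level model of the freezing idea described above; the real patch lives in C inside the incref/decref macros, and the magic value here is made up:

    class Obj:
        """Stand-in for the refcount field in a C-level object header."""
        def __init__(self):
            self.refcnt = 1

    FROZEN = 1 << 62   # made-up sentinel meaning "refcount is frozen"

    def incref(ob):
        # Frozen objects are never written to, so (in the real C patch)
        # their memory pages stay shared with the parent after fork().
        if ob.refcnt != FROZEN:
            ob.refcnt += 1

    def decref(ob):
        if ob.refcnt != FROZEN:
            ob.refcnt -= 1

    ob = Obj()
    incref(ob)               # normal behaviour: 1 -> 2
    ob.refcnt = FROZEN       # "freeze" before forking workers
    incref(ob); decref(ob)   # now no-ops; the page is never dirtied
    print(ob.refcnt == FROZEN)   # True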
And remember that we *do* have many examples of people using parallelized Python code in production. Are you sure you're satisfying their concerns, or whose concerns are you trying to satisfy?
Another good point. What would you suggest is the best way to find out?
I don't necessarily mean that. I mean that this thread feels like you posed an answer and I'm not sure what the question is. Is it about solving a real technical problem? What is that, and who does it affect? A new question I didn't ask before: is the problem with Python as a whole, or just CPython? -- Devin
On 25/06/15 00:10, Devin Jeanpierre wrote:
So there's two reasons I can think of to use threads for CPU parallelism:
- My thing does a lot of parallel work, and so I want to save on memory by sharing an address space
This only becomes an especially pressing concern if you start running tens of thousands or more of workers. Fork also allows this.
This might not be a valid concern. Sharing address space means sharing *virtual memory*. Presumably what they really want is to save *physical memory*. Two processes can map the same physical memory into virtual memory.
- My thing does a lot of communication, and so I want fast communication through a shared address space
This can become a pressing concern immediately, and so is a more visible issue.
This is a valid argument. It is mainly a concern for those who use deeply nested Python objects though.
On Unix, IPC can be free or cheap due to shared memory.
This is also the case on Windows. IPC mechanisms like pipes, fifos, Unix domain sockets are also very cheap on Unix. Pipes are also very cheap on Windows, as are tcp sockets on localhost. Windows named pipes are similar to Unix domain sockets in performance.
Same applies to strings and other non-compound datatypes. Compound datatypes are hard even for the subinterpreter case, just because the objects you're referring to are not likely to exist on the other end, so you need a real copy.
Yes. With a "share nothing" message-passing approach, one will have to make deep copies of any mutable object. And even though a tuple can be immutable, it could still contain mutable objects. It is really hard to get around the pickle overhead with subinterpreters. Since the pickle overhead is huge compared to the low-level IPC, there is very little to save in this manner.
- separate refcounts replaces refcount with a pointer to refcount, and changes incref/decref. - refcount freezing lets you walk all objects and set the reference count to a magic value. incref/decref check if the refcount is frozen before working.
With freezing, unlike this approach to separate refcounts, anyone that touches the refcount manually will just dirty the page and unfreeze the refcount, rather than crashing the process.
Both of them will decrease performance for non-forking python code,
Freezing has little impact on a modern CPU with branch prediction. On GCC we can also use __builtin_expect to make sure the optimal code is generated.

This is a bit similar to using typed memoryviews and NumPy arrays in Cython with and without bounds checking. A pragma like @cython.boundscheck(False) has little benefit for performance because of the CPU's branch prediction. The CPU knows it can expect the bounds check to pass, and only if it fails will it have to flush the pipeline. If the bounds check passes the pipeline need not be flushed, and performance-wise it will be as if the test were never there. This has improved greatly over the last decade, particularly because processors have been optimized for running languages like Java and .NET efficiently. A check for a thawed refcount would be similarly cheap.

Keeping reference counts in extra pages could impair performance, but mostly if multiple threads are allowed to access the same page. Because of hierarchical memory, the extra pointer lookup should not matter much. Modern CPUs have evolved to solve the aliasing problem that formerly made Fortran code run faster than similar C code; today C code tends to be faster than similar Fortran. This helps if we keep refcounts in a separate page and the compiler cannot know what the pointer actually refers to or what it might alias. 10 or 15 years ago it would have been a performance killer, but not today.

Sturla
On Wed, Jun 24, 2015 at 4:30 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 25/06/15 00:10, Devin Jeanpierre wrote:
So there's two reasons I can think of to use threads for CPU parallelism:
- My thing does a lot of parallel work, and so I want to save on memory by sharing an address space
This only becomes an especially pressing concern if you start running tens of thousands or more of workers. Fork also allows this.
This might not be a valid concern. Sharing address space means sharing *virtual memory*. Presumably what they really want is to save *physical memory*. Two processes can map the same physical memory into virtual memory.
Yeah, physical memory. I agree, processes with shared memory can be made to work in practice. Although, threads are better for memory usage, by defaulting to sharing even on write. (Good for memory, maybe not so good for bug-freedom...) So from my perspective, this is the hard problem in multicore python. My views may be skewed by the peculiarities of the one major app I've worked on.
Same applies to strings and other non-compound datatypes. Compound datatypes are hard even for the subinterpreter case, just because the objects you're referring to are not likely to exist on the other end, so you need a real copy.
Yes.
With a "share nothing" message-passing approach, one will have to make deep copies of any mutable object. And even though a tuple can be immutable, it could still contain mutable objects. It is really hard to get around the pickle overhead with subinterpreters. Since the pickle overhead is huge compared to the low-level IPC, there is very little to save in this manner.
I think this is giving up too easily. Here's a stupid idea for sharable interpreter-specific objects:

You keep a special heap for immutable object refcounts, where each thread/process has its own region in the heap. Refcount locations are stored as offsets into the thread-local heap, and incref does ++*(threadlocal_refcounts + refcount_offset);

Then for the rest of a pyobject's memory, we share by default and introduce a marker for which thread originated it. Any non-threadsafe operations can check if the originating thread id is the same as the current thread id, and raise an exception if not, before even reading the memory at all. So it introduces an overhead to accessing mutable objects. Also, this won't work with extension objects that don't check; those just get shared, mutated unsafely, and crash.

This also introduces the possibility of sharing mutable objects between interpreters, if the objects themselves choose to implement fine-grained locking. And it should work fine with fork if we change how the refcount heap is allocated, to use mmap or whatever.

This is probably not acceptable for real, but I just mean to show with a straw man that the problem can be attacked.

-- Devin
On 25/06/15 02:09, Devin Jeanpierre wrote:
Although, threads are better for memory usage, by defaulting to sharing even on write. (Good for memory, maybe not so good for bug-freedom...)
I am not sure. Code written to use OpenMP tends to have fewer bugs than code written to use MPI. This suggests that shared memory is easier than message-passing, which is contrary to the common belief. My own experience with OpenMP and MPI suggests it is easier to create a deadlock with message-passing than to accidentally have threads access the same address concurrently. This is also what I hear from other people who write code for scientific computing. I see a lot of claims that message-passing is supposed to be "safer" than a shared memory model, but that is not what we see with OpenMP and MPI.

With MPI, the programmer must make sure that the send and receive commands are passed in the right order at the right time, in each process. This leaves plenty of room for messing up or creating unmaintainable spaghetti code, particularly in a complex algorithm. It is easier to make sure all shared objects are protected with mutexes than to make sure a spaghetti of send and receive messages is in the correct order.

It might be that Python's queue method of passing messages leaves less room for deadlocking than the socket-like MPI_send and MPI_recv functions. But I think message-passing is sometimes overrated as "the safe solution" to multi-core programming (cf. Go and Erlang).

Sturla
On Wed, Jun 24, 2015 at 5:45 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 25/06/15 02:09, Devin Jeanpierre wrote:
Although, threads are better for memory usage, by defaulting to sharing even on write. (Good for memory, maybe not so good for bug-freedom...)
I am not sure. Code written to use OpenMP tends to have fewer bugs than code written to use MPI. This suggests that shared memory is easier than message-passing, which is contrary to the common belief.
OpenMP is an *extremely* structured and constrained subset of shared memory multithreading, and not at all comparable to pthreads/threading.py/whatever. -n -- Nathaniel J. Smith -- http://vorpus.org
Nathaniel Smith <njs@pobox.com> wrote:
OpenMP is an *extremely* structured and constrained subset of shared memory multithreading, and not at all comparable to pthreads/threading.py/whatever.
If you use "parallel section" it is almost as free as using pthreads directly. But if you stick to "parallel for", which most do, you have a rather constrained and more well-behaved subset. I am quite sure MPI can be a source of even more errors than pthreads used directly. Getting message passing right inside a complex algorithm is no fun. I would rather keep my mind focused on which objects to protect with a lock or when to signal a condition.

Sturla
On Jun 20, 2015 2:42 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This all sounds really cool if you can pull it off, and shared-nothing threads do seem like the least impossible model to pull off. But "least impossible" and "possible" are different :-). From your email I can't tell whether this plan is viable while preserving backcompat and memory safety. Suppose I have a queue between two subinterpreters, and on this queue I place a list of dicts of user-defined-in-python objects, each of which holds a reference to a user-defined-via-the-C-api object. What happens next? -n
On Jun 20, 2015 4:08 PM, "Nathaniel Smith" <njs@pobox.com> wrote:
On Jun 20, 2015 2:42 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This all sounds really cool if you can pull it off, and shared-nothing
threads do seem like the least impossible model to pull off. Agreed.
But "least impossible" and "possible" are different :-). From your email I can't tell whether this plan is viable while preserving backcompat and memory safety.
I agree that those issues must be clearly solved in the proposal before it can be approved. I'm confident the approach I'm pursuing will afford us the necessary guarantees. I'll address those specific points directly when I can sit down and organize my thoughts.
Suppose I have a queue between two subinterpreters, and on this queue I
place a list of dicts of user-defined-in-python objects, each of which holds a reference to a user-defined-via-the-C-api object. What happens next?

You've hit upon exactly the trickiness involved and why I'm thinking the best approach initially is to only allow *strictly* immutable objects to pass between interpreters. Admittedly, my description of channels is very vague. :) There are a number of possibilities with them that I'm still exploring (CSP has particular opinions...), but immutability is a characteristic that may provide the simplest *initial* approach. Going that route shouldn't preclude adding some sort of support for mutable objects later.

Keep in mind that by "immutability" I'm talking about *really* immutable, perhaps going so far as treating the full memory space associated with an object as frozen. For instance, we'd have to ensure that "immutable" Python objects like strings, ints, and tuples do not change (i.e. via the C API). The contents of involved tuples/containers would have to be likewise immutable. Even changing refcounts could be too much, hence the idea of moving refcounts out to a separate table.

This level of immutability would be something new to Python. We'll see if it's necessary. If it isn't too much work it might be a good idea regardless of the multi-core proposal.

Also note that Barry has a (rejected) PEP from a number of years ago about freezing objects... That idea is likely out of scope as relates to my proposal, but it certainly factors in the problem space.

-eric
On Jun 20, 2015 3:54 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Jun 20, 2015 4:08 PM, "Nathaniel Smith" <njs@pobox.com> wrote:
On Jun 20, 2015 2:42 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This all sounds really cool if you can pull it off, and shared-nothing
threads do seem like the least impossible model to pull off.
Agreed.
But "least impossible" and "possible" are different :-). From your
email I can't tell whether this plan is viable while preserving backcompat and memory safety.
I agree that those issues must be clearly solved in the proposal before it can be approved. I'm confident the approach I'm pursuing will afford us the necessary guarantees. I'll address those specific points directly when I can sit down and organize my thoughts.

I'd love to see just a hand wavy, verbal proof-of-concept walking through how this might work in some simple but realistic case. To me a single compelling example could make this proposal feel much more concrete and achievable.

Suppose I have a queue between two subinterpreters, and on this queue I place a list of dicts of user-defined-in-python objects, each of which holds a reference to a user-defined-via-the-C-api object. What happens next?
You've hit upon exactly the trickiness involved and why I'm thinking the best approach initially is to only allow *strictly* immutable objects to pass between interpreters. Admittedly, my description of channels is very vague. :) There are a number of possibilities with them that I'm still exploring (CSP has particular opinions...), but immutability is a characteristic that may provide the simplest *initial* approach. Going that route shouldn't preclude adding some sort of support for mutable objects later.

There aren't really many options for mutable objects, right? If you want shared nothing semantics, then transmitting a mutable object either needs to make a copy, or else be a real transfer, where the sender no longer has it (cf. Rust). I guess for the latter you'd need some new syntax for send-and-del, that requires the object to be self contained (all mutable objects reachable from it are only referenced by each other) and have only one reference in the sending process (which is the one being sent and then destroyed).

Keep in mind that by "immutability" I'm talking about *really* immutable, perhaps going so far as treating the full memory space associated with an object as frozen. For instance, we'd have to ensure that "immutable" Python objects like strings, ints, and tuples do not change (i.e. via the C API).

This seems like a red herring to me. It's already the case that you can't legally use the c api to mutate tuples, ints, for any object that's ever been, say, passed to a function. So for these objects, the subinterpreter setup doesn't actually add any new constraints on user code. C code is always going to be *able* to break memory safety so long as you're using shared-memory threading at the c level to implement this stuff. We just need to make it easy not to. Refcnts and garbage collection are another matter, of course.

-n
On 21 June 2015 at 15:25, Nathaniel Smith <njs@pobox.com> wrote:
On Jun 20, 2015 3:54 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Jun 20, 2015 4:08 PM, "Nathaniel Smith" <njs@pobox.com> wrote:
On Jun 20, 2015 2:42 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This all sounds really cool if you can pull it off, and shared-nothing threads do seem like the least impossible model to pull off.
Agreed.
But "least impossible" and "possible" are different :-). From your email I can't tell whether this plan is viable while preserving backcompat and memory safety.
I agree that those issues must be clearly solved in the proposal before it can be approved. I'm confident the approach I'm pursuing will afford us the necessary guarantees. I'll address those specific points directly when I can sit down and organize my thoughts.
I'd love to see just a hand wavy, verbal proof-of-concept walking through how this might work in some simple but realistic case. To me a single compelling example could make this proposal feel much more concrete and achievable.
I was one of the folks pushing Eric in this direction, and that's because it's a possibility that was conceived of a few years back, but never tried due to lack of time (and inclination for those of us that are using Python primarily as an orchestration tool and hence spend most of our time on IO bound problems rather than CPU bound ones): http://www.curiousefficiency.org/posts/2012/07/volunteer-supported-free-thre...

As mentioned there, I've at least spent some time with Graham Dumpleton over the past few years figuring out (and occasionally trying to address) some of the limitations of mod_wsgi's existing subinterpreter based WSGI app separation: https://code.google.com/p/modwsgi/wiki/ProcessesAndThreading#Python_Sub_Inte...

The fact that mod_wsgi can run most Python web applications in a subinterpreter quite happily means we already know the core mechanism works fine, and there don't appear to be any insurmountable technical hurdles between the status quo and getting to a point where we can either switch the GIL to a read/write lock where a write lock is only needed for inter-interpreter communications, or else find a way for subinterpreters to release the GIL entirely by restricting them appropriately.

For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead), but there don't appear to be any insurmountable barriers to setting up an object ownership based system instead (code that accesses PyObject_HEAD fields directly rather than through the relevant macros and functions seems to be the most likely culprit for breaking, but I think "don't do that" is a reasonable answer there).

There's plenty of prior art here (including a system I once wrote in C myself atop TI's DSP/BIOS MBX and TSK APIs), so I'm comfortable with Eric's "simple matter of engineering" characterisation of the problem space.

The main reason that subinterpreters have never had a Python API before is that they have enough rough edges that having to write a custom C extension module to access the API is the least of your problems if you decide you need them. At the same time, not having a Python API not only makes them much harder to test, which means various aspects of their operation are more likely to be broken, but also makes them inherently CPython specific.

Eric's proposal essentially amounts to three things:

1. Filing off enough of the rough edges of the subinterpreter support that we're comfortable giving them a public Python level API that other interpreter implementations can reasonably support
2. Providing the primitives needed for safe and efficient message passing between subinterpreters
3. Allowing subinterpreters to truly execute in parallel on multicore machines

All 3 of those are useful enhancements in their own right, which offers the prospect of being able to make incremental progress towards the ultimate goal of native Python level support for distributing across multiple cores within a single process.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, 21 Jun 2015 16:31:33 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead)
And memcpy() updates pointer references to dependent objects magically? Surely you meant the memdeepcopy() function that's part of every standard C library! Regards Antoine.
On 21 June 2015 at 19:48, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 21 Jun 2015 16:31:33 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead)
And memcpy() updates pointer references to dependent objects magically? Surely you meant the memdeepcopy() function that's part of every standard C library!
We already have the tools to do deep copies of object trees (although I'll concede I *was* actually thinking in terms of the classic C/C++ mistake of carelessly copying pointers around when I wrote that particular message). One of the options for deep copies tends to be a pickle/unpickle round trip, which will still incur the serialisation overhead, but not the IPC overhead. "Faster message passing than multiprocessing" sets the baseline pretty low, after all.

However, this is also why Eric mentions the notions of object ownership or limiting channels to less than the full complement of Python objects. As an *added* feature at the Python level, it's possible to initially enforce restrictions that don't exist in the C level subinterpreter API, and then work to relax those restrictions over time.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
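As a small illustration of the pickle/unpickle round trip mentioned above (nothing here is specific to subinterpreters; it just shows that the round trip yields an independent deep copy while still paying the serialisation cost):

    import copy
    import pickle

    original = {"values": [1, 2, 3], "nested": {"flag": True}}

    via_pickle = pickle.loads(pickle.dumps(original))   # deep copy via serialisation
    via_deepcopy = copy.deepcopy(original)              # deep copy via copy.deepcopy

    via_pickle["values"].append(4)
    print(original["values"])     # [1, 2, 3]  -- the original is untouched
    print(via_pickle["values"])   # [1, 2, 3, 4]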
Nick Coghlan schrieb am 21.06.2015 um 12:25:
On 21 June 2015 at 19:48, Antoine Pitrou wrote:
On Sun, 21 Jun 2015 16:31:33 +1000 Nick Coghlan wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead)
And memcpy() updates pointer references to dependent objects magically? Surely you meant the memdeepcopy() function that's part of every standard C library!
We already have the tools to do deep copies of object trees (although I'll concede I *was* actually thinking in terms of the classic C/C++ mistake of carelessly copying pointers around when I wrote that particular message). One of the options for deep copies tends to be a pickle/unpickle round trip, which will still incur the serialisation overhead, but not the IPC overhead.
"Faster message passing than multiprocessing" sets the baseline pretty low, after all.
However, this is also why Eric mentions the notions of object ownership or limiting channels to less than the full complement of Python objects. As an *added* feature at the Python level, it's possible to initially enforce restrictions that don't exist in the C level subinterpreter API, and then work to relax those restrictions over time.
If objects can make it explicit that they support sharing (and preferably are allowed to implement the exact details themselves), I'm sure we'll find ways to share NumPy arrays across subinterpreters. That feature alone tends to be a quick way to make a lot of people happy. Stefan
On Sun, Jun 21, 2015 at 12:40:43PM +0200, Stefan Behnel wrote:
Nick Coghlan schrieb am 21.06.2015 um 12:25:
On 21 June 2015 at 19:48, Antoine Pitrou wrote:
On Sun, 21 Jun 2015 16:31:33 +1000 Nick Coghlan wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead)
And memcpy() updates pointer references to dependent objects magically? Surely you meant the memdeepcopy() function that's part of every standard C library!
We already have the tools to do deep copies of object trees (although I'll concede I *was* actually thinking in terms of the classic C/C++ mistake of carelessly copying pointers around when I wrote that particular message). One of the options for deep copies tends to be a pickle/unpickle round trip, which will still incur the serialisation overhead, but not the IPC overhead.
"Faster message passing than multiprocessing" sets the baseline pretty low, after all.
However, this is also why Eric mentions the notions of object ownership or limiting channels to less than the full complement of Python objects. As an *added* feature at the Python level, it's possible to initially enforce restrictions that don't exist in the C level subinterpreter API, and then work to relax those restrictions over time.
If objects can make it explicit that they support sharing (and preferably are allowed to implement the exact details themselves), I'm sure we'll find ways to share NumPy arrays across subinterpreters. That feature alone tends to be a quick way to make a lot of people happy.
FWIW, the following commit was all it took to get NumPy playing nicely with PyParallel: https://github.com/pyparallel/numpy/commit/046311ac1d66cec789fa8fd79b1b582a3... It uses thread-local buckets instead of static ones, and calls out to PyMem_Raw(Malloc|Realloc|Calloc|Free) instead of the normal libc counterparts. This means PyParallel will intercept the call within a parallel context and divert it to the per-context heap. Example parallel callback using NumPy: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33... (Also, datrie is a Cython module, and that seems to work fine as well, which is neat, as it means you could sub out the entire Python callback with a Cythonized version, including all the relatively-slow-compared-to-C http header parsing that happens in async.http.server.) Trent.
Hey Trent, You may be interested in this PR for Numpy: https://github.com/numpy/numpy/pull/5470 Regards Antoine.
FWIW, the following commit was all it took to get NumPy playing nicely with PyParallel:
https://github.com/pyparallel/numpy/commit/046311ac1d66cec789fa8fd79b1b582a3...
On Sun, Jun 21, 2015 at 4:40 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
If objects can make it explicit that they support sharing (and preferably are allowed to implement the exact details themselves), I'm sure we'll find ways to share NumPy arrays across subinterpreters. That feature alone tends to be a quick way to make a lot of people happy.
Are you thinking of something along the lines of a dunder method (e.g. __reduce__)? -eric
Eric Snow schrieb am 24.06.2015 um 06:15:
On Sun, Jun 21, 2015 at 4:40 AM, Stefan Behnel wrote:
If objects can make it explicit that they support sharing (and preferably are allowed to implement the exact details themselves), I'm sure we'll find ways to share NumPy arrays across subinterpreters. That feature alone tends to be a quick way to make a lot of people happy.
Are you thinking of something along the lines of a dunder method (e.g. __reduce__)?
Sure. Should not be the first problem to tackle here, but dunder methods would be the obvious way to interact with whatever "share/move/copy between subinterpreters" protocol there will be. Stefan
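A purely hypothetical sketch of what such a dunder protocol could look like; neither __share__/__unshare__ nor the channel machinery exists anywhere, and the names are made up here only to show the shape of the idea:

    class SharedArray:
        """Toy object that opts in to a hypothetical sharing protocol."""

        def __init__(self, buffer, shape, dtype):
            self.buffer = buffer      # e.g. a PEP 3118 buffer
            self.shape = shape
            self.dtype = dtype

        def __share__(self):
            # Hand the channel just enough to rebuild a view on the other
            # side without copying the underlying buffer.
            return (self.buffer, self.shape, self.dtype)

        @classmethod
        def __unshare__(cls, state):
            buffer, shape, dtype = state
            return cls(buffer, shape, dtype)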
On Sun, 21 Jun 2015 20:25:47 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 21 June 2015 at 19:48, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 21 Jun 2015 16:31:33 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead)
And memcpy() updates pointer references to dependent objects magically? Surely you meant the memdeepcopy() function that's part of every standard C library!
We already have the tools to do deep copies of object trees [...] "Faster message passing than multiprocessing" sets the baseline pretty low, after all.
What's the goal? 10% faster? Or 10x? copy.deepcopy() uses similar internal mechanisms as pickle... Regards Antoine.
On 21 June 2015 at 20:41, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 21 Jun 2015 20:25:47 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 21 June 2015 at 19:48, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 21 Jun 2015 16:31:33 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead)
And memcpy() updates pointer references to dependent objects magically? Surely you meant the memdeepcopy() function that's part of every standard C library!
We already have the tools to do deep copies of object trees [...] "Faster message passing than multiprocessing" sets the baseline pretty low, after all.
What's the goal? 10% faster? Or 10x? copy.deepcopy() uses similar internal mechanisms as pickle...
I'd want us to eventually aim for zero-copy speed for at least known immutable values (int, str, float, etc), immutable containers of immutable values (tuple, frozenset), and for types that support both publishing and consuming data via the PEP 3118 buffer protocol without making a copy. For everything else I'd be fine with a starting point that was at least no slower than multiprocessing (which shouldn't be difficult, since we'll at least save the IPC overhead even if there are cases where communication between subinterpreters falls back to serialisation rather than doing something more CPU and memory efficient). As an implementation strategy, I'd actually suggest starting with *only* the latter for simplicity's sake, even though it misses out on some of the potential speed benefits of sharing an address space. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
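For the PEP 3118 piece, the zero-copy behaviour already exists within a single interpreter; a memoryview is a view onto the producer's buffer, not a copy of it (the cross-interpreter transport is what the proposal would add):

    import array

    data = array.array("d", [1.0, 2.0, 3.0])
    view = memoryview(data)      # no copy of the underlying buffer is made
    print(view[1])               # 2.0
    data[1] = 42.0
    print(view[1])               # 42.0 -- the view sees the change
    print(view[:2].tolist())     # slicing a memoryview also avoids copying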
On Sun, Jun 21, 2015 at 4:57 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I'd want us to eventually aim for zero-copy speed for at least known immutable values (int, str, float, etc), immutable containers of immutable values (tuple, frozenset), and for types that support both publishing and consuming data via the PEP 3118 buffer protocol without making a copy.
For everything else I'd be fine with a starting point that was at least no slower than multiprocessing (which shouldn't be difficult, since we'll at least save the IPC overhead even if there are cases where communication between subinterpreters falls back to serialisation rather than doing something more CPU and memory efficient).
Makes sense. -eric
On Sun, Jun 21, 2015 at 4:25 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
We already have the tools to do deep copies of object trees (although I'll concede I *was* actually thinking in terms of the classic C/C++ mistake of carelessly copying pointers around when I wrote that particular message). One of the options for deep copies tends to be a pickle/unpickle round trip, which will still incur the serialisation overhead, but not the IPC overhead.
This does make me wonder if it would be worth pursuing a mechanism for encapsulating an object graph, such that it would be easier to manage/copy the graph as a whole.
"Faster message passing than multiprocessing" sets the baseline pretty low, after all.
However, this is also why Eric mentions the notions of object ownership or limiting channels to less than the full complement of Python objects. As an *added* feature at the Python level, it's possible to initially enforce restrictions that don't exist in the C level subinterpreter API, and then work to relax those restrictions over time.
Precisely. -eric
On Sat, Jun 20, 2015 at 11:31 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead), but there don't appear to be any insurmountable barriers to setting up an object ownership based system instead (code that accesses PyObject_HEAD fields directly rather than through the relevant macros and functions seems to be the most likely culprit for breaking, but I think "don't do that" is a reasonable answer there).
The comparison is unfair -- if you can share between subinterpreters using memcpy, then you can share between processes using just a socket write, and multiprocessing becomes nearly just as fast.
Eric's proposal essentially amounts to three things:
1. Filing off enough of the rough edges of the subinterpreter support that we're comfortable giving them a public Python level API that other interpreter implementations can reasonably support
2. Providing the primitives needed for safe and efficient message passing between subinterpreters
3. Allowing subinterpreters to truly execute in parallel on multicore machines
All 3 of those are useful enhancements in their own right, which offers the prospect of being able to make incremental progress towards the ultimate goal of native Python level support for distributing across multiple cores within a single process.
Why is that the goal? Whatever faults processes have, those are the problems, surely not processes in and of themselves, right? e.g. if the reason we don't like multiprocessed python is extra memory use, it's memory use we're opposed to. A solution that gives us parallel threads, but doesn't decrease memory consumption, doesn't solve anything. The solution has threads that are remarkably like processes, so I think it's really important to be careful about the differences and why this solution has the advantage. I'm not seeing that. And remember that we *do* have many examples of people using parallelized Python code in production. Are you sure you're satisfying their concerns, or whose concerns are you trying to satisfy? -- Devin
Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
The comparison is unfair -- if you can share between subinterpreters using memcpy, then you can share between processes using just a socket write, and multiprocessing becomes nearly just as fast.
That is the main issue here. Writing to a pipe or a Unix socket is implemented with a memcpy in the kernel. So there is just a tiny constant overhead compared to using memcpy within a process. And with shared memory as IPC even this tiny overhead can be removed. The main overhead in communicating Python objects in multiprocessing is the serialization with pickle. So there is basically nothing to gain unless this part can be omitted.

There is an erroneous belief among Windows programmers that "IPC is slow". But that is because they are using an out-of-process DCOM server, CORBA, XMLRPC or something equally atrocious. A plain named pipe transaction is not in any way slow on Windows.

Sturla
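A rough sketch of where the time goes, for anyone who wants to reproduce the claim; absolute numbers vary a lot by machine and payload, and only a single 64 KiB chunk is pushed through the pipe to keep the example from blocking on the pipe buffer:

    import os
    import pickle
    import time

    payload = [{"x": i, "y": str(i)} for i in range(100000)]

    t0 = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = time.perf_counter()

    r, w = os.pipe()
    t2 = time.perf_counter()
    os.write(w, blob[:65536])    # one pipe-buffer-sized chunk of raw bytes
    os.read(r, 65536)
    t3 = time.perf_counter()

    print("pickle: %.4f s for %d bytes" % (t1 - t0, len(blob)))
    print("pipe round trip of 64 KiB: %.6f s" % (t3 - t2))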
On Sun, Jun 21, 2015 at 6:13 AM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
The solution has threads that are remarkably like processes, so I think it's really important to be careful about the differences and why this solution has the advantage. I'm not seeing that.
Good point. I still think there are some significant differences (as already explained).
And remember that we *do* have many examples of people using parallelized Python code in production. Are you sure you're satisfying their concerns, or whose concerns are you trying to satisfy?
Another good point. What would you suggest is the best way to find out? -eric
On Sun, Jun 21, 2015 at 12:31 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The fact that mod_wsgi can run most Python web applications in a subinterpreter quite happily means we already know the core mechanism works fine,
This is a pretty important point.
and there don't appear to be any insurmountable technical hurdles between the status quo and getting to a point where we can either switch the GIL to a read/write lock where a write lock is only needed for inter-interpreter communications, or else find a way for subinterpreters to release the GIL entirely by restricting them appropriately.
Proper multi-core operation will require at least some changes relative to the GIL. My goal is to execute the least amount of change at first. We can build on that.
For inter-interpreter communication, the worst case scenario is having to rely on a memcpy based message passing system (which would still be faster than multiprocessing's serialisation + IPC overhead),
By initially focusing on immutable objects we shouldn't need to go that far. That said, a memcpy-based solution may very well be a good next step once the basic goals of the project are met.
but there don't appear to be any insurmountable barriers to setting up an object ownership based system instead
Agreed. That's something we can experiment with once we get the core of the project working.
(code that accesses PyObject_HEAD fields directly rather than through the relevant macros and functions seems to be the most likely culprit for breaking, but I think "don't do that" is a reasonable answer there).
:)
There's plenty of prior art here (including a system I once wrote in C myself atop TI's DSP/BIOS MBX and TSK APIs), so I'm comfortable with Eric's "simple matter of engineering" characterisation of the problem space.
Good. :)
The main reason that subinterpreters have never had a Python API before is that they have enough rough edges that having to write a custom C extension module to access the API is the least of your problems if you decide you need them. At the same time, not having a Python API not only makes them much harder to test, which means various aspects of their operation are more likely to be broken, but also makes them inherently CPython specific.
Eric's proposal essentially amounts to three things:
1. Filing off enough of the rough edges of the subinterpreter support that we're comfortable giving them a public Python level API that other interpreter implementations can reasonably support
2. Providing the primitives needed for safe and efficient message passing between subinterpreters
3. Allowing subinterpreters to truly execute in parallel on multicore machines
All 3 of those are useful enhancements in their own right, which offers the prospect of being able to make incremental progress towards the ultimate goal of native Python level support for distributing across multiple cores within a single process.
Yep. That sums it up pretty well. That decomposition should make it a bit easier to move the project forward. -eric
On Sat, Jun 20, 2015 at 11:25 PM, Nathaniel Smith <njs@pobox.com> wrote:
I'd love to see just a hand wavy, verbal proof-of-concept walking through how this might work in some simple but realistic case. To me a single compelling example could make this proposal feel much more concrete and achievable.
Here's a vague example:
------------------
from subinterpreters import Subinterpreter, Channel

def handle_job(val):
    if not isinstance(val, (int, float)):
        raise RuntimeError("{!r} not a valid arg".format(val))
    # something potentially expensive...

def runner(ch):
    while True:
        value = ch.pop()  # blocks
        if value is None:
            break
        handle_job(value)

ch = Channel()
sub = Subinterpreter()
task = sub.run(runner, ch)
data = get_data()
for immutable_item in data:
    ch.push(immutable_item)
if task.is_alive():
    ch.push(None)
task.join()
exc = task.exception()
if exc is not None:
    raise RuntimeError from exc

def verify(data):
    # make sure runner did its job
    ...

task = sub.run(verify, data)
# do other stuff while we wait
task.join()
sub.destroy()
------------------
There aren't really many options for mutable objects, right? If you want shared nothing semantics, then transmitting a mutable object either needs to make a copy, or else be a real transfer, where the sender no longer has it (cf. Rust).
I guess for the latter you'd need some new syntax for send-and-del, that requires the object to be self contained (all mutable objects reachable from it are only referenced by each other) and have only one reference in the sending process (which is the one being sent and then destroyed).
Right. The idea of a self-contained object graph is something we'd need if we went that route. That's why initially we should focus on sharing only immutable objects.
Keep in mind that by "immutability" I'm talking about *really* immutable, perhaps going so far as treating the full memory space associated with an object as frozen. For instance, we'd have to ensure that "immutable" Python objects like strings, ints, and tuples do not change (i.e. via the C API).
This seems like a red herring to me. It's already the case that you can't legally use the c api to mutate tuples, ints, for any object that's ever been, say, passed to a function. So for these objects, the subinterpreter setup doesn't actually add any new constraints on user code.
Fair enough.
C code is always going to be *able* to break memory safety so long as you're using shared-memory threading at the c level to implement this stuff. We just need to make it easy not to.
Exactly.
Refcnts and garbage collection are another matter, of course.
Agreed. :) -eric
Exciting!

* http://zero-buffer.readthedocs.org/en/latest/api-reference/#zero_buffer.Buff...
* https://www.google.com/search?q=python+channels
* https://docs.python.org/2/library/asyncore.html#module-asyncore
* https://chan.readthedocs.org/en/latest/
* https://goless.readthedocs.org/en/latest/
* other approaches to the problem (with great APIs):
  * http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
  * http://discodb.readthedocs.org/en/latest/

On Jun 20, 2015 5:55 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Jun 20, 2015 4:08 PM, "Nathaniel Smith" <njs@pobox.com> wrote:
On Jun 20, 2015 2:42 PM, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This all sounds really cool if you can pull it off, and shared-nothing
threads do seem like the least impossible model to pull off.
Agreed.
But "least impossible" and "possible" are different :-). From your email I can't tell whether this plan is viable while preserving backcompat and memory safety.
I agree that those issues must be clearly solved in the proposal before it can be approved. I'm confident the approach I'm pursuing will afford us the necessary guarantees. I'll address those specific points directly when I can sit down and organize my thoughts.
Suppose I have a queue between two subinterpreters, and on this queue I
place a list of dicts of user-defined-in-python objects, each of which holds a reference to a user-defined-via-the-C-api object. What happens next?
You've hit upon exactly the trickiness involved and why I'm thinking the best approach initially is to only allow *strictly* immutable objects to pass between interpreters. Admittedly, my description of channels is very vague. :) There are a number of possibilities with them that I'm still exploring (CSP has particular opinions...), but immutability is a characteristic that may provide the simplest *initial* approach. Going that route shouldn't preclude adding some sort of support for mutable objects later.
Keep in mind that by "immutability" I'm talking about *really* immutable, perhaps going so far as treating the full memory space associated with an object as frozen. For instance, we'd have to ensure that "immutable" Python objects like strings, ints, and tuples do not change (i.e. via the C API). The contents of involved tuples/containers would have to be likewise immutable. Even changing refcounts could be too much, hence the idea of moving refcounts out to a separate table.
This level of immutability would be something new to Python. We'll see if it's necessary. If it isn't too much work it might be a good idea regardless of the multi-core proposal.
Also note that Barry has a (rejected) PEP from a number of years ago about freezing objects... That idea is likely out of scope as relates to my proposal, but it certainly factors in the problem space.
-eric
On Sun, Jun 21, 2015 at 12:41 AM, Wes Turner <wes.turner@gmail.com> wrote:
Exciting!
* http://zero-buffer.readthedocs.org/en/latest/api-reference/#zero_buffer.Buff...
* https://www.google.com/search?q=python+channels
* https://docs.python.org/2/library/asyncore.html#module-asyncore
* https://chan.readthedocs.org/en/latest/
* https://goless.readthedocs.org/en/latest/
* other approaches to the problem (with great APIs):
  * http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
  * http://discodb.readthedocs.org/en/latest/
Thanks. -eric
On 06/20/2015 06:54 PM, Eric Snow wrote:
Also note that Barry has a (rejected) PEP from a number of years ago about freezing objects... That idea is likely out of scope as relates to my proposal, but it certainly factors in the problem space.
How about instead of freezing, just modify a flag or counter if it's mutated. That could be turned off by default. Then have a way to turn on an ObjectMutated warning or exception if any object is modified within a routine, code block, or function.

With something like that, small parts of Python can be tested and made less mutable in small sections at a time. Possibly working from the inside out. It doesn't force immutability but instead asks for it. A small but not quite so impossible step. (?)

Cheers, Ron
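A rough pure-Python sketch of the opt-in mutation warning described above; a real version would have to live in the interpreter itself, and the WatchedList/MutationWarning names are made up purely for illustration:

    import warnings

    class MutationWarning(UserWarning):
        pass

    class WatchedList(list):
        watching = False             # off by default, as suggested above

        def _note_mutation(self):
            if WatchedList.watching:
                warnings.warn("object mutated inside a watched block",
                              MutationWarning, stacklevel=3)

        def append(self, item):
            self._note_mutation()
            super().append(item)

        def __setitem__(self, index, value):
            self._note_mutation()
            super().__setitem__(index, value)

    data = WatchedList([1, 2, 3])
    data.append(4)                   # silent: watching is off

    WatchedList.watching = True
    data[0] = 99                     # emits a MutationWarning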
On Sun, Jun 21, 2015 at 7:42 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
* disallow forking within subinterpreters
I love the idea as a whole (if only because the detractors can be told "Just use subinterpreters, then you get concurrency"), but this seems like a tricky restriction. That means no subprocess.Popen, no shelling out to other applications. And I don't know what of other restrictions might limit any given program. Will it feel like subinterpreters are "write your code according to these tight restrictions and it'll work", or will it be more of "most programs will run in parallel just fine, but there are a few things to be careful of"? ChrisA
On Sat, Jun 20, 2015 at 6:41 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:42 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
* disallow forking within subinterpreters
I love the idea as a whole (if only because the detractors can be told "Just use subinterpreters, then you get concurrency"), but this seems like a tricky restriction. That means no subprocess.Popen, no shelling out to other applications. And I don't know what of other restrictions might limit any given program.
This is just something I'm thinking about. To be honest, forking probably won't be a problem. Furthermore, if there were any restriction it would likely just be on forking Python (a la multiprocessing). However, I doubt there will be a need to pursue such a restriction. As I said, there are still a lot of open questions and subtle details to sort out.
Will it feel like subinterpreters are "write your code according to these tight restrictions and it'll work", or will it be more of "most programs will run in parallel just fine, but there are a few things to be careful of"?
I expect that will be somewhat the case no matter what. The less restrictions the better, though. :) It's a balancing act because I expect that with some initial restrictions we can land the feature sooner. Then we could look into how to relax the restrictions. I just want to be careful that we don't paint ourselves into a corner in that regard. -eric
On 21 June 2015 at 10:41, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:42 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
* disallow forking within subinterpreters
I love the idea as a whole (if only because the detractors can be told "Just use subinterpreters, then you get concurrency"), but this seems like a tricky restriction. That means no subprocess.Popen, no shelling out to other applications. And I don't know what of other restrictions might limit any given program. Will it feel like subinterpreters are "write your code according to these tight restrictions and it'll work", or will it be more of "most programs will run in parallel just fine, but there are a few things to be careful of"?
To calibrate expectations appropriately, it's worth thinking about the concept of Python level subinterpreter support as being broadly comparable to the JavaScript concept of web worker threads. mod_wsgi's use of the existing CPython specific subinterpreter support when embedding CPython in Apache httpd means we already know subinterpreters largely "just work" in the absence of low level C shenanigans in extension modules, but we also know keeping subinterpreters clearly subordinate to the main interpreter simplifies a number of design and implementation aspects (just as having a main thread simplified various aspects of the threading implementation), and that there will likely be things the main interpreter can do that subinterpreters can't. A couple of possible examples: * as Eric noted, we don't know yet if we'll be able to safely let subinterpreters launch subprocesses (especially via fork) * there may be restrictions on some extension modules that limit them to "main interpreter only" (e.g. if the extension module itself isn't thread-safe, then it will need to remain fully protected by the GIL) The analogous example with web workers is the fact that they don't have any access to the window object, document object or parent object in the browser DOM. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan schrieb am 21.06.2015 um 03:28:
* there may be restrictions on some extension modules that limit them to "main interpreter only" (e.g. if the extension module itself isn't thread-safe, then it will need to remain fully protected by the GIL)
Just an idea, but C extensions could opt-in to this. Calling into them has to go through some kind of callable type, usually PyCFunction. We could protect all calls to extension types and C functions with a global runtime lock (per process, not per interpreter) and Extensions could set a flag on their functions and methods (or get it inherited from their extension types etc.) that says "I don't need the lock". That allows for a very fine-grained transition. Stefan
On Sun, Jun 21, 2015 at 7:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Nick Coghlan schrieb am 21.06.2015 um 03:28:
* there may be restrictions on some extension modules that limit them to "main interpreter only" (e.g. if the extension module itself isn't thread-safe, then it will need to remain fully protected by the GIL)
Just an idea, but C extensions could opt-in to this. Calling into them has to go through some kind of callable type, usually PyCFunction. We could protect all calls to extension types and C functions with a global runtime lock (per process, not per interpreter) and Extensions could set a flag on their functions and methods (or get it inherited from their extension types etc.) that says "I don't need the lock". That allows for a very fine-grained transition.
Exactly. PEP 489 helps facilitate opting in as well, right? -eric
On 24 June 2015 at 15:33, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Nick Coghlan schrieb am 21.06.2015 um 03:28:
* there may be restrictions on some extension modules that limit them to "main interpreter only" (e.g. if the extension module itself isn't thread-safe, then it will need to remain fully protected by the GIL)
Just an idea, but C extensions could opt-in to this. Calling into them has to go through some kind of callable type, usually PyCFunction. We could protect all calls to extension types and C functions with a global runtime lock (per process, not per interpreter) and Extensions could set a flag on their functions and methods (or get it inherited from their extension types etc.) that says "I don't need the lock". That allows for a very fine-grained transition.
Exactly. PEP 489 helps facilitate opting in as well, right?
Yep, as PEP 489 requires subinterpreter compatibility as a precondition for using multi-phase initialisation :) Cheers, Nick. P.S. Technically, what it actually requires is support for "multiple instances of the module existing in the same process at the same time", as it really recreates the module if you remove it from sys.modules and import it again, unlike single phase initialisation. But that's a mouthful, so "must support subinterpreters" is an easier shorthand. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 24.06.2015 10:22, Nick Coghlan wrote:
On 24 June 2015 at 15:33, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Nick Coghlan schrieb am 21.06.2015 um 03:28:
* there may be restrictions on some extension modules that limit them to "main interpreter only" (e.g. if the extension module itself isn't thread-safe, then it will need to remain fully protected by the GIL)
Just an idea, but C extensions could opt-in to this. Calling into them has to go through some kind of callable type, usually PyCFunction. We could protect all calls to extension types and C functions with a global runtime lock (per process, not per interpreter) and Extensions could set a flag on their functions and methods (or get it inherited from their extension types etc.) that says "I don't need the lock". That allows for a very fine-grained transition.
Exactly. PEP 489 helps facilitate opting in as well, right?
Yep, as PEP 489 requires subinterpreter compatibility as a precondition for using multi-phase initialisation :)
Cheers, Nick.
P.S. Technically, what it actually requires is support for "multiple instances of the module existing in the same process at the same time", as it really recreates the module if you remove it from sys.modules and import it again, unlike single phase initialisation. But that's a mouthful, so "must support subinterpreters" is an easier shorthand.
Note that extension modules often interface to other C libraries which typically use some setup logic that is not thread-safe, but is used to initialize the other, thread-safe parts. E.g. setting up locks and shared memory for all threads to use is a typical scenario you find in such libs. A requirement to be able to import modules multiple times would pretty much kill the idea for those modules.

That said, I don't think this is really needed. Modules would only have to be made aware that there is a global first-time setup phase and a later shutdown/reinit phase. As a result, the module DLL would load only once, but then use the new module setup logic to initialize its own state multiple times.

That said, I still think the multiple-process approach is a better one (more robust, more compatible, fewer problems). We'd just need a way more efficient approach to sharing objects between the Python processes than using pickle and shared memory or pipes :-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 24 2015)
On 24/06/15 13:43, M.-A. Lemburg wrote:
That said, I still think the multiple-process approach is a better one (more robust, more compatible, fewer problems). We'd just need a way more efficient approach to sharing objects between the Python processes than using pickle and shared memory or pipes :-)
It is hard to get around shared memory, Unix domain sockets, or pipes. There must be some sort of IPC, regardless. One idea I have played with is to use a specialized queue instead of the current multiprocessing.Queue. In scientific computing we often need to pass arrays, so it would make sense to have a queue that could bypass pickle for NumPy arrays, scalars and dtypes, simply by using the NumPy C API to process the data. It could also have specialized code for a number of other objects -- at least str, int, float, complex, and PEP 3118 buffers, but perhaps also simple lists, tuples and dicts with these types. I think it should be possible to make a queue that would avoid the pickle issue for 99 % of scientific computing. It would be very easy to write such a queue with Cython and e.g. have it as a part of NumPy or SciPy. One thing I did some years ago was to have NumPy arrays that would store the data in shared memory. And when passed to multiprocessing.Queue they would not pickle the data buffer, only the metadata. However this did not improve on performance, because the pickle overhead was still there, and passing a lot of binary data over a pipe was not comparably expensive. So while it would save memory, it did not make programs using multiprocessing and NumPy more efficient. Sturla
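A rough pure-Python sketch of the specialised queue idea, assuming NumPy is available: only the dtype and shape go through pickle, while the array's buffer travels as raw bytes (tobytes() still copies once; a real version would be written in Cython against the NumPy C API, as suggested above):

    import numpy as np
    from multiprocessing import Pipe

    def send_array(conn, arr):
        arr = np.ascontiguousarray(arr)
        conn.send((str(arr.dtype), arr.shape))   # tiny metadata message
        conn.send_bytes(arr.tobytes())           # raw buffer, payload is not pickled

    def recv_array(conn):
        dtype, shape = conn.recv()
        data = conn.recv_bytes()
        # frombuffer returns a read-only view over the received bytes
        return np.frombuffer(data, dtype=dtype).reshape(shape)

    parent, child = Pipe()
    send_array(parent, np.arange(12, dtype=np.float64).reshape(3, 4))
    print(recv_array(child))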
On 24.06.2015 18:58, Sturla Molden wrote:
On 24/06/15 13:43, M.-A. Lemburg wrote:
That said, I still think the multiple-process approach is the better one (more robust, more compatible, fewer problems). We'd just need a way more efficient approach to sharing objects between the Python processes than using pickle and shared memory or pipes :-)
It is hard to get around shared memory, Unix domain sockets, or pipes. There must be some sort of IPC, regardless.
Sure, but the current approach of pickling Python objects for communication is just too much overhead in many cases - it also duplicates the memory requirements when using the multiple process approach since you eventually end up having n copies of the same data in memory (with n = number of parallel workers).
One idea I have played with is to use a specialized queue instead of the current multiprocessing.Queue. In scientific computing we often need to pass arrays, so it would make sense to have a queue that could bypass pickle for NumPy arrays, scalars and dtypes, simply by using the NumPy C API to process the data. It could also have specialized code for a number of other objects -- at least str, int, float, complex, and PEP 3118 buffers, but perhaps also simple lists, tuples and dicts with these types. I think it should be possible to make a queue that would avoid the pickle issue for 99 % of scientific computing. It would be very easy to write such a queue with Cython and e.g. have it as a part of NumPy or SciPy.
The tricky part is managing pointers in those data structures, e.g. container types for other Python objects will have to store all referenced objects in the shared memory segment as well. For NumPy arrays using simple types this is a lot easier, since you don't have to deal with pointers to other objects.
One thing I did some years ago was to have NumPy arrays that would store the data in shared memory. And when passed to multiprocessing.Queue they would not pickle the data buffer, only the metadata. However this did not improve on performance, because the pickle overhead was still there, and passing a lot of binary data over a pipe was not comparably expensive. So while it would save memory, it did not make programs using multiprocessing and NumPy more efficient.
When saying "passing a lot of binary data over a pipe" you mean the meta-data ? I had discussed the idea of Python object sharing with Larry Hastings back in 2013, but decided that trying to get all references of containers managed in the shared memory would be too fragile an approach to pursue further. Still, after some more research later that year, I found that someone already had investigated the idea in 2003: http://poshmodule.sourceforge.net/ Reading the paper on this: http://poshmodule.sourceforge.net/posh/posh.pdf made me wonder why this idea never received more attention in all these years. His results are clearly positive and show that the multiple process approach can provide better scalability than using threads when combined with shared memory object storage. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 24 2015)
On 24/06/15 22:50, M.-A. Lemburg wrote:
The tricky part is managing pointers in those data structures, e.g. container types for other Python objects will have to store all referenced objects in the shared memory segment as well.
If a container type for Python objects contains some unknown object type we would have to use pickle as fallback.
For NumPy arrays using simple types this is a lot easier, since you don't have to deal with pointers to other objects.
The objects we deal with in scientific computing are usually arrays with a rather regular structure, not deeply nested Python objects. Even a more complex object like scipy.spatial.cKDTree is just a collection of a few contiguous arrays under the hood. So we could, for the most part, squash the pickle overhead that most users will encounter by specializing a queue that has knowledge of a small set of Python types.
When saying "passing a lot of binary data over a pipe" you mean the meta-data ?
No, I mean the buffer pointed to by PyArray_DATA(obj) when using the NumPy C API. We have to send a lot of raw bytes over an IPC mechanism before this communication compares to the pickle overhead. Sturla
On 24/06/15 23:41, Sturla Molden wrote:
So we could for most parts squash the pickle overhead that anyone will encounter by specializing a queue that has knowledge about a small set of Python types.
But this would be very domain specific to scientific and numerical computing; it would not be a general improvement for multiprocessing with Python. Sturla
On Jun 24, 2015 4:49 PM, "Sturla Molden" <sturla.molden@gmail.com> wrote:
On 24/06/15 23:41, Sturla Molden wrote:
So we could for most parts squash the pickle overhead that anyone will encounter by specializing a queue that has knowledge about a small set of Python types.
But this would be very domain specific to scientific and numerical computing; it would not be a general improvement for multiprocessing with Python. Sturla
Basically C structs like Thrift or Protocol Buffers?
On 24 June 2015 at 21:43, M.-A. Lemburg <mal@egenix.com> wrote:
Note that extension modules often interface to other C libraries which typically use some setup logic that is not thread safe, but is used to initialize the other thread safe parts. E.g. setting up locks and shared memory for all threads to use is a typical scenario you find in such libs.
A requirement to be able to import modules multiple times would pretty much kill the idea for those modules.
Yep, that's the reason earlier versions of PEP 489 included the notion of "singleton modules". We ended up deciding to back that out for the time being, and instead leave those modules using the existing single phase initialisation model.
That said, I don't think this is really needed. Modules would only have to be made aware that there is a global first time setup phase and a later shutdown/reinit phase.
As a result, the module DLL would load only once, but then use the new module setup logic to initialize its own state multiple times.
Aye, buying more time to consider alternative designs was the reason we dropped the "singleton module" idea from multi-phase initialisation until 3.6 at the earliest. I think your idea here has potential - it should just require a new Py_mod_setup slot identifier, and a bit of additional record keeping to track which modules had already had their setup slots invoked. (It's conceivable there could also be a process-wide Py_mod_teardown slot, but that gets messy in the embedded interpreter case where we might have multiple Py_Initialize/Py_Finalize cycles) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sunday, June 21, 2015 at 6:12:22 AM UTC+5:30, Chris Angelico wrote:
On Sun, Jun 21, 2015 at 7:42 AM, Eric Snow <ericsnow...@gmail.com> wrote:
* disallow forking within subinterpreters
I love the idea as a whole (if only because the detractors can be told "Just use subinterpreters, then you get concurrency"), but this seems like a tricky restriction. That means no subprocess.Popen, no shelling out to other applications. And I don't know what other restrictions might limit any given program. Will it feel like subinterpreters are "write your code according to these tight restrictions and it'll work", or will it be more of "most programs will run in parallel just fine, but there are a few things to be careful of"?
ChrisA
It's good to get our terminology right: Are we talking parallelism or concurrency? Some references on the distinction: Bob Harper: https://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurre... Rob Pike: http://concur.rspace.googlecode.com/hg/talk/concur.html#landing-slide [Or if you prefer the more famous https://www.youtube.com/watch?v=cN_DpYBzKso ]
On Sat, Jun 20, 2015 at 5:42 PM Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:42 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
* disallow forking within subinterpreters
I love the idea as a whole (if only because the detractors can be told "Just use subinterpreters, then you get concurrency"), but this seems like a tricky restriction. That means no subprocess.Popen, no shelling out to other applications. And I don't know what of other restrictions might limit any given program. Will it feel like subinterpreters are "write your code according to these tight restrictions and it'll work", or will it be more of "most programs will run in parallel just fine, but there are a few things to be careful of"?
It wouldn't disallow use of subprocess, only os.fork(). C extension modules can always fork. The restriction being placed in this scheme is: "if your extension module code forks from a subinterpreter, the child process MUST not return control to Python." I'm not sure if this restriction would actually be *needed* or not but I agree with it regardless. -gps
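For illustration, the fork-and-exec pattern that rule still allows, sketched at the Python level (POSIX-only; this is roughly what subprocess does for you on POSIX anyway):

    import os

    pid = os.fork()
    if pid == 0:
        try:
            os.execvp("ls", ["ls", "-l"])   # child replaces itself with another program
        finally:
            os._exit(127)                   # if exec fails, exit without returning to Python
    else:
        os.waitpid(pid, 0)                  # parent waits for the child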
On Tue, Jun 23, 2015 at 3:03 AM, Gregory P. Smith <greg@krypto.org> wrote:
On Sat, Jun 20, 2015 at 5:42 PM Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:42 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
* disallow forking within subinterpreters
I love the idea as a whole (if only because the detractors can be told "Just use subinterpreters, then you get concurrency"), but this seems like a tricky restriction. That means no subprocess.Popen, no shelling out to other applications. And I don't know what of other restrictions might limit any given program. Will it feel like subinterpreters are "write your code according to these tight restrictions and it'll work", or will it be more of "most programs will run in parallel just fine, but there are a few things to be careful of"?
It wouldn't disallow use of subprocess, only os.fork(). C extension modules can always fork. The restriction being placed in this scheme is: "if your extension module code forks from a subinterpreter, the child process MUST not return control to Python."
I'm not sure if this restriction would actually be needed or not but I agree with it regardless.
Oh! That's fine, then. Sounds good to me! ChrisA
On Sat, Jun 20, 2015 at 03:42:33PM -0600, Eric Snow wrote:
* only allow passing plain functions to Task() and Subinterpreter.run() (exclude closures, other callables)
That doesn't sound very Pythonic to me. That's going to limit the usefulness of these subinterpreters.
* object ownership model + read-only in all but 1 subinterpreter + RW in all subinterpreters
Isn't that a contradiction? If objects are read-only in all subinterpreters (except one), how can they be read/write in all subinterpreters? All this talk about subinterpreters reminds me of an interesting blog post by Armin Ronacher: http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see He's quite critical of a number of internal details of the CPython interpreter. But what I take from his post is that there could be significant advantages to giving the CPython interpreter its own local environment, like Lua and Javascript typically do, rather than the current model where there is a single process-wide global environment. Instead of having multiple subinterpreters all running inside the main interpreter, you could have multiple interpreters running in the same process, each with their own environment. I may be completely misinterpreting things here, but as I understand it, this would remove the need for the GIL, allowing even plain old threads to take advantage of multiple cores. But that's a separate issue. Armin writes: I would like to see an internal interpreter design could be based on interpreters that work independent of each other, with local base types and more, similar to how JavaScript works. This would immediately open up the door again for embedding and concurrency based on message passing. CPUs won't get any faster :) (He also talks about CPython's tp_slots system, but that's a separate issue, I think.) Now I have no idea if Armin is correct, or whether I am even interpreting his post correctly. But I'd like to hear people's thoughts on how this might interact with Eric's suggestion. -- Steve
On Jun 20, 2015 9:38 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
On Sat, Jun 20, 2015 at 03:42:33PM -0600, Eric Snow wrote:
* only allow passing plain functions to Task() and Subinterpreter.run() (exclude closures, other callables)
That doesn't sound very Pythonic to me. That's going to limit the usefulness of these subinterpreters.
It certainly would limit their usefulness. It's a tradeoff to make the project tractable. I'm certainly not opposed to dropping such restrictions, now or as a follow-up project. Also keep in mind that the restriction is only something I'm considering. It's too early to settle on many of these details.
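A hedged aside on why a "plain functions only" restriction keeps things tractable: multiprocessing's spawn start method has an analogous requirement, because a plain module-level function can be recreated in another interpreter from just its module and qualified name, while closures and lambdas capture live objects from the creating interpreter:

    import pickle

    def plain(n):
        return n * n

    adder = (lambda x: (lambda n: n + x))(10)   # a closure over x

    print(len(pickle.dumps(plain)))   # works: serialized by reference to its name
    try:
        pickle.dumps(adder)
    except Exception as exc:          # PicklingError/AttributeError, depending on version
        print("cannot serialize closure:", type(exc).__name__)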
* object ownership model + read-only in all but 1 subinterpreter + RW in all subinterpreters
Isn't that a contradiction? If objects are read-only in all subinterpreters (except one), how can they be read/write in all subinterpreters?
True. The two statements, like the rest in the section, are summarizing different details and ideas into which I've been looking. Several of them are mutually exclusive.
All this talk about subinterpreters reminds me of an interesting blog post by Armin Ronacher:
http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see
Interesting. I'd read that before, but not recently. Armin has some interesting points but I can't say that I agree with his analysis or his conclusions. Regardless...
He's quite critical of a number of internal details of the CPython interpreter. But what I take from his post is that there could be significant advantages to giving the CPython interpreter its own local environment, like Lua and Javascript typically do, rather than the current model where there is a single process-wide global environment. Instead of having multiple subinterpreters all running inside the main interpreter, you could have multiple interpreters running in the same process, each with their own environment.
But that's effectively the goal! This proposal will not work if the interpreters are not isolated. I'm not clear on what Armin thinks is shared between interpreters. The only consequential shared piece is the GIL and my proposal should render the GIL irrelevant for the most part.
I may be completely misinterpreting things here, but as I understand it, this would remove the need for the GIL, allowing even plain old threads to take advantage of multiple cores. But that's a separate issue.
If we restrict each subinterpreter to a single thread and are careful with how objects are shared (and sort out extension modules) then there will be no need for the GIL *within* each subinterpreter. However there are a couple of things that will keep the GIL around for now.
Armin writes:
I would like to see an internal interpreter design could be based on interpreters that work independent of each other, with local base types and more, similar to how JavaScript works. This would immediately open up the door again for embedding and concurrency based on message passing. CPUs won't get any faster :)
That's almost exactly what I'm aiming for. :) -eric
On Sat, 20 Jun 2015 23:01:20 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
The only consequential shared piece is the GIL and my proposal should render the GIL irrelevant for the most part.
All singleton objects and built-in types are shared, and probably a number of other things hidden in dark closets... Not to mention the memory allocator. By the way, what you're aiming to do is conceptually quite similar to Trent's PyParallel (though Trent doesn't use subinterpreters, his main work is around trying to make object sharing safe without any GIL to trivially protect the sharing), so you may want to pair with him. Of course, you may end up with a Windows-only Python interpreter :-) I'm under the impression you're underestimating the task at hand here. Or perhaps you're not and you're just willing to present it in a positive way :-) Regards Antoine.
On Sun, Jun 21, 2015 at 3:54 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sat, 20 Jun 2015 23:01:20 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
The only consequential shared piece is the GIL and my proposal should render the GIL irrelevant for the most part.
All singleton objects, built-in types are shared and probably a number of other things hidden in dark closets...
Yep. I expect we'll be able to sort those out under the assumption that 99% of the time they can be treated as immutable. We'll then have to find a way to keep the corner cases from breaking the subinterpreter isolation.
Not to mention the memory allocator.
This is a sticky part that I've been considering from almost day 1. It's not the #1 problem to solve, but it will be an important one if we want to have truly parallel subinterpreters.
By the way, what you're aiming to do is conceptually quite similar to Trent's PyParallel (thought Trent doesn't use subinterpreters, his main work is around trying to making object sharing safe without any GIL to trivially protect the sharing), so you may want to pair with him. Of course, you may end up with a Windows-only Python interpreter :-)
Right. I read through Trent's work on several occasions and have gleaned a couple lessons related to object sharing. I was planning on getting in touch with Trent in the near future.
I'm under the impression you're underestimating the task at hand here. Or perhaps you're not and you're just willing to present it in a positive way :-)
I'd like to think it's the latter. :) The main reason why I'm hopeful we can make a meaningful change for 3.6 is that I don't foresee any major changes to CPython's internals. Nearly all the necessary pieces are already there. <handwave/> I'm also intent on taking a minimal approach initially. We can build on it from there, easing restrictions that allowed us to roll out the initial implementation more quickly. All that said, I won't be surprised if it takes the entire 3.6 dev cycle to get it right. -eric
Eric Snow wrote on 20.06.2015 at 23:42:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them. [...] In some personal correspondence Nick Coghlan, he summarized my preferred approach as "the data storage separation of multiprocessing, with the low message passing overhead of threading".
For Python 3.6:
* expose subinterpreters to Python in a new stdlib module: "subinterpreters" * add a new SubinterpreterExecutor to concurrent.futures * add a queue.Queue-like type that will be used to explicitly share objects between subinterpreters [...] C Extension Modules =================
Subinterpreters already isolate extension modules (and built-in modules, including sys). PEP 384 provides some help too. However, global state in C can easily leak data between subinterpreters, breaking the desired data isolation. This is something that will need to be addressed as part of the effort.
I also had some discussions about these things with Nick before. Not sure if you really meant PEP 384 (you might have) or rather PEP 489: https://www.python.org/dev/peps/pep-0489/ I consider that one more important here, as it will eventually allow Cython modules to support subinterpreters. Unless, as you mentioned, they use global C state, but only in external C code, e.g. wrapped libraries. Cython should be able to handle most of the module internal global state on a per-interpreter basis itself, without too much user code impact. I'm totally +1 for the idea. I hope that I'll find the time (well, and money) to work on PEP 489 in Cython soon, so that I can prove it right for actual real-world code in Python 3.5. We'll then see about subinterpreter support. That's certainly the next step. Stefan
On Sun, Jun 21, 2015 at 4:54 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I also had some discussions about these things with Nick before. Not sure if you really meant PEP 384 (you might have) or rather PEP 489:
I did mean PEP 384, but PEP 489 is certainly related as I expect we'll make participation in this subinterpreter model by extension modules opt-in. Basically they will need to promise that they will work within the restricted environment.
I consider that one more important here, as it will eventually allow Cython modules to support subinterpreters. Unless, as you mentioned, they use global C state, but only in external C code, e.g. wrapped libraries. Cython should be able to handle most of the module internal global state on a per-interpreter basis itself, without too much user code impact.
Great.
I'm totally +1 for the idea. I hope that I'll find the time (well, and money) to work on PEP 489 in Cython soon, so that I can prove it right for actual real-world code in Python 3.5. We'll then see about subinterpreter support. That's certainly the next step.
That would be super. -eric
On 20/06/15 23:42, Eric Snow wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This proposal is meant to be a shot over the bow, so to speak. I plan on putting together a more complete PEP some time in the future, with content that is more refined along with references to the appropriate online resources.
Feedback appreciated! Offers to help even more so! :)
From the perspective of software design, it would be good if the CPython interpreter provided an environment instead of using global objects. It would mean that all functions in the C API would need to take the environment pointer as their first variable, which will be a major rewrite. It would also allow the "one interpreter per thread" design similar to tcl and .NET application domains.
However, from the perspective of multi-core parallel computing, I am not sure what this offers over using multiple processes. Yes, you avoid the process startup time, but on POSIX systems a fork is very fast. And certainly, forking is much more efficient than serializing Python objects. It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython. You also have to start up a subinterpreter and a thread, which is not instantaneous. So I am not sure there is a lot to gain here over calling os.fork.
A non-valid argument for this kind of design is that only code which uses threads for parallel computing is "real" multi-core code. So Python does not support multi-cores because multiprocessing or os.fork is just faking it. This is an argument that belongs in the intellectual junk yard. It stems from the abuse of threads among Windows and Java developers, and is rooted in the absence of fork on Windows and the formerly slow fork on Solaris. And thus they are only able to think in terms of threads. If threading.Thread does not scale the way they want, they think multicores are out of reach.
So the question is, how do you want to share objects between subinterpreters? And why is it better than IPC, when your idea is to isolate subinterpreters like application domains?
If you think avoiding IPC is clever, you are wrong. IPC is very fast; in fact, programs written to use MPI tend to perform and scale better than programs written to use OpenMP in parallel computing. Not only is IPC fast, but you also avoid an issue called "false sharing", which can be even more detrimental than the GIL: you have parallel code, but it seems to run in serial, even though there is no explicit serialization anywhere. And since Murphy's law is working against us, Python reference counts will be falsely shared unless we use multiple processes.
The reason IPC in multiprocessing is slow is the calls to pickle, not the IPC in itself. A pipe or a Unix socket (named pipe on Windows) has the overhead of a memcpy in the kernel plus some tiny constant overhead. And if you need two processes to share memory, there is something called shared memory. Thus, we can send data between processes just as fast as between subinterpreters.
All in all, I think we are better off finding a better way to share Python objects between processes.
P.S. Another thing to note is that with sub-interpreters, you can forget about using ctypes or anything else that uses the simplified GIL API (e.g. certain Cython generated extensions). Sturla
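A minimal sketch of the "shared memory instead of pickling" point above, assuming POSIX: an anonymous mmap created before fork() is shared, not copied, between parent and child. Real code would add synchronization (a semaphore or lock); this is only meant to show that no pickling is involved:

    import mmap
    import os
    import struct

    buf = mmap.mmap(-1, 4096)                 # anonymous, shared mapping

    pid = os.fork()
    if pid == 0:                              # child writes straight into the page
        struct.pack_into("d", buf, 0, 3.14159)
        os._exit(0)

    os.waitpid(pid, 0)                        # parent reads it back, no serialization
    print(struct.unpack_from("d", buf, 0)[0])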
On Sun, 21 Jun 2015 13:41:30 +0200 Sturla Molden <sturla.molden@gmail.com> wrote:
From the perspective of software design, it would be good it the CPython interpreter provided an environment instead of using global objects. It would mean that all functions in the C API would need to take the environment pointer as their first variable, which will be a major rewrite. It would also allow the "one interpreter per thread" design similar to tcl and .NET application domains.
From the point of view of API compatibility, it's unfortunately a no-no.
The reason IPC in multiprocessing is slow is due to calling pickle, it is not the IPC in itself.
No need to be pedantic :-) The "C" means communication, and pickling objects is part of the communication between Python processes.
All in all, I think we are better off finding a better way to share Python objects between processes.
Sure. This is however a complex and experimental topic (how to share a graph of garbage-collected objects between independant processes), with no guarantees of showing any results at the end.
P.S. Another thing to note is that with sub-interpreters, you can forget about using ctypes or anything else that uses the simplified GIL API (e.g. certain Cython generated extensions).
Indeed, the PyGILState API is still not subinterpreter-compatible. There's a proposal on the tracker, IIRC, but the interested parties never made any progress on it. Regards Antoine.
On Sun, 21 Jun 2015 13:52:36 +0200 Antoine Pitrou <solipsis@pitrou.net> wrote:
P.S. Another thing to note is that with sub-interpreters, you can forget about using ctypes or anything else that uses the simplified GIL API (e.g. certain Cython generated extensions).
Indeed, the PyGILState API is still not subinterpreter-compatible. There's a proposal on the tracker, IIRC, but the interested parties never made any progress on it.
For reference: https://bugs.python.org/issue10915 https://bugs.python.org/issue15751 Regards Antoine.
Antoine Pitrou <solipsis@pitrou.net> wrote:
The reason IPC in multiprocessing is slow is due to calling pickle, it is not the IPC in itself.
No need to be pedantic :-) The "C" means communication, and pickling objects is part of the communication between Python processes.
Yes, currently it is. But it does not mean that it has to be. Clearly it is easier to avoid with multiple interpreters in the same process. But it does not mean it is unsolvable. Sturla
On 21 June 2015 at 21:41, Sturla Molden <sturla.molden@gmail.com> wrote:
On 20/06/15 23:42, Eric Snow wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This proposal is meant to be a shot over the bow, so to speak. I plan on putting together a more complete PEP some time in the future, with content that is more refined along with references to the appropriate online resources.
Feedback appreciated! Offers to help even more so! :)
From the perspective of software design, it would be good it the CPython interpreter provided an environment instead of using global objects. It would mean that all functions in the C API would need to take the environment pointer as their first variable, which will be a major rewrite. It would also allow the "one interpreter per thread" design similar to tcl and .NET application domains.
However, from the perspective of multi-core parallel computing, I am not sure what this offers over using multiple processes.
Yes, you avoid the process startup time, but on POSIX systems a fork is very fast. An certainly, forking is much more efficient than serializing Python objects. It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython. You also have to start up a subinterpreter and a thread, which is not instantaneous. So I am not sure there is a lot to gain here over calling os.fork.
Please give Eric and me the courtesy of assuming we know how CPython works. This article, which is an update of a Python 3 Q&A answer I wrote some time ago, goes into more detail on the background of this proposed investigation: http://python-notes.curiousefficiency.org/en/latest/python3/multicore_python...
A non-valid argument for this kind of design is that only code which uses threads for parallel computing is "real" multi-core code. So Python does not support multi-cores because multiprocessing or os.fork is just faking it. This is an argument that belongs in the intellectual junk yard. It stems from the abuse of threads among Windows and Java developers, and is rooted in the absence of fork on Windows and the formerly slow fork on Solaris. And thus they are only able to think in terms of threads. If threading.Thread does not scale the way they want, they think multicores are out of reach.
Sturla, expressing out and out contempt for entire communities of capable, competent developers (both the creators of Windows and Java, and the users of those platforms) has no place on the core Python mailing lists. Please refrain from casually insulting entire groups of people merely because you don't approve of their technical choices.
The reason IPC in multiprocessing is slow is due to calling pickle, it is not the IPC in itself. A pipe or an Unix socket (named pipe on Windows) have the overhead of a memcpy in the kernel, which is equal to a memcpy plus some tiny constant overhead. And if you need two processes to share memory, there is something called shared memory. Thus, we can send data between processes just as fast as between subinterpreters.
Avoiding object serialisation is indeed the main objective. With subinterpreters, we have a lot more options for that than we do with any form of IPC, including shared references to immutable objects, and the PEP 3118 buffer API.
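For a concrete, if simplified, picture of the PEP 3118 angle: a memoryview exposes an existing buffer without copying it, so only the small view object is new and reads and writes go straight through to the underlying memory:

    data = bytearray(b"x" * 10**6)
    view = memoryview(data)
    part = view[1000:2000]            # zero-copy slice of the same buffer
    part[0:3] = b"abc"                # writes through to `data`
    print(bytes(data[1000:1003]))     # b'abc'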
All in all, I think we are better off finding a better way to share Python objects between processes.
This is not an either/or question, as other folks remain free to work on improving multiprocessing's IPC efficiency if they want to. We don't seem to have folks clamouring at the door to work on that, though.
P.S. Another thing to note is that with sub-interpreters, you can forget about using ctypes or anything else that uses the simplified GIL API (e.g. certain Cython generated extensions).
Those aren't fundamental conceptual limitations, they're incidental limitations of the current design and implementation of the simplified GIL state API. One of the benefits of introducing a Python level API for subinterpreters is that it makes it easier to start testing, and hence fixing, some of those limitations (I actually just suggested to Eric off list that adding subinterpreter controls to _testcapi might be a good place to start, as that's beneficial regardless of what, if anything, ends up happening from a public API perspective) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan <ncoghlan@gmail.com> wrote:
Sturla, expressing out and out contempt for entire communities of capable, competent developers (both the creators of Windows and Java, and the users of those platforms) has no place on the core Python mailing lists. Please refrain from casually insulting entire groups of people merely because you don't approve of their technical choices.
I am not sure what you mean. Using threads on Windows and Java comes from a necessity, not because developers are incompetent. Windows does not provide a fork and processes are heavy-weight, hence multi-threading is the obvious choice.
Avoiding object serialisation is indeed the main objective.
Good.
With subinterpreters, we have a lot more options for that than we do with any form of IPC, including shared references to immutable objects, and the PEP 3118 buffer API.
Perhaps. One could do this with shared memory as well, but a complicating factor is that the base address must be the same (or corrected for). But one could probably do low-level magic with memory mapping to work around this. Particularly on 64-bit it is not really difficult to make sure a page is mapped to the same address in two processes. It is certainly easier to achieve within a process. But if the plan for Erlang-style "share nothing" threads is to pickle and memcpy objects, there is little or nothing to gain over using multiprocessing. Sturla
On 22 Jun 2015 01:45, "Sturla Molden" <sturla.molden@gmail.com> wrote:
Nick Coghlan <ncoghlan@gmail.com> wrote:
Sturla, expressing out and out contempt for entire communities of capable, competent developers (both the creators of Windows and Java, and the users of those platforms) has no place on the core Python mailing lists. Please refrain from casually insulting entire groups of people merely because you don't approve of their technical choices.
I am not sure what you mean. Using threads on Windows and Java comes from a necessity, not because developers are incompetent.
The folks *designing* Windows and Java are also people, and as creators of development platforms go, it's hard to dispute their success in helping folks solve real problems. We should be mindful of that when drawing lessons from their experience.
Windows does not provide a fork and processes are heavy-weight, hence multi-threading is the obvious choice.
Windows actually has superior native parallel execution APIs to Linux in some respects, but open source programming languages tend not to support them, presumably due to a combination of Microsoft's longstanding hostile perspective on open source licencing (which seems to finally be moderating with their new CEO), and the even longer standing POSIX mindset that "fork and file descriptors ought to be enough for anyone" (even if the workload in the child processes is wildly different from that in the main process). asyncio addresses that problem for Python in regards to IOCP vs select (et al), and the configurable subprocess creation options addressed it for multiprocessing, but I'm not aware of any efforts to get greenlets to use fibres when they're available.
With subinterpreters, we have a lot more options for that than we do with any form of IPC, including shared references to immutable objects, and the PEP 3118 buffer API.
Perhaps. One could do this with shared memory as well, but a complicating factor is that the base address must be the same (or corrected for). But one could probably do low-level magic with memory mapping to work around this. Particularly on 64-bit it is not really difficult to make sure a page is mapped to the same address in two processes.
It is certainly easier to achieve within a process. But if the plan for Erlang-style "share nothing" threads is to pickle and memcpy objects, there is little or nothing to gain over using multiprocessing.
The Python level *semantics* should be as if the objects were being copied (for ease of use), but the *implementation* should try to avoid actually doing that (for speed of execution). Assuming that can be done effectively *within* a process between subinterpreters, then the possibility arises of figuring out how to use shared memory to federate that approach across multiple processes. That could then provide a significant performance improvement for multiprocessing. But since we have the option of tackling the simpler problem of subinterpreters *first*, it makes sense to do that before diving into the cross-platform arcana involved in similarly improving the efficiency of multiprocessing's IPC. Regards, Nick.
On Mon, 22 Jun 2015 09:31:06 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Windows actually has superior native parallel execution APIs to Linux in some respects, but open source programming languages tend not to support them, presumably due to a combination of Microsoft's longstanding hostile perspective on open source licencing (which seems to finally be moderating with their new CEO), and the even longer standing POSIX mindset that "fork and file descriptors ought to be enough for anyone" (even if the workload in the child processes is wildly different from that in the main process).
Or perhaps the fact that those superior APIs are a PITA. select() and friends may be crude performance-wise (though, strangely, we don't see providers migrating massively to Windows in order to improve I/O throughput), but they are simple to use. Regards Antoine.
On 22 Jun 2015 09:40, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
On Mon, 22 Jun 2015 09:31:06 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Windows actually has superior native parallel execution APIs to Linux in some respects, but open source programming languages tend not to support them, presumably due to a combination of Microsoft's longstanding hostile perspective on open source licencing (which seems to finally be moderating with their new CEO), and the even longer standing POSIX mindset that "fork and file descriptors ought to be enough for anyone" (even if the workload in the child processes is wildly different from that in the main process).
Or perhaps the fact that those superiors APIs are a PITA. select() and friends may be crude performance-wise (though, strangely, we don't see providers migrating massively to Windows in order to improve I/O throughput), but they are simple to use.
Aye, there's a reason using a smart IDE like Visual Studio, IntelliJ or Eclipse is pretty much essential for both Windows and Java programming. These platforms fall squarely on the "tools maven" side of Oliver Steele's "IDE Divide": http://blog.osteele.com/posts/2004/11/ides/ The opportunity I think we have with Python is to put a cross platform text editor friendly abstraction layer across these kinds of underlying capabilities :) Cheers, Nick.
On Mon, 22 Jun 2015 09:47:29 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Or perhaps the fact that those superiors APIs are a PITA. select() and friends may be crude performance-wise (though, strangely, we don't see providers migrating massively to Windows in order to improve I/O throughput), but they are simple to use.
Aye, there's a reason using a smart IDE like Visual Studio, IntelliJ or Eclipse is pretty much essential for both Windows and Java programming. These platforms fall squarely on the "tools maven" side of Oliver Steele's "IDE Divide": http://blog.osteele.com/posts/2004/11/ides/
It's not about using an IDE, it's the more complex and delicate control flow that asynchronous IO (IOCP / Overlapped) imposes compared to non-blocking IO (e.g. select()). Not to mention that lifetime issues are hard to handle safely and generically before Vista (that is, before CancelIoEx(): https://msdn.microsoft.com/en-us/library/windows/desktop/aa363792%28v=vs.85%29.aspx -- "The CancelIoEx function allows you to cancel requests in threads other than the calling thread. The CancelIo function only cancels requests in the same thread that called the CancelIo function") Regards Antoine.
On 22/06/15 01:39, Antoine Pitrou wrote:
Or perhaps the fact that those superior APIs are a PITA.
Not all of them, no. HeapAlloc is a good example. Very easy to use, and the "one heap per thread" design often gives excellent performance compared to a single global heap. But on Linux we only have malloc et al., allocating from the global heap. How many Linux programmers have even considered using multiple heaps in combination with multi-threading? I can assure you it is not common. A good idea is to look at the Python C API. We have PyMem_Malloc, but nothing that compares to Windows' HeapAlloc. Not only does HeapAlloc remove the contention for the global heap, it can also help with serialization: instead of serializing an object by traversing all references in the object tree, we just serialize the heap from which it was allocated. And as for garbage collection, why not deallocate the whole heap in one blow? Is there any reason to pair each malloc with free if one could just zap the whole heap? That is what HeapDestroy does. On Linux we would typically homebrew a memory pool to achieve the same thing. But a memory pool needs to traverse a chain of pointers and call free() multiple times, each time with contention for the spinlock protecting the global heap. And when allocating from a memory pool we also have contention for the global heap. It cannot in any way compare to the performance of the Win API HeapCreate/HeapDestroy and HeapAlloc/HeapFree. Sturla
On Jun 21, 2015, at 06:09, Nick Coghlan <ncoghlan@gmail.com> wrote:
Avoiding object serialisation is indeed the main objective. With subinterpreters, we have a lot more options for that than we do with any form of IPC, including shared references to immutable objects, and the PEP 3118 buffer API.
It seems like you could provide a way to efficiently copy and share deeper objects than integers and buffers without sharing everything, assuming the user code knows, at the time those objects are created, that they will be copied or shared. Basically, you allocate the objects into a separate arena (along with allocating their refcounts on a separate page, as already mentioned). You can't add a reference to an outside object in an arena-allocated object, although you can copy that outside object into the arena. And then you just pass or clone (possibly by using CoW memory-mapping calls, only falling back to memcpy on platforms that can't do that) entire arenas instead of individual objects (so you don't need the fictitious memdeepcpy function that someone ridiculed earlier in this thread, but you get 90% of the benefits of having one). This has the same basic advantages of forking, but it's doable efficiently on Windows, and doable less efficiently (but still better than spawn and pass) on even weird embedded platforms, and it forces code to be explicit about what gets shared and copied without forcing it to work through less-natural queue-like APIs. Also, it seems like you could fake this entire arena API on top of pickle/copy for a first implementation, then just replace the underlying implementation separately.
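A hedged sketch of what "faking the arena API on top of pickle/copy" could look like; the Arena class and its method names are purely illustrative, and a real implementation would allocate into a dedicated memory region and transfer it wholesale instead:

    import copy
    import pickle

    class Arena:
        """Collects objects destined for another interpreter or process."""
        def __init__(self):
            self._objects = {}

        def put(self, name, obj):
            # "Copying into the arena": a deep copy, so the arena keeps no
            # references to objects owned by the creating interpreter.
            self._objects[name] = copy.deepcopy(obj)

        def get(self, name):
            return self._objects[name]

        def clone(self):
            # Stand-in for handing the whole arena over in one shot
            # (a real version might use CoW mappings or a single memcpy).
            return pickle.loads(pickle.dumps(self._objects))

    arena = Arena()
    arena.put("config", {"workers": 4, "chunk": 65536})
    snapshot = arena.clone()   # what you would hand to the other side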
On Sun, Jun 21, 2015 at 3:24 PM, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Jun 21, 2015, at 06:09, Nick Coghlan <ncoghlan@gmail.com> wrote:
Avoiding object serialisation is indeed the main objective. With subinterpreters, we have a lot more options for that than we do with any form of IPC, including shared references to immutable objects, and the PEP 3118 buffer API.
It seems like you could provide a way to efficiently copy and share deeper objects than integers and buffers without sharing everything, assuming the user code knows, at the time those objects are created, that they will be copied or shared. Basically, you allocate the objects into a separate arena (along with allocating their refcounts on a separate page, as already mentioned). You can't add a reference to an outside object in an arena-allocated object, although you can copy that outside object into the arena. And then you just pass or clone (possibly by using CoW memory-mapping calls, only falling back to memcpy on platforms that can't do that) entire arenas instead of individual objects (so you don't need the fictitious memdeepcpy function that someone ridiculed earlier in this thread, but you get 90% of the benefits of having one).
Yeah, I've been thinking of something along these lines. However, it's not the #1 issue to address so I haven't gotten too far into it. -eric
This has the same basic advantages of forking, but it's doable efficiently on Windows, and doable less efficiently (but still better than spawn and pass) on even weird embedded platforms, and it forces code to be explicit about what gets shared and copied without forcing it to work through less-natural queue-like APIs.
Also, it seems like you could fake this entire arena API on top of pickle/copy for a first implementation, then just replace the underlying implementation separately.
On Sun, Jun 21, 2015 at 9:41 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
However, from the perspective of multi-core parallel computing, I am not sure what this offers over using multiple processes.
Yes, you avoid the process startup time, but on POSIX systems a fork is very fast. An certainly, forking is much more efficient than serializing Python objects. It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython. You also have to start up a subinterpreter and a thread, which is not instantaneous. So I am not sure there is a lot to gain here over calling os.fork.
That's all very well for sending stuff *to* a subprocess. If you fork for a single job, do the job, and have the subprocess send the result directly back to the origin (eg over its socket), then terminate, then sure, you don't need a lot of IPC. But for models where there's ongoing work, maybe interacting with other subinterpreters periodically, there could be a lot of benefit. It's very easy to slip into a CGI style of mentality where requests are entirely fungible and independent, and all you're doing is parallelization, but not everything fits into that model :) I run a MUD server, for instance, where currently every connection gets its own thread; if I wanted to make use of multiple CPU cores, I would not want to have the connections handled by separate processes, because they are constantly interacting with each other, so IPC would get expensive. ChrisA
On Sun, Jun 21, 2015 at 5:41 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
From the perspective of software design, it would be good it the CPython interpreter provided an environment instead of using global objects. It would mean that all functions in the C API would need to take the environment pointer as their first variable, which will be a major rewrite. It would also allow the "one interpreter per thread" design similar to tcl and .NET application domains.
While perhaps a worthy goal, I don't know that it fits in well with my goals. I'm aiming for an improved multi-core story with a minimum of change in the interpreter.
However, from the perspective of multi-core parallel computing, I am not sure what this offers over using multiple processes.
Yes, you avoid the process startup time, but on POSIX systems a fork is very fast. An certainly, forking is much more efficient than serializing Python objects.
You still need the mechanism to safely and efficiently share (at least some) objects between interpreters after forking. I expect this will be simpler within the same process.
It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython.
We cannot leave Windows out in the cold.
You also have to start up a subinterpreter and a thread, which is not instantaneous. So I am not sure there is a lot to gain here over calling os.fork.
One key difference is that with a subinterpreter you are basically starting with a clean slate. The isolation between interpreters extends to the initial state. That level of isolation is a desirable feature because you can more clearly reason about the state of the running tasks.
A non-valid argument for this kind of design is that only code which uses threads for parallel computing is "real" multi-core code. So Python does not support multi-cores because multiprocessing or os.fork is just faking it. This is an argument that belongs in the intellectual junk yard. It stems from the abuse of threads among Windows and Java developers, and is rooted in the absence of fork on Windows and the formerly slow fork on Solaris. And thus they are only able to think in terms of threads. If threading.Thread does not scale the way they want, they think multicores are out of reach.
Well, perception is 9/10ths of the law. :) If the multi-core problem is already solved in Python, then why does it fail in the court of public opinion? The perception that Python lacks a good multi-core story is real, leads organizations away from Python, and will not improve without concrete changes. Contrast that with Go or Rust or many other languages that make it simple to leverage multiple cores (even if most people never need to).
So the question is, how do you want to share objects between subinterpreters? And why is it better than IPC, when your idea is to isolate subinterpreters like application domains?
In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought (or second-class citizen), as well as other execution environments (e.g. embedded) where we may not be able to fork.
If you think avoiding IPC is clever, you are wrong. IPC is very fast, in fact programs written to use MPI tends to perform and scale better than programs written to use OpenMP in parallel computing.
I'd love to learn more about that. I'm sure there are some great lessons on efficiently and safely sharing data between isolated execution environments. That said, how does IPC compare to passing objects around within the same process?
Not only is IPC fast, but you also avoid an issue called "false sharing", which can be even more detrimental than the GIL: You have parallel code, but it seems to run in serial, even though there is no explicit serialization anywhere. And by since Murphy's law is working against us, Python reference counts will be false shared unless we use multiple processes.
Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
The reason IPC in multiprocessing is slow is due to calling pickle, it is not the IPC in itself. A pipe or an Unix socket (named pipe on Windows) have the overhead of a memcpy in the kernel, which is equal to a memcpy plus some tiny constant overhead. And if you need two processes to share memory, there is something called shared memory. Thus, we can send data between processes just as fast as between subinterpreters.
IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
All in all, I think we are better off finding a better way to share Python objects between processes.
I expect that whatever solution we would find for subinterpreters would have a lot in common with the same thing for processes.
P.S. Another thing to note is that with sub-interpreters, you can forget about using ctypes or anything else that uses the simplified GIL API (e.g. certain Cython generated extensions).
On the one hand there are some rough edges with subinterpreters that need to be fixed. On the other hand, we will have to restrict the subinterpreter model (at least initially) in ways that would likely preclude operation of existing extension modules. -eric
On 24/06/15 07:01, Eric Snow wrote:
In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought
Windows is really the problem. The absence of fork() is especially hurtful for an interpreted language like Python, in my opinion.
If you think avoiding IPC is clever, you are wrong. IPC is very fast, in fact programs written to use MPI tends to perform and scale better than programs written to use OpenMP in parallel computing.
I'd love to learn more about that. I'm sure there are some great lessons on efficiently and safely sharing data between isolated execution environments. That said, how does IPC compare to passing objects around within the same process?
There are two major competing standards for parallel computing in science and engineering: OpenMP and MPI. OpenMP is based on a shared memory model. MPI is based on a distributed memory model and uses message passing (hence its name).
The common implementations of OpenMP (GNU, Intel, Microsoft) are all implemented with threads. There are also OpenMP implementations for clusters (e.g. Intel), but from the programmer's perspective OpenMP is a shared memory model. The common implementations of MPI (MPICH, OpenMPI, Microsoft MPI) use processes instead of threads. Processes can run on the same computer or on different computers (aka "clusters"). On localhost shared memory is commonly used for message passing, on clusters MPI implementations will use networking protocols.
The take-home message is that OpenMP is conceptually easier to use, but programs written to use MPI tend to be faster and scale better. This is even true when using a single computer, e.g. a laptop with one multicore CPU. Here is the tl;dr explanation:
As for ease of programming, it is easier to create a deadlock or livelock with MPI than OpenMP, even though programs written to use MPI tend to need fewer synchronization points. There is also less boilerplate code to type when using OpenMP, because we do not have to code object serialization, message passing, and object deserialization.
For performance, programs written to use MPI seem to have a larger overhead because they require object serialization and message passing, whereas OpenMP threads can just share the same objects. The reality is actually the opposite, and is due to the internals of modern CPUs, particularly hierarchical memory, branch prediction and long pipelines.
Because of hierarchical memory, the caches used by CPUs and CPU cores must be kept in sync. Thus when using OpenMP (threads) there will be a lot of synchronization going on that the programmer does not see, but which the hardware will do behind the scenes. There will also be a lot of data passing between various cache levels on the CPU and RAM. If a core writes to a piece of memory it keeps in a cache line, a cascade of data traffic and synchronization can be triggered across all CPUs and cores. Not only will this stop the CPUs and prompt them to synchronize cache with RAM, it also invalidates their branch prediction and they must flush their pipelines and throw away work they have already done. The end result is a program that does not scale or perform very well, even though it does not seem to have any explicit synchronization points that could explain this. The term "false sharing" is often used to describe this problem.
Programs written to use MPI are the opposite. There, every instance of synchronization and message passing is visible. When a CPU core writes to memory kept in a cache line, it will never trigger synchronization and data traffic across all the CPUs. The scalability is as the program predicts. And even though memory and objects are not shared, there is actually much less data traffic going on.
Which to use? Most people find it easier to use OpenMP, and it does not require a big runtime environment to be installed. But programs using MPI tend to be faster and more scalable. If you need to ensure scalability on multicores, multiple processes are better than multiple threads. The scalability of MPI also applies to Python's multiprocessing. It is the isolated virtual memory of each process that allows the cores to run at full speed.
Another thing to note is that Windows is not a second-class citizen when using MPI. The MPI runtime (usually an executable called mpirun or mpiexec) starts and manages a group of processes. It does not matter if they are started by fork() or CreateProcess().
Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference count, it does not hurt so much as it otherwise would. The GIL hides most of the scalability impact by serializing flow of execution.
IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
multiprocessing does use IPC. Otherwise the processes could not communicate. One example is multiprocessing.Queue, which uses a pipe and a semaphore. Sturla
Hi! On Wed, Jun 24, 2015 at 05:26:59PM +0200, Sturla Molden <sturla.molden@gmail.com> wrote:
The absence of fork() is especially hurtful for an interpreted language like Python, in my opinion.
Sturla
I don't think fork is of major help for interpreted languages. When most of your "code" is actually data, most of your data pages are prone to copy-on-write slowdowns.
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On Wed, Jun 24, 2015 at 8:27 AM Sturla Molden <sturla.molden@gmail.com> wrote:
On 24/06/15 07:01, Eric Snow wrote:
In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought
Windows is really the problem. The absence of fork() is especially hurtful for an interpreted language like Python, in my opinion.
You cannot assume that fork() is safe on any OS as a general solution for anything. This isn't a Windows-specific problem; it simply cannot be relied upon in a general purpose library at all. It is incompatible with threads.

The ways fork() can be used safely are in top level application decisions: there must be a guarantee of no threads running before all forking is done. (Thus the impossibility of relying on it as a mechanism to do anything useful in a generic library - you are a library, you don't know what the whole application is doing or when you were called as part of it.)

A concurrency model that assumes that it is fine to fork() and let child processes continue to execute is not usable by everyone. (ie: multiprocessing until http://bugs.python.org/issue8713 was implemented).

-gps
If you think avoiding IPC is clever, you are wrong. IPC is very fast; in fact, programs written to use MPI tend to perform and scale better than programs written to use OpenMP in parallel computing.
I'd love to learn more about that. I'm sure there are some great lessons on efficiently and safely sharing data between isolated execution environments. That said, how does IPC compare to passing objects around within the same process?
There are two major competing standards for parallel computing in science and engineering: OpenMP and MPI. OpenMP is based on a shared memory model. MPI is based on a distributed memory model and uses message passing (hence its name).
The common implementations of OpenMP (GNU, Intel, Microsoft) are all implemented with threads. There are also OpenMP implementations for clusters (e.g. Intel), but from the programmer's perspective OpenMP is a shared memory model.
The common implementations of MPI (MPICH, OpenMPI, Microsoft MPI) use processes instead of threads. Processes can run on the same computer or on different computers (aka "clusters"). On localhost shared memory is commonly used for message passing, on clusters MPI implementations will use networking protocols.
The take-home message is that OpenMP is conceptually easier to use, but programs written to use MPI tend to be faster and scale better. This is even true when using a single computer, e.g. a laptop with one multicore CPU.
Here is the tl;dr explanation:
As for ease of programming, it is easier to create a deadlock or livelock with MPI than OpenMP, even though programs written to use MPI tend to need fewer synchronization points. There is also less boilerplate code to type when using OpenMP, because we do not have to code object serialization, message passing, and object deserialization.
For performance, programs written to use MPI seem to have a larger overhead because they require object serialization and message passing, whereas OpenMP threads can just share the same objects. The reality is actually the opposite, and is due to the internals of modern CPUs, particularly hierarchical memory, branch prediction and long pipelines.
Because of hierarchical memory, the caches used by CPUs and CPU cores must be kept in sync. Thus when using OpenMP (threads) there will be a lot of synchronization going on that the programmer does not see, but which the hardware will do behind the scenes. There will also be a lot of data passing between the various cache levels on the CPU and RAM. If a core writes to a piece of memory it keeps in a cache line, a cascade of data traffic and synchronization can be triggered across all CPUs and cores. Not only will this stall the CPUs and prompt them to synchronize cache with RAM, it also invalidates their branch prediction and they must flush their pipelines and throw away work they have already done. The end result is a program that does not scale or perform very well, even though it does not seem to have any explicit synchronization points that could explain this. The term "false sharing" is often used to describe this problem.
Programs written to use MPI are the opposite. There every instance of synchronization and message passing is visible. When a CPU core writes to memory kept in a cache line, it will never trigger synchronization and data traffic across all the CPUs. The scalability is as the program predicts. And even though memory and objects are not shared, there is actually much less data traffic going on.
Which to use? Most people find it easier to use OpenMP, and it does not require a big runtime environment to be installed. But programs using MPI tend to be faster and more scalable. If you need to ensure scalability on multicores, multiple processes are better than multiple threads. The scalability of MPI also applies to Python's multiprocessing. It is the isolated virtual memory of each process that allows the cores to run at full speed.
Another thing to note is that Windows is not a second-class citizen when using MPI. The MPI runtime (usually an executable called mpirun or mpiexec) starts and manages a group of processes. It does not matter if they are started by fork() or CreateProcess().
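For readers who have not seen the MPI style from Python, a minimal sketch using the mpi4py bindings (my choice for illustration; the message above does not name a particular Python MPI package, though mpi4py comes up later in the thread). It assumes an MPI runtime such as MPICH or Open MPI is installed, and is launched with mpiexec rather than run directly:

    # run with:  mpiexec -n 2 python demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        payload = {"values": list(range(10))}
        # the object is pickled and passed as an explicit message
        comm.send(payload, dest=1, tag=11)
    elif rank == 1:
        data = comm.recv(source=0, tag=11)
        print("rank 1 received:", data)

Every piece of communication is visible in the code, which is exactly the property being described.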
Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference count, it does not hurt so much as it otherwise would. The GIL hides most of the scalability impact by serializing flow of execution.
IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
multiprocessing does use IPC. Otherwise the processes could not communicate. One example is multiprocessing.Queue, which uses a pipe and a semaphore.
Sturla
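To make that concrete, a minimal stdlib-only sketch: the parent and the worker are separate processes, and the Queue moves a pickled object between them over a pipe guarded by a semaphore.

    import multiprocessing as mp
    import os

    def worker(q):
        # runs in a separate process; the tuple is pickled and sent
        # back to the parent through the queue's underlying pipe
        q.put(("hello from pid", os.getpid()))

    if __name__ == "__main__":
        q = mp.Queue()
        p = mp.Process(target=worker, args=(q,))
        p.start()
        print("parent pid", os.getpid(), "received:", q.get())
        p.join()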
On Wed, Jun 24, 2015 at 12:31 PM, Gregory P. Smith <greg@krypto.org> wrote:
You cannot assume that fork() is safe on any OS as a general solution for anything. This isn't a Windows specific problem, It simply cannot be relied upon in a general purpose library at all. It is incompatible with threads.
The ways fork() can be used safely are in top level application decisions: There must be a guarantee of no threads running before all forking is done. (thus the impossibility of relying on it as a mechanism to do anything useful in a generic library - you are a library, you don't know what the whole application is doing or when you were called as part of it)
A concurrency model that assumes that it is fine to fork() and let child processes continue to execute is not usable by everyone. (ie: multiprocessing until http://bugs.python.org/issue8713 was implemented).
Another way of looking at it is that a concurrency model that assumes it is fine to thread and let child threads continue to execute is not usable by everyone. IMO the lesson here is don't start threads *or* fork processes behind the scenes without explicitly allowing your callers to override you, so that the top level app can orchestrate everything appropriately. This is especially important in Python, where forking is one of the best ways of getting single-machine multicore processing.

Interestingly, the worker threads in the OP can probably be made fork-safe. Not sure that's especially useful, but I can imagine.

-- Devin
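One way to read that advice in code form is a library that accepts an executor from its caller instead of spawning threads or forking on its own. A minimal sketch; the function name and signature here are made up for illustration, not anything from the proposal:

    from concurrent.futures import ProcessPoolExecutor

    def process_items(items, handler, executor=None):
        """Run handler over items; the *caller* decides the concurrency model."""
        if executor is None:
            # serial fallback: no hidden threads, no hidden fork()
            return [handler(x) for x in items]
        return list(executor.map(handler, items))

    def square(x):
        return x * x

    if __name__ == "__main__":
        # the top-level application orchestrates the process model
        with ProcessPoolExecutor() as ex:
            print(process_items(range(5), square, executor=ex))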
On Wed, Jun 24, 2015 at 9:26 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 24/06/15 07:01, Eric Snow wrote: There are two major competing standards for parallel computing in science and engineering: OpenMP and MPI. OpenMP is based on a shared memory model. MPI is based on a distributed memory model and use message passing (hence its name). [snip]
Thanks for the great explanation!
Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference count, it does not hurt so much as it otherwise would. The GIL hides most of the scalability impact by serializing flow of execution.
It does hurt in COW situations, e.g. forking. My expectation is that we'll at least need to take a serious look into the matter in the short term (i.e. Python 3.6).
IPC sounds great, but how well does it interact with Python's memory management/allocator? I haven't looked closely but I expect that multiprocessing does not use IPC anywhere.
multiprocessing does use IPC. Otherwise the processes could not communicate. One example is multiprocessing.Queue, which uses a pipe and a semaphore.
Right. I don't know quite what I was thinking. :) -eric
On 25/06/15 00:19, Eric Snow wrote:
Solving reference counts in this situation is a separate issue that will likely need to be resolved, regardless of which machinery we use to isolate task execution.
As long as we have a GIL, and we need the GIL to update a reference count, it does not hurt so much as it otherwise would. The GIL hides most of the scalability impact by serializing flow of execution.
It does hurt in COW situations, e.g. forking. My expectation is that we'll at least need to take a serious look into the matter in the short term (i.e. Python 3.6).
Yes. It hurts performance after forking as reference counting will trigger a lot of page copies. Keeping reference counts in separate pages and replacing the field in the PyObject struct would reduce this problem by a factor of up to 512 (64-bit) or 1024 (32-bit). It does not hurt performance with multi-threading, as Python threads are serialized by the GIL. But if the GIL were removed it would result in a lot of false sharing. That is a major reason we need a tracing garbage collector instead of reference counting if we are ever to remove the GIL. Sturla
On Wed, Jun 24, 2015 at 05:26:59PM +0200, Sturla Molden wrote:
On 24/06/15 07:01, Eric Snow wrote:
In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought
Windows is really the problem. The absence of fork() is especially hurtful for an interpreted language like Python, in my opinion.
UNIX is really the problem. The absence of tiered interrupt request levels, memory descriptor lists, I/O request packets (IRPs), thread-agnostic I/O, non-paged kernel memory, non-overcommitted memory management, universal page/buffer cache, better device driver architecture and, most importantly, a kernel architected around waitable events, not processes, is harmful for efficiently solving contemporary problems optimally with modern hardware. VMS got it right from day one. UNIX did not. :-) Trent.
On 26 Jun 2015 05:37, "Trent Nelson" <trent@snakebite.org> wrote:
On Wed, Jun 24, 2015 at 05:26:59PM +0200, Sturla Molden wrote:
On 24/06/15 07:01, Eric Snow wrote:
In return, my question is, what is the level of effort to get fork+IPC to do what we want vs. subinterpreters? Note that we need to accommodate Windows as more than an afterthought
Windows is really the problem. The absence of fork() is especially
hurtful
for an interpreted language like Python, in my opinion.
UNIX is really the problem. The absence of tiered interrupt request levels, memory descriptor lists, I/O request packets (IRPs), thread-agnostic I/O, non-paged kernel memory, non-overcommitted memory management, universal page/buffer cache, better device driver architecture and, most importantly, a kernel architected around waitable events, not processes, is harmful for efficiently solving contemporary problems optimally with modern hardware.
Platforms are what they are :) As a cross-platform, but still platform dependent, language runtime, we're actually in a pretty good position to help foster some productive competition between Windows and the *nix platforms. However, we'll only be able to achieve that if we approach their wildly divergent execution and development models with respect for their demonstrated success and seek to learn from their respective strengths, rather than dismissing them over their respective weaknesses :) Cheers, Nick.
On 24/06/15 07:01, Eric Snow wrote:
Well, perception is 9/10ths of the law. :) If the multi-core problem is already solved in Python then why does it fail in the court of public opinion. The perception that Python lacks a good multi-core story is real, leads organizations away from Python, and will not improve without concrete changes.
I think it is a combination of FUD and the lack of fork() on Windows. There is a lot of utterly wrong information about CPython and its GIL.

The reality is that Python is used on even the largest supercomputers. The scalability problem that is seen on those systems is not the GIL, but the module import. If we have 1000 CPython processes importing modules like NumPy simultaneously, they will do a "denial of service attack" on the file system. This happens when the module importer generates a huge number of failed open() calls while trying to locate the module files. There is even a paper describing how to avoid this on an IBM Blue Gene: "As an example, on Blue Gene P just starting up Python and importing NumPy and GPAW with 32768 MPI tasks can take 45 minutes!" http://www.cs.uoregon.edu/research/paracomp/papers/iccs11/iccs_paper_final.p...

And while CPython is being used for massive parallel computing to e.g. model the global climate system, there is this FUD that CPython does not even scale up on a laptop with a single multicore CPU. I don't know where it is coming from, but it is more FUD than truth.

The main answers to FUD about the GIL and Python in scientific computing are these:

1. Python in itself generates a 200x to 2000x performance hit compared to C or Fortran. Do not write compute kernels in Python, unless you can compile with Cython or Numba. If you have need for speed, start by moving the performance critical parts to Cython instead of optimizing for a few CPU cores.

2. If you can release the GIL, e.g. in Cython code, Python threads scale like any other native OS thread. They are real threads, not fake threads in the interpreter.

3. The 80-20, 90-10, or 99-1 rule: the majority of the code accounts for a small portion of the runtime. It is wasteful to optimize "everything". The more speed you need, the stronger this asymmetry will be. Identify the bottlenecks with a profiler and optimize those.

4. Using C or Java does not give you a faster hard drive or a faster network connection. You cannot improve on network access by using threads in C or Java instead of threads in Python. If your code is i/o bound, Python's GIL does not matter. Python threads do execute i/o tasks in parallel. (This is the major misunderstanding; see the small sketch after this message.)

5. Computationally intensive parts of a program are usually taken care of in libraries like BLAS, LAPACK, and FFTW. The Fortran code in LAPACK does not care if you called it from Python. It will be as fast as it can be, independent of Python. The Fortran code in LAPACK also has no concept of Python's GIL. LAPACK libraries like Intel MKL can use threads internally without asking Python for permission.

6. The scalability problem when using Python on a massive supercomputer is not the GIL but the module import.

7. When using OpenCL we write kernels as plain text. Python is excellent at manipulating text, more so than C. This also applies to using OpenGL for computer graphics with GLSL shaders and vertex buffer objects. If you need the GPU, you can just as well drive it from Python on the CPU.

Sturla
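Point 4 above is easy to demonstrate with nothing but the stdlib. A small sketch; the sleep is a stand-in for a blocking network or disk call, and the timings are only indicative:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def fake_io(i):
        time.sleep(1)          # the GIL is released while the thread blocks here
        return i

    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(fake_io, range(10)))
    # prints roughly 1 second, not 10: the ten waits overlap
    print(results, "in about", round(time.time() - start, 1), "seconds")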
On Wed, Jun 24, 2015 at 10:28 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 24/06/15 07:01, Eric Snow wrote:
Well, perception is 9/10ths of the law. :) If the multi-core problem is already solved in Python then why does it fail in the court of public opinion. The perception that Python lacks a good multi-core story is real, leads organizations away from Python, and will not improve without concrete changes.
I think it is a combination of FUD and the lack of fork() on Windows. There is a lot of utterly wrong information about CPython and its GIL.
Thanks for a clear summary of the common misunderstandings. While I agreed with your points, they are mostly the same things we have been communicating for many years, to no avail. They are also oriented toward larger-scale parallelism (which I don't mean to discount). That makes it easier to misunderstand. Why? Because there are enough caveats and performance downsides (see Dave Beazley's PyCon 2015 talk) that most folks stop trying to rationalize, throw their hands up, and say "Python concurrency stinks" and "you can't *really* do multicore on Python". I have personal experience with high-profile decision makers where this is exactly what happened, with adverse consequences to support for Python within the organizations. To change this perception we need to give folks a simpler, performant concurrency model that takes advantage of multiple cores. My proposal is all about doing at least *something* that makes Python's multi-core story obvious and undeniable. *That* is my entire goal with this proposal. Clearly I have opinions on the best approach to achieve that in the 3.6 timeframe. :) However, I am quite willing to investigate all the options (as I hope this thread demonstrates). So, again, thanks for the feedback and insight. You've provided me with plenty of food for thought. -eric
On 25/06/15 00:56, Eric Snow wrote:
Why? Because there are enough caveats and performance downsides (see Dave Beazley's PyCon 2015 talk) that most folks stop trying to rationalize, throw their hands up, and say "Python concurrency stinks" and "you can't *really* do multicore on Python".
Yes, that seems to be the case.
To change this perception we need to give folks a simpler, performant concurrency model that takes advantage of multiple cores. My proposal is all about doing at least *something* that makes Python's multi-core story obvious and undeniable.
I think the main issue with subinterpreters and a message-passing model is that it will be very difficult to avoid deep copies of Python objects. And in that case all we have achieved compared to multiprocessing is less scalability. Also you have not removed the GIL, so the FUD about the dreaded GIL will still be around. Clearly introducing multiprocessing in the standard library did nothing to reduce this. Sturla
On Wed, Jun 24, 2015 at 3:10 PM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
So there's two reasons I can think of to use threads for CPU parallelism:
- My thing does a lot of parallel work, and so I want to save on memory by sharing an address space
This only becomes an especially pressing concern if you start running tens of thousands or more of workers. Fork also allows this.
Not necessarily true... e.g., see two threads from yesterday (!) on the pandas mailing list, from users who want to perform queries against a large data structure shared between threads/processes: https://groups.google.com/d/msg/pydata/Emkkk9S9rUk/eh0nfiGR7O0J https://groups.google.com/forum/#!topic/pydata/wOwe21I65-I ("Are we just screwed on windows?") -n -- Nathaniel J. Smith -- http://vorpus.org
On Wed, Jun 24, 2015 at 04:55:31PM -0700, Nathaniel Smith wrote:
On Wed, Jun 24, 2015 at 3:10 PM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
So there's two reasons I can think of to use threads for CPU parallelism:
- My thing does a lot of parallel work, and so I want to save on memory by sharing an address space
This only becomes an especially pressing concern if you start running tens of thousands or more of workers. Fork also allows this.
Not necessarily true... e.g., see two threads from yesterday (!) on the pandas mailing list, from users who want to perform queries against a large data structure shared between threads/processes:
https://groups.google.com/d/msg/pydata/Emkkk9S9rUk/eh0nfiGR7O0J https://groups.google.com/forum/#!topic/pydata/wOwe21I65-I ("Are we just screwed on windows?")
Ironically (not knowing anything about Pandas' implementation details other than... "Cython... and NumPy"), there should be no difference between getting a Pandas DataFrame available to PyParallel and a NumPy ndarray or Cythonized C-struct (like datrie). The situation Ryan describes is literally the exact situation that PyParallel excels at: large reference data structures accessible in parallel contexts. Trent.
Trent Nelson <trent@snakebite.org> wrote:
The situation Ryan describes is literally the exact situation that PyParallel excels at: large reference data structures accessible in parallel contexts.
Back in 2009 I solved this for multiprocessing using a NumPy array that used shared memory as its backend (Sys V IPC, not BSD mmap, on Mac and Linux). By monkey-patching the pickling of numpy.ndarray, the contents of the shared memory buffer were not pickled, only the metadata needed to reopen the shared memory. After a while it stopped working on Mac (I haven't had time to fix it -- maybe I should), but it still works on Windows. :(

Anyway, there is another library that does something similar called joblib. It is used for parallel computing in scikit-learn. It creates shared memory by mmap from /tmp, which means it is only shared memory on Linux. On Mac and Windows there is no tmpfs, so it ends up using a physical file on disk instead. :-(

Sturla
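A rough sketch of that joblib-style mmap approach, for readers who have not seen it; the file name and sizes are made up, and as noted above a file under /tmp is only tmpfs-backed on Linux, while on Mac and Windows it would be a real file on disk:

    import numpy as np
    from multiprocessing import Pool

    FNAME = "/tmp/shared_block.dat"
    SHAPE = (1000000, 8)

    def column_sum(col):
        # each worker re-opens the same mmap'ed file read-only;
        # the array contents are never pickled
        arr = np.memmap(FNAME, dtype="float64", mode="r", shape=SHAPE)
        return float(arr[:, col].sum())

    if __name__ == "__main__":
        arr = np.memmap(FNAME, dtype="float64", mode="w+", shape=SHAPE)
        arr[:] = np.random.rand(*SHAPE)
        arr.flush()
        with Pool(4) as pool:
            print(pool.map(column_sum, range(SHAPE[1])))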
On Wed, Jun 24, 2015 at 10:28 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
The reality is that Python is used on even the largest supercomputers. The scalability problem that is seen on those systems is not the GIL, but the module import. If we have 1000 CPython processes importing modules like NumPy simultaneously, they will do a "denial of service attack" on the file system. This happens when the module importer generates a huge number of failed open() calls while trying to locate the module files.
There is even described in a paper on how to avoid this on an IBM Blue Brain: "As an example, on Blue Gene P just starting up Python and importing NumPy and GPAW with 32768 MPI tasks can take 45 minutes!"
I'm curious what difference there is under Python 3.4 (or even 3.3). Along with being almost entirely pure Python, the import system now has some optimizations that help mitigate filesystem access (particularly stats). Regardless, have there been any attempts to address this situation? I'd be surprised if there haven't. :) Is the solution described in the cited paper sufficient? Earlier Barry brought up Emacs's unexec as at least an inspiration for a solution. I expect there are a number of approaches. It would be nice to address this somehow (though unrelated to my multi-core proposal). I would expect that it could also have bearing on interpreter start-up time. If it's worth pursuing then consider posting something to import-sig. -eric
On Thu, 25 Jun 2015 at 02:57 Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Wed, Jun 24, 2015 at 10:28 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
The reality is that Python is used on even the largest supercomputers. The scalability problem that is seen on those systems is not the GIL, but the module import. If we have 1000 CPython processes importing modules like NumPy simultaneously, they will do a "denial of service attack" on the file system. This happens when the module importer generates a huge number of failed open() calls while trying to locate the module files.
There is even described in a paper on how to avoid this on an IBM Blue Brain: "As an example, on Blue Gene P just starting up Python and importing NumPy and GPAW with 32768 MPI tasks can take 45 minutes!"
I'm curious what difference there is under Python 3.4 (or even 3.3). Along with being almost entirely pure Python, the import system now has some optimizations that help mitigate filesystem access (particularly stats).
From the HPC setup that I use there does appear to be some difference. The number of syscalls required to import numpy is significantly lower with 3.3 than 2.7 in our setup (I don't have 3.4 in there and I didn't compile either of these myself):
$ strace python3.3 -c "import numpy" 2>&1 | egrep -c '(open|stat)'
1315
$ strace python2.7 -c "import numpy" 2>&1 | egrep -c '(open|stat)'
4444

It doesn't make any perceptible difference when running "time python -c 'import numpy'" on the login node. I'm not going to request 1000 cores in order to test the difference properly. Also note that profiling in these setups is often complicated by the other concurrent users of the system.

-- Oscar
On 25 June 2015 at 02:28, Sturla Molden <sturla.molden@gmail.com> wrote:
On 24/06/15 07:01, Eric Snow wrote:
Well, perception is 9/10ths of the law. :) If the multi-core problem is already solved in Python then why does it fail in the court of public opinion. The perception that Python lacks a good multi-core story is real, leads organizations away from Python, and will not improve without concrete changes.
I think it is a combination of FUD and the lack of fork() on Windows. There is a lot of utterly wrong information about CPython and its GIL.
The reality is that Python is used on even the largest supercomputers. The scalability problem that is seen on those systems is not the GIL, but the module import. If we have 1000 CPython processes importing modules like NumPy simultaneously, they will do a "denial of service attack" on the file system. This happens when the module importer generates a huge number of failed open() calls while trying to locate the module files.
Slight tangent, but folks hitting this issue on 2.7 may want to investigate Eric's importlib2: https://pypi.python.org/pypi/importlib2 It switches from stat-based searching for files to the Python 3.3+ model of directory listing based searches, which can (anecdotally) lead to a couple of orders of magnitude of improvement in startup for code loading modules from NFS mounts.
And while CPython is being used for massive parallel computing to e.g. model the global climate system, there is this FUD that CPython does not even scale up on a laptop with a single multicore CPU. I don't know where it is coming from, but it is more FUD than truth.
Like a lot of things in the vast sprawling Python ecosystem, I think there are aspects of this that are a discoverability problem more so than a capability problem. When you're first experimenting with parallel execution, a lot of the time folks start with computational problems like executing multiple factorials at once. That's trivial to do across multiple cores even with a threading model like JavaScript's worker threads, but can't be done in CPython without reaching for the multiprocessing module.

This is the one place where I'll concede that folks learning to program on Windows or the JVM and hence getting the idea that "creating threads is fast, creating processes is slow" causes problems: folks playing with this kind of thing are far more likely to go "import threading" than they are "import multiprocessing" (and likewise for the ThreadPoolExecutor vs the ProcessPoolExecutor if using concurrent.futures), and their reaction when it doesn't work is far more likely to be "Python can't do this" than it is "I need to do this differently in Python from the way I do it in C/C++/Java/JavaScript".
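That factorial experiment is worth spelling out, since it is exactly what newcomers try first. A minimal sketch comparing the two executors; the sizes and worker counts are arbitrary:

    import math
    import time
    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def work(n):
        # CPU-bound and holds the GIL for the whole call
        return len(str(math.factorial(n)))

    if __name__ == "__main__":
        jobs = [40000] * 8
        for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
            start = time.time()
            with cls(max_workers=4) as pool:
                list(pool.map(work, jobs))
            print(cls.__name__, round(time.time() - start, 2), "s")
        # the thread pool is serialized by the GIL; only the process
        # pool actually spreads the work across cores

On a multicore machine the process pool should finish several times faster, which is the observation that sends people to multiprocessing in the first place.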
The main answers to FUD about the GIL and Python in scientific computing are these:
It generally isn't scientific programmers I personally hit problems with (although we have to allow for the fact many of the scientists I know I met *because* they're Pythonistas). For that use case, there's not only HPC to point to, but a number of papers that talk about Cython and Numba in the same breath as C, C++ and FORTRAN, which is pretty spectacular company to be in when it comes to numerical computation. Being the fourth language Nvidia supported directly for CUDA doesn't hurt either.

Instead, the folks that I think have a more valid complaint are the games developers, and the folks trying to use games development as an educational tool. They're not doing array-based programming the way numeric programmers are (so the speed of the NumPy stack isn't any help), and they're operating on shared game state and frequently chattering back and forth between threads of control, so high-overhead message passing poses a major performance problem.

That does suggest to me a possible "archetypal problem" for the work Eric is looking to do here: a 2D canvas with multiple interacting circles bouncing around. We'd like each circle to have its own computational thread, but still be able to deal with the collision physics when they run into each other. We'll assume it's a teaching exercise, so "tell the GPU to do it" *isn't* the right answer (although it might be an interesting entrant in a zoo of solutions). Key performance metric: frames per second.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 26 June 2015 at 00:08, Nick Coghlan <ncoghlan@gmail.com> wrote:
That does suggest to me a possible "archetypal problem" for the work Eric is looking to do here: a 2D canvas with multiple interacting circles bouncing around. We'd like each circle to have its own computational thread, but still be able to deal with the collision physics when they run into each other. We'll assume it's a teaching exercise, so "tell the GPU to do it" *isn't* the right answer (although it might be an interesting entrant in a zoo of solutions). Key performance metric: frames per second
The more I think about it, the more I think this (or at least something along these lines) makes sense as the archetypal problem to solve here.

1. It avoids any temptation to consider the problem potentially IO bound, as the only IO is rendering the computational results to the screen

2. Scaling across multiple machines clearly isn't relevant, since we're already bound to a single machine due to the fact we're rendering to a local display

3. The potential for collisions between objects means it isn't an embarrassingly parallel problem where the different computational threads can entirely ignore the existence of the other threads

4. "Frames per second" is a nice simple metric that can be compared across threading, multiprocessing, PyParallel, subinterpreters, mpi4py and perhaps even the GPU (which will no doubt thump the others soundly, but the comparison may still be interesting); a rough serial baseline is sketched after this message

5. It's a problem domain where we know Python isn't currently a popular choice, and there are valid technical reasons (including this one) for that lack of adoption

6. It's a problem domain we know folks in the educational community are interested in seeing Python get better at, as building simple visual animations is often a good way to introduce programming in general (just look at the design of Scratch)

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
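To make the benchmark idea concrete, here is a rough serial baseline; all constants are arbitrary and there is no rendering, just the physics loop and the frames-per-second figure. The interesting exercise is re-implementing the update loop with threads, processes, subinterpreters, and so on, and comparing the number printed at the end:

    import random
    import time

    N, SIZE, RADIUS = 200, 1000.0, 5.0
    circles = [[random.uniform(0, SIZE), random.uniform(0, SIZE),
                random.uniform(-2, 2), random.uniform(-2, 2)]   # x, y, vx, vy
               for _ in range(N)]

    def step(c):
        c[0] += c[2]
        c[1] += c[3]
        if not RADIUS <= c[0] <= SIZE - RADIUS:
            c[2] = -c[2]
        if not RADIUS <= c[1] <= SIZE - RADIUS:
            c[3] = -c[3]

    def collide(a, b):
        dx, dy = a[0] - b[0], a[1] - b[1]
        if dx * dx + dy * dy < (2 * RADIUS) ** 2:
            a[2], b[2] = b[2], a[2]   # crude elastic swap of velocities
            a[3], b[3] = b[3], a[3]

    frames, start = 0, time.time()
    while time.time() - start < 2.0:
        for c in circles:
            step(c)
        for i in range(N):
            for j in range(i + 1, N):
                collide(circles[i], circles[j])
        frames += 1
    print("frames per second:", round(frames / (time.time() - start), 1))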
On 25/06/15 16:31, Nick Coghlan wrote:
3. The potential for collisions between objects means it isn't an embarrassingly parallel problem where the different computational threads can entirely ignore the existence of the other threads
Well, you can have a loop that updates all particles, e.g. by calling a coroutine associated with each particle, and then this loop is an embarrassingly parallel problem. You don't need to associate each particle with its own thread. It is bad to teach students to use one thread per particle anyway. Suddenly they write a system that has thousands of threads. Sturla
On Thu, Jun 25, 2015 at 10:25 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 25/06/15 16:31, Nick Coghlan wrote:
3. The potential for collisions between objects means it isn't an
embarrassingly parallel problem where the different computational threads can entirely ignore the existence of the other threads
Well, you can have a loop that updates all particles, e.g. by calling a coroutine associated with each particle, and then this loop is an embarrassingly parallel problem. You don't need to associate each particle with its own thread.
It is bad to teach students to use one thread per particle anyway. Suddenly they write a system that have thousands of threads.
Understood that this is merely an example re: threading, but BSP seems to be the higher-level algorithm for iterative graphs with topology: * https://en.wikipedia.org/wiki/Bulk_synchronous_parallel * http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-go... * https://giraph.apache.org/ * https://spark.apache.org/docs/latest/ * https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-ap... (BSP) * https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.h... * https://spark.apache.org/docs/latest/api/python/ (no graphx BSP yet, unfortunately) * https://github.com/xslogic/phoebus (Erlang, HDFS, Thrift) * https://github.com/mnielsen/Pregel/blob/master/pregel.py (Python) Intra-machine optimization could also be useful.
On 06/25/2015 08:25 AM, Sturla Molden wrote:
On 25/06/15 16:31, Nick Coghlan wrote:
3. The potential for collisions between objects means it isn't an embarrassingly parallel problem where the different computational threads can entirely ignore the existence of the other threads
Well, you can have a loop that updates all particles, e.g. by calling a coroutine associated with each particle, and then this loop is an embarrassingly parallel problem. You don't need to associate each particle with its own thread.
It is bad to teach students to use one thread per particle anyway. Suddenly they write a system that have thousands of threads.
Speaking as a novice to this area, I do understand that what we learn with may not be (and usually isn't) production-ready code, but I do see Nick's suggestion as being one that is easy to understand, easy to measure, and good for piquing interest. At least, I'm now interested. :) (look ma! bowling for circles!) -- ~Ethan~
On 26 Jun 2015 01:27, "Sturla Molden" <sturla.molden@gmail.com> wrote:
On 25/06/15 16:31, Nick Coghlan wrote:
3. The potential for collisions between objects means it isn't an embarrassingly parallel problem where the different computational threads can entirely ignore the existence of the other threads
Well, you can have a loop that updates all particles, e.g. by calling a
coroutine associated with each particle, and then this loop is an embarrassingly parallel problem. You don't need to associate each particle with its own thread.
It is bad to teach students to use one thread per particle anyway.
Suddenly they write a system that has thousands of threads.

And when they hit that scaling limit is when they'll need to learn why this simple approach doesn't scale very well, just as purely procedural programming doesn't handle increasing structural complexity, and just as the "c10m" problem (like the "c10k" problem before it) is teaching our industry as a whole some important lessons about scalable hardware and software design: http://c10m.robertgraham.com/p/manifesto.html

There are limits to the degree that education can be front-loaded before all the pre-emptive "you'll understand why this is important later" concerns become a barrier to learning the fundamentals, rather than a useful aid. Sometimes folks really do need to encounter a problem themselves in order to appreciate the value of the more complex solutions that make it possible to get past those barriers.

Cheers, Nick.
Sturla
On 25/06/15 16:08, Nick Coghlan wrote:
It generally isn't scientific programmers I personally hit problems with (although we have to allow for the fact many of the scientists I know I met *because* they're Pythonistas). For that use case, there's not only HPC to point to, but a number of papers that talk about Cython and Numba in the same breath as C, C++ and FORTRAN, which is pretty spectacular company to be in when it comes to numerical computation.
Cython can sometimes give the same performance as C or Fortran, but as soon as you start to use classes in the Cython code you run into GIL issues. It is not that the GIL is a problem per se, but because Cython compiles to C, the GIL is not released until the Cython function returns. That is, unless you manually release it inside Cython. This e.g. means that the interpreter might be locked for longer durations, and if you have a GUI it becomes unresponsive. The GIL is more painful in Cython than in Python. Personally I often end up writing a mix of Cython and C or C++. Numba is impressive but still a bit immature. It is an LLVM based JIT compiler for CPython that for simple computational tasks can give performance similar to C. It can also run Python code on Nvidia GPUs. Numba is becoming what the dead swallow should have been.
Instead, the folks that I think have a more valid complaint are the games developers, and the folks trying to use games development as an educational tool.
I have not developed games myself, but for computer graphics with OpenGL there is certainly no reason to complain. NumPy arrays are great for storing vertex and texture data. OpenGL with NumPy is just as fast as OpenGL with C arrays. GLSL shaders are just plain text, Python is great for that. Cython and Numba are both great if you call glVertex* functions the old way, doing this as fast as C. Display lists are also equally fast from Python and C. But if you start to call glVertex* multiple times from a Python loop, then you're screwed.
That does suggest to me a possible "archetypal problem" for the work Eric is looking to do here: a 2D canvas with multiple interacting circles bouncing around. We'd like each circle to have its own computational thread, but still be able to deal with the collision physics when they run into each other.
There are people doing Monte Carlo simulations with thousands or millions of particles, but not with one thread per particle. :-) Sturla
On Tue, Jun 23, 2015 at 11:01:24PM -0600, Eric Snow wrote:
On Sun, Jun 21, 2015 at 5:41 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
From the perspective of software design, it would be good it the CPython interpreter provided an environment instead of using global objects. It would mean that all functions in the C API would need to take the environment pointer as their first variable, which will be a major rewrite. It would also allow the "one interpreter per thread" design similar to tcl and .NET application domains.
While perhaps a worthy goal, I don't know that it fits in well with my goals. I'm aiming for an improved multi-core story with a minimum of change in the interpreter.
This slide and the following two are particularly relevant: https://speakerdeck.com/trent/parallelism-and-concurrency-with-python?slide=... I elicit three categories of contemporary problems where efficient use of multiple cores would be desirable: 1) Computationally-intensive work against large data sets (the traditional "parallel" HPC/science/engineering space, and lately, to today's "Big Data" space). 2a) Serving tens/hundreds of thousands of network clients with non-trivial computation required per-request (i.e. more than just buffer copying between two sockets); best example being the modern day web server, or: 2b) Serving far fewer clients, but striving for the lowest latency possible in an environment with "maximum permitted latency" restrictions (or percentile targets, 99s etc). In all three problem domains, there is a clear inflection point at which multiple cores would overtake a single core in either: 1) Reducing the overall computation time. 2a|b) Serving a greater number of clients (or being able to perform more complex computation per request) before hitting maximum permitted latency limits. For PyParallel, I focused on 2a and 2b. More specifically, a TCP/IP socket server that had the ability to dynamically adjust its behavior (low latency vs concurrency vs throughput[1]), whilst maintaining optimal usage of underlying hardware[2]. That is: given sufficient load, you should be able to saturate all I/O channels (network and disk), or all cores, or both, with *useful* work. (The next step after saturation is sustained saturation (given sufficient load), which can be even harder to achieve, as you need to factor in latencies for "upcoming I/O" ahead of time if your computation is driven by the results of a disk read (or database cursor fetch).) (Sturla commented on the "import-DDoS" that you can run into on POSIX systems, which is a good example. You're saturating your underlying hardware, sure, but you're not doing useful work -- it's important to distinguish the two.) Dynamically adjusting behavior based on low latency vs concurrency vs throughput: [1]: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite... https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite... Optimal hardware use: [2]: https://speakerdeck.com/trent/parallelism-and-concurrency-with-python?slide=... So, with the focus of PyParallel established (socket server that could exploit all cores), my hypothesis was that I could find a new way of doing things that was more performant than the status quo. (In particular, I wanted to make sure I had an answer for "why not just use multiprocessing?" -- which is an important question.) https://speakerdeck.com/trent/parallelism-and-concurrency-with-python?slide=... So, I also made the decision to leverage threads for parallelism and not processes+IPC, which it sounds like you're leaning toward as well. Actually, other than the subinterpreter implementation aspect, everything you've described is basically on par with PyParallel, more or less. Now, going back to your original comment:
While perhaps a worthy goal, I don't know that it fits in well with my goals. I'm aiming for an improved multi-core story with a minimum of change in the interpreter.
That last sentence is very vague as multi-core means different things to different people. What is the problem domain you're going to try and initially target? Computationally-intensive parallel workloads like in 1), or the network I/O-driven socket server stuff like in 2a/2b? I'd argue it should be the latter. Reason being is that you'll rarely see the former problem tackled solely by pure Python -- e.g. Python may be gluing everything together, but the actual computation will be handled by something like NumPy/Numba/Fortran/Cython or custom C stuff, and, as Sturla's mentioned, OpenMP and MPI usually gets involved to manage the parallel aspect. For the I/O-driven socket server stuff, though, you already have this nice delineation of what would be run serially versus what would be ideal to run in parallel: import datrie import numpy as np import pyodbc import async from collections import defaultdict from async.http.server import ( router, make_routes, HttpServer, RangedRequest, ) # Tell PyParallel to invoke the tp_dealloc method explicitly # for these classes when rewinding a heap after a parallel # callback has finished. (Implementation detail: this toggles # the Py_TPFLAGS_PX_DEALLOC flag in the TypeObject's tp_flags; # when PyParallel intercepts PyObject_NEW/INIT (init_object), # classes (PyTypeObject *tp) with this flag set will be tracked # in a linked-list that is local to the parallel context being # used to service this client. When the context has its heaps # rewound back to the initial state at the time of the snapshot, # it will call tp_dealloc() explicitly against all objects of # this type that were encountered.) async.register_dealloc(pyodbc.Connection) async.register_dealloc(pyodbc.Cursor) async.register_dealloc(pyodbc.Row) # Load 29 million titles. RSS += ~9.5GB. TITLES = datrie.Trie.load('titles.trie') # Load 15 million 64-bit offsets. RSS += ~200MB. OFFSETS = np.load('offsets.npy') XML = 'enwiki-20150205-pages-articles.xml' class WikiServer(HttpServer): # All of these methods are automatically invoked in # parallel. HttpServer implements a data_received() # method which prepares the request object and then # calls the relevant method depending on the URL, e.g. # http://localhost/user/foo will call the user(request, # name='foo'). If we want to "write" to the client, # we return a bytes, bytearray or unicode object from # our callback (that is, we don't expose a socket.write() # to the user). # # Just before the PyParallel machinery invokes the # callback (via a simple PyObject_CallObject), though, # it takes a snapshot of its current state, such that # the exact state can be rolled back to (termed a socket # "rewind") when this callback is complete. If we don't # return a sendable object back, this rewind happens # immediately, and then we go straight into a read call. # If we do return something sendable, we send it. When # that send completes, *then* we do the rewind, then we # issue the next read/recv call. # # This approach is particularly well suited to parallel # callback execution because none of the objects we create # as part of the callback are needed when the callback # completes. No garbage can accumulate because nothing # can live longer than that callback. That obviates the # need for two things: reference counting against any object # in a parallel context, and garbage collection. Those # things are useful for the main thread, but not parallel # contexts. # # What if you do want to keep something around after # the callback? 
If it's a simple scalar type, the # following will work: # class Server: # name = None # @route # def set_name(self, request, name): # self.name = name.upper() # ^^^^^^^^^ we intercept that setattr and make # a copy of (the result of) name.upper() # using memory allocation from a different # heap that persists as long as the client # stays connnected. (There's actually # support for alternatively persisting # the entire heap that the object was # allocated from, which we could use if # we were persisting complex, external, # or container types where simply doing # a memcpy() of a *base + size_t wouldn't # be feasible. However, I haven't wired # up this logic to the socket context # logic yet.) # @route # def name(self, request): # return json_serialization(request, self.name) # ^^^^^^^^^ # This will return whatever # was set in the call above. # Once the client disconnects, # the value disappears. # # (Actually I think if you wanted to persist the object # for the lifetime of the server, you could probably # do `request.transport.parent.name = xyz`; or at least, # if that doesn't currently work, the required mechanics # definitely exist, so it would just need to be wired # up.) # # If you want to keep an object around past the lifetime of # the connected client and the server, then send it to the main # thread where it can be tracked like a normal Python object: # # USERS = async.dict() # ^^^^^^^^^^^^ shortcut for: # foo = {} # async.protect(foo) # or just: # foo = async.protect({}) # (On the backend, this instruments[3] the object such # that PyParallel can intercept setattr/setitem and # getattr/getitem calls and "do stuff"[4], depending # on the context.) [3]: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33... [4]: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33... # # class MyServer(HttpServer): # @route # ^^^^^^ Ignore the mechanics of this, it's just a helper # decorator I used to translate a HTTP GET for # /login/foo to a function call of `login(name='foo')`. # (see the bowls of async.http.server for details). # def login(self, request, name): # @call_from_main_thread # def _save_name(n): # USERS[n] = async.rdtsc() # return len(USERS) # count = _save_name(name) # return json_serialization(request, {'count': count}) # # The @call_from_main_thread decorator will enqueue a work # item to the main thread, and then wait on the main thread's # response. The main thread executes the callback and notifies # the parallel thread that the call has been completed and the # return value (in this case the value of `len(USERS)`). The # parallel thread resumes and finishes the client request. # Note that this will implicitly serialize execution; any number # of parallel requests can submit main thread work, but the # main thread can only call them one at a time. So, you'd # usually try and avoid this, or at least remove it from your # application's hot code path. connect_string = None all_users_sql = 'select * from user' one_user_sql = 'select * from user where login = ?' secret_key = None @route def wiki(self, request, name): # http://localhost/wiki/Python: name = Python if name not in TITLES: self.error(request, 404) # log(n) lookup against a trie with 29 million keys. offset = TITLES[name][0] # log(n) binary search against a numpy array with 15 # million int64s. ix = OFFSETS.searchsorted(offset, side='right') # OFFSETS[ix] = what's the offset after this? 
(start, end) = (ix-7, OFFSETS[ix]-11) # -7, +11 = adjust for the fact that all of the offsets # were calculated against the '<' of '<title>Foo</title>'. range_request = '%d-%d' % (start, end) request.range = RangedRequest(range_request) request.response.content_type = 'text/xml; charset=utf-8' return self.sendfile(request, XML) @route def users(self, request): # ODBC driver managers that implement connection pooling # behind the scenes play very nicely with our # pyodbc.connect() call here, returning a connection # from the pool (when able) without blocking. con = pyodbc.connect(self.connect_string) # The next three odbc calls would all block (in the # traditional sense), so this current thread would # not be able to serve any other requests whilst # waiting for completion -- however, this is far # less of a problem for PyParallel than single-threaded # land as other threads will keep servicing requests # in the mean time. (ODBC 3.8/SQL Server 2012/Windows 8 # did introduce async notification, such that we could # request that an event be set when the cursor/query/call # has completed, which we'd tie in to PyParallel by # submitting a threadpool wait (much like we do for async # DNS lookup[5], also added in Windows 8), however, it was # going to require a bit of modification to the pyodbc # module to support the async calling style, so, all the # calls stay synchronous for now.) [5]: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33... cur = con.cursor() cur.execute(self.all_users_sql) return json_serialization(request, cur.fetchall()) @route def user(self, request, login): con = pyodbc.connect(self.connect_string) cur = con.cursor() cur.execute(self.one_user_sql, (login,)) return json_serialization(request, cur.fetchall()) @route def set_secret_key(self, request, key): # http://localhost/set_secret_key/foobar # An example of persisting a scalar for the lifetime # of the thread (that is, until it disconects or EOFs). try: self.secret_key = [ key, ] except ValueError: # This would be hit, because we've got guards in place # to assess the "clonability" of an object at this # point[6]. (Ok, after reviewing the code, we don't, # but at least we'd crash.) [6]: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33... # However, this would work fine, essentially memcpy'ing # the key object at the time of assignment using a different # heap to the one that automatically gets reset at the end # of the callback. self.secret_key = key @route def secret_key(self, request): # http://localhost/secret_key -> 'foobar' return json_serialization(request, {'key': self.secret_key}) @route def stats(self, request): # Handy little json representation of various system stats; # active parallel contexts, I/O hogs, memory load, etc. stats = { 'system': dict(sys_stats()), 'server': dict(socket_stats(request.transport.parent)), 'memory': dict(memory_stats()), 'contexts': dict(context_stats()), 'elapsed': request.transport.elapsed(), 'thread': async.thread_seq_id(), } return json_serialization(request, stats) @route def debug(self, request): # Don't call print() or any of the sys.std(err|out) # methods in a parallel context. If you want to do some # poor man's debugging with print statements in lieu of not # being able to attach a pdb debugger (tracing is disabled # in parallel threads), then use async.debug(). (On # Windows, this writes the message to the debug stream, # which you'd monitor via dbgview or VS.) 
async.debug("received request: %s" % request.data) # Avoid repr() at the moment in parallel threads; it uses # PyThreadState_SetDictItem() to control recursion depths, # which I haven't made safe to call from a parallel context. # If you want to attach Visual Studio debugger at this point # though, you can do so via: async.debugbreak() # (That literally just generates an INT 3.) @route def shutdown(self, request): # Handy helper for server shutdown (stop listening on the # bound IP:PORT, wait for all running client callbacks to # complete, then return. Totally almost works at the # moment[7].) [7]: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33... request.transport.shutdown() def main(): server = async.server('0.0.0.0', port) protocol = HttpServer protocol.connect_string = 'Driver={SQL Server}...' async.register(transport=server, protocol=protocol) ^^^^^^^^^^^^^^ this will create a special 'server' instance of the protocol, which will issue the bind() call. It then creates a configurable number (currently ncpu * 2) of parallel contexts and triggers parallel AcceptEx() invocation (you can prime "pre-accepted" sockets on Windows, which removes the serialization limits of accept() on POSIX). # If an exception occurs in a parallel thread, it is queued # to a special list the main thread has. The main thread # checks this list each time async.run_once() is called, so, # we call it here just to propagate any exceptions that # may have already occurred (like attempting to bind to an # invalid IP, or submitting a protocol that had an error). async.run_once() return server # (This also facilitates interactive console usage whilst # serving request in parallel.) if __name__ == '__main__': main() # Run forever. Returns when there are no active contexts # or ctrl-c is pressed. async.run() All of that works *today* with PyParallel. The main thread preps everything, does the importing, loads the huge data structures, establishes all the code objects and then, once async.run() is called, sits there dormant waiting for feedback from the parallel threads. It's not perfect; I haven't focused on clean shutdown yet, so you will 100% crash if you ctrl-C it currently. That's mainly an issue with interpreter finalization destroying the GIL, which clears our Py_MainThreadId, which makes all the instrumented macros like Py_INCREF/Py_DECREF think they're in a parallel context when they're not, which... well, you can probably guess what happens after that if you've got 8 threads still running at the time pointer dereferencing things that aren't what they think they are. None of the problems are showstoppers though, it's just a matter of prioritization and engineering effort. My strategic priorities to date have been: a) no changes to semantics of CPython API b) high performance c) real-world examples Now, given that this has been something I've mostly worked on in my own time, my tactical priority each development session (often started after an 8 hour work day where I'm operating at reduced brain power) is simply: a) forward progress at any cost The quickest hack I can think of that'll address the immediate problem is the one that gets implemented. That hack will last until it stops working, at which point, the quickest hack I can think of to replace it wins, and so on. At no time do I consider the maintainability, quality or portability of the hack -- as long as it moves the overall needle forward, perfect; it can be made elegant later. 
I think it's important to mention that, because if you're reviewing the source code, it helps explain things like how I implemented the persistence of an object within a client session (e.g. intercepting the setattr/setitem and doing the alternate heap memcpy dance alluded to above):

https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

Without that bit of code, you'll leak memory; with it, you won't.

I attacked pyodbc a few weeks ago -- it was also leaking memory when called from parallel callbacks, because tp_dealloc wasn't being called on any of the Connection, Cursor or Row objects, so handles that were allocated (i.e. SQLAllocHandle()) were never paired with a SQLFreeHandle().  (We don't refcount in a parallel context, which means there's never a Py_DECREF that hits 0, which means Py_Dealloc() never gets called for that object -- which works fine for everything that allocates via the PyObject/PyMem facilities, because we intercept those and roll them back in bulk.)  And thus, a leak.

Quickest fix I could think of at the time:

    async.register_dealloc(pyodbc.Connection)
    async.register_dealloc(pyodbc.Cursor)
    async.register_dealloc(pyodbc.Row)

Which facilitates this during our interception of PyObject_NEW/INIT:

https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

Which allows us to do this for each heap...

https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

...that we encounter as part of "socket rewinding":

https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

It's an absolutely horrendous hack from a software engineering perspective, but it is surprisingly effective at solving the problem.

Regards,

    Trent.
On Wed, Jun 24, 2015 at 9:59 PM, Trent Nelson <trent@snakebite.org> wrote:
(Sturla commented on the "import-DDoS" that you can run into on POSIX systems, which is a good example. You're saturating your underlying hardware, sure, but you're not doing useful work -- it's important to distinguish the two.)
To be clear, AFAIU the "import-DDoS" that supercomputers classically run into has nothing to do with POSIX; it has to do with running systems that were designed for simulation workloads that go like: generate a bunch of data from scratch in memory, crunch on it for a while, and then spit out some summaries. So you end up with $1e11 spent on increasing the FLOP count, and the absolute minimum spent on the storage system -- basically just enough to let you load a single static binary into memory at the start of your computation, and there might even be some specific hacks in the linker to minimize the cost of distributing that single binary load. (These are really weird architectures; they usually do not even have shared library support.)

And the result is that when you try spinning up a Python program instead, the startup sequence produces (number of imports) * (number of entries in sys.path) * (hundreds of thousands of nodes) simultaneous stat calls hammering some poor NFS server somewhere, and it falls over and dies. (I think often the network connection to the NFS server is not even using the ridiculously-fast interconnect mesh, but rather some plain-old-ethernet that gets saturated.) I could be wrong -- I don't actually work with these systems myself -- but that's what I've picked up.

Continuing my vague and uninformed impressions, I suspect that this would actually be relatively easy to fix by hooking the import system to do something more intelligent, like nominate one node as the leader and have it do the file lookups and then tell everyone else what it found (via the existing message-passing systems). Though there is an interesting problem of how you bootstrap the hook code.

But as to whether the new import hook stuff actually helps with this... I'm pretty sure most HPC centers haven't noticed that Python 3 exists yet. See above re: extremely weird architectures -- many of us are familiar with "clinging to RHEL 5" levels of conservatism, but that's nothing on "look, there's only one person who ever knew how to get a working python and numpy using our bespoke compiler toolchain on this architecture that doesn't support extension module loading (!!), and they haven't touched it in years either"...

There are lots of smart people working on this stuff right now. But they are starting from a pretty different place from those of us in the consumer computing world :-).

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
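[To make the "leader node" idea a bit more concrete, here is a minimal, illustrative sketch (an editorial example, not code from the thread) of an import hook where rank 0 does the sys.path search and file read and broadcasts the result over MPI, so the other ranks never touch the shared filesystem. It assumes mpi4py, handles plain source modules only (packages and extension modules fall through to the normal machinery), and requires every rank to perform the same imports in the same order, because the broadcast is a collective operation.]

    import sys
    import importlib.abc
    import importlib.machinery
    import importlib.util
    from mpi4py import MPI

    COMM = MPI.COMM_WORLD


    class LeaderFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
        # Only rank 0 stats sys.path and reads the source; everyone else
        # receives (origin, source) via a broadcast.

        def find_spec(self, fullname, path=None, target=None):
            payload = None
            if COMM.Get_rank() == 0:
                spec = importlib.machinery.PathFinder.find_spec(fullname, path)
                if (spec is not None and spec.origin
                        and spec.origin.endswith('.py')
                        and not spec.submodule_search_locations):
                    with open(spec.origin, 'rb') as f:
                        payload = (spec.origin, f.read())
            payload = COMM.bcast(payload, root=0)  # collective call on all ranks
            if payload is None:
                return None  # fall through to the regular finders
            self._origin, self._source = payload
            return importlib.util.spec_from_loader(fullname, self,
                                                   origin=self._origin)

        def create_module(self, spec):
            return None  # use the default module object

        def exec_module(self, module):
            code = compile(self._source, self._origin, 'exec')
            exec(code, module.__dict__)


    sys.meta_path.insert(0, LeaderFinder())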
Nathaniel Smith <njs@pobox.com> wrote:
Continuing my vague and uninformed impressions, I suspect that this would actually be relatively easy to fix by hooking the import system to do something more intelligent, like nominate one node as the leader and have it do the file lookups and then tell everyone else what it found (via the existing message-passing systems).
There are two known solutions. One is basically what you describe. The other, which at least works on IBM blue brain, is to import modules from a ramdisk. It seems to be sufficient to make sure whatever is serving the shared disk can deal with the 100k client DDoS. Sturla
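[A rough illustration of the ramdisk approach, under the assumption of a Linux-style tmpfs mount; the paths and the copied package are examples only. The idea is to stage the needed libraries onto node-local memory once, then put that directory at the front of sys.path so imports never hit the shared filesystem.]

    import os
    import shutil
    import sys

    RAMDISK = "/dev/shm/pylibs"   # tmpfs-backed directory on many Linux systems

    # Stage the libraries once per node (e.g. from a job prologue script),
    # then resolve all subsequent imports against the in-memory copy.
    if not os.path.isdir(RAMDISK):
        shutil.copytree("/shared/site-packages", RAMDISK)
    sys.path.insert(0, RAMDISK)

    import numpy  # now found on the ramdisk rather than the shared filesystem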
On 25.06.2015 11:35, Sturla Molden wrote:
Nathaniel Smith <njs@pobox.com> wrote:
Continuing my vague and uninformed impressions, I suspect that this would actually be relatively easy to fix by hooking the import system to do something more intelligent, like nominate one node as the leader and have it do the file lookups and then tell everyone else what it found (via the existing message-passing systems).
There are two known solutions. One is basically what you describe. The other, which at least works on IBM blue brain, is to import modules from a ramdisk. It seems to be sufficient to make sure whatever is serving the shared disk can deal with the 100k client DDoS.
Another way to solve this problem may be to use our eGenix PyRun, which embeds modules right in the binary. As a result, all reading is done from the mmap'ed binary and automatically shared between processes by the OS:

http://www.egenix.com/products/python/PyRun/

I don't know whether this actually works on an IBM Blue Brain with 100k clients - we are not fortunate enough to have access to one of those machines :-)

Note: Even though the data reading is shared, the resulting code and module objects are, of course, not shared, so you still have the overhead of using up memory for this, unless you init your process cluster using fork() after you've imported all necessary modules (then you benefit from the copy-on-write provided by the OS - code objects usually don't change after they have been created).

-- 
Marc-Andre Lemburg, eGenix.com
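[As a small sketch of the "import first, fork afterwards" pattern Marc-Andre describes -- the module imported here is just a stand-in for something large -- a fork-based multiprocessing start method lets the workers share the parent's already-imported modules copy-on-write, so they pay no per-process import cost.]

    import json  # stand-in for a "large" module imported once, up front
    import multiprocessing as mp


    def worker(n):
        # json (and anything else imported before the fork) is already present
        # in the child's address space, shared copy-on-write with the parent.
        return json.dumps({"worker": n})


    if __name__ == "__main__":
        ctx = mp.get_context("fork")   # POSIX-only; "spawn" would re-import
        with ctx.Pool(4) as pool:
            print(pool.map(worker, range(4)))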
On 21 June 2015 at 07:42, Eric Snow <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This proposal is meant to be a shot over the bow, so to speak. I plan on putting together a more complete PEP some time in the future, with content that is more refined along with references to the appropriate online resources.
Feedback appreciated! Offers to help even more so! :)
For folks interested in more of the background and design trade-offs involved here, with Eric's initial post published, I've now extracted and updated my old answer about the GIL from the Python 3 Q & A page, and turned it into its own article:

http://python-notes.curiousefficiency.org/en/latest/python3/multicore_python...

Cheers,
Nick.

P.S. The entry for the old Q&A answer is still there, but now redirects to the new article:

http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_an...

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
First, a minor question: instead of banning fork entirely within subinterpreters, why not just document that it is illegal to do anything between fork and exec in a subinterpreter, except for a very small (but possibly extensible) subset of Python? For example, after fork, you can no longer access any channels, and you also can't use signals, threads, fork again, imports, assignments to builtins, raising exceptions, or a whole host of other things (but of course if you exec an entirely new Python interpreter, it can do any of those things). C extension modules could just have a flag that marks whether the whole module is fork-safe or not (defaulting to not).

So, this allows a subinterpreter to use subprocess (or even multiprocessing, as long as you use the forkserver or spawn mechanism), and it gives code that intentionally wants to do tricky/dangerous things a way to do them, but it avoids all of the problems with accidentally breaking a subinterpreter by forking it and then doing bad things.

Second, a major question: In this proposal, are builtins and the modules map shared, or copied?

If they're copied, it seems like it would be hard to do that even as efficiently as multiprocessing, much less more efficiently. Of course you could fake this with CoW, but I'm not sure how you'd do that, short of CoWing the entire heap (by using clone instead of pthreads on Linux, or by doing a bunch of explicit mmap and related calls on other POSIX systems), at which point you're pretty close to just implementing fork or vfork yourself to avoid calling fork or vfork, and unlikely to get it as efficient or as robust as what's already there.

If they're shared, on the other hand, then it seems like it becomes very difficult to implement subinterpreter-safe code, because it's no longer safe to import a module, set a flag, call a registration function, etc.
On Sun, 21 Jun 2015 14:08:09 -0700 Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
First, a minor question: instead of banning fork entirely within subinterpreters, why not just document that it is illegal to do anything between fork and exec in a subinterpreter, except for a very small (but possibly extensible) subset of Python?
It's actually already the case in POSIX that most things are illegal between fork() and exec(). However, to make fork() practical, many libraries or frameworks tend to ignore those problems deliberately.

Regards,
Antoine.
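[A tiny sketch of the fork-then-exec-only pattern being discussed (POSIX-only; the command run is just an example): the child does no Python-level work between fork() and exec(), which sidesteps the restrictions Antoine mentions.]

    import os
    import sys


    def spawn(argv):
        # Fork, then immediately exec in the child -- nothing else runs there.
        pid = os.fork()
        if pid == 0:
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec() itself failed
        return pid


    if __name__ == "__main__":
        pid = spawn([sys.executable, "-c", "print('hello from the child')"])
        os.waitpid(pid, 0)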
On Sun, Jun 21, 2015, at 18:41, Antoine Pitrou wrote:
It's actually already the case in POSIX that most things are illegal between fork() and exec(). However, to make fork() practical, many libraries or frameworks tend to ignore those problems deliberately.
I'm not _entirely_ sure that this applies to single-threaded programs, or even to multi-threaded programs that don't use constructs that will cause problems.

The text is: "A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called. Fork handlers may be established by means of the pthread_atfork() function in order to maintain application invariants across fork() calls."

Note that it uses "may only" (which is ambiguous) rather than "shall only". It could be read that "only [stuff] until exec" is a suggestion of what the child process "may" do, under the circumstances described, to avoid the particular problems being discussed, rather than as a general prohibition.

And the next paragraph is "When the application calls fork() from a signal handler and any of the fork handlers registered by pthread_atfork() calls a function that is not async-signal-safe, the behavior is undefined.", suggesting that the behavior is _not_ likewise undefined when it was not called from a signal handler.

Now, *vfork* is a ridiculous can of worms, which is why nobody uses it anymore, and certainly not within Python.
On Mon, Jun 22, 2015 at 10:56 PM, <random832@fastmail.us> wrote:
On Sun, Jun 21, 2015, at 18:41, Antoine Pitrou wrote:
It's actually already the case in POSIX that most things are illegal between fork() and exec(). However, to make fork() practical, many libraries or frameworks tend to ignore those problems deliberately.
I'm not _entirely_ sure that this applies to single-threaded programs, or even to multi-threaded programs that don't use constructs that will cause problems.
The text is: "A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called. Fork handlers may be established by means of the pthread_atfork() function in order to maintain application invariants across fork() calls."
Note that it uses "may only" (which is ambiguous) rather than "shall only". It could be read that "only [stuff] until exec" is a suggestion of what the child process "may" do, under the circumstances described, to avoid the particular problems being discussed, rather than as a general prohibition.
Yeah, basically the way this works out is:

(a) In practice on mainstream systems you can get away with forking and then doing whatever, so long as none of the threads in the parent process were holding any crucial locks, and the child is prepared for them to have all disappeared.

(b) But, if something does break, then system builders reserve the right to laugh in your face.

You can argue about things being technically ambiguous or whatever, but that's how it works. E.g. if you have a single-threaded program that does a matrix multiply, then forks, and then the child does a matrix multiply, and you run it on OS X linked to Apple's standard libraries, then the child will lock up, and if you report this to Apple they will close it as not-a-bug.
And the next paragraph is "When the application calls fork() from a signal handler and any of the fork handlers registered by pthread_atfork() calls a function that is not async-signal-safe, the behavior is undefined." suggesting that the behavior is _not_ likewise undefined when it was not called from a signal handler.
I wouldn't read anything into this. pthread_atfork registers three handlers, and two of them are run in the parent process, where normally they'd be allowed to call any functions they like.

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
On Sun, Jun 21, 2015 at 3:08 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
First, a minor question: instead of banning fork entirely within subinterpreters, why not just document that it is illegal to do anything between fork and exec in a subinterpreter, except for a very small (but possibly extensible) subset of Python? For example, after fork, you can no longer access any channels, and you also can't use signals, threads, fork again, imports, assignments to builtins, raising exceptions, or a whole host of other things (but of course if you exec an entirely new Python interpreter, it can do any of those things).
Sure. I expect the quickest approach, though, will be to initially have blanket restrictions and then ease them once the core functionality is complete.
C extension modules could just have a flag that marks whether the whole module is fork-safe or not (defaulting to not).
That may make sense independently from my proposal.
So, this allows a subinterpreter to use subprocess (or even multiprocessing, as long as you use the forkserver or spawn mechanism), and it gives code that intentionally wants to do tricky/dangerous things a way to do them, but it avoids all of the problems with accidentally breaking a subinterpreter by forking it and then doing bad things.
Second, a major question: In this proposal, are builtins and the modules map shared, or copied?
If they're copied, it seems like it would be hard to do that even as efficiently as multiprocessing, much less more efficiently. Of course you could fake this with CoW, but I'm not sure how you'd do that, short of CoWing the entire heap (by using clone instead of pthreads on Linux, or by doing a bunch of explicit mmap and related calls on other POSIX systems), at which point you're pretty close to just implementing fork or vfork yourself to avoid calling fork or vfork, and unlikely to get it as efficient or as robust as what's already there.
If they're shared, on the other hand, then it seems like it becomes very difficult to implement subinterpreter-safe code, because it's no longer safe to import a module, set a flag, call a registration function, etc.
I expect that ultimately the builtins will be shared in some fashion. To some extent they already are. sys.modules (and the rest of the import machinery) will mostly not be shared, though I expect that likewise we will have some form of sharing where we can get away with it. -eric
On 21 June 2015 at 07:42, Eric Snow <ericsnowcurrently@gmail.com> wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This proposal is meant to be a shot over the bow, so to speak. I plan on putting together a more complete PEP some time in the future, with content that is more refined along with references to the appropriate online resources.
Feedback appreciated! Offers to help even more so! :)
It occurred to me in the context of another conversation that you (or someone else!) may be able to prototype some of the public API ideas for this using Jython and Vert.x: http://vertx.io/

That idea and some of the initial feedback in this thread also made me realise that it is going to be essential to keep in mind that there are key goals at two different layers here:

* design a compelling, implementation-independent public API for CSP-style programming in Python
* use subinterpreters to implement that API efficiently in CPython

There's a feedback loop between those two goals where limitations on what's feasible in CPython may constrain the design of the public API, and the design of the API may drive enhancements to the existing subinterpreter capability, but we shouldn't lose sight of the fact that they're *separate* goals.

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
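[Purely as an illustration of the first of the two layers Nick describes, here is a toy sketch of what an implementation-independent, CSP-flavoured API might look like from user code. The names (Channel, go) are hypothetical and not part of any proposal; the sketch is backed by ordinary threads and queue.Queue simply to show the shape of the API. The point of the second layer is that CPython could back the same interface with subinterpreters instead.]

    import queue
    import threading


    class Channel:
        # Hypothetical channel type; a real subinterpreter-backed version
        # would only allow explicitly shareable objects through.
        def __init__(self):
            self._q = queue.Queue()

        def send(self, value):
            self._q.put(value)

        def recv(self):
            return self._q.get()


    def go(func, *args):
        # Hypothetical "run this concurrently" primitive, backed by a thread here.
        t = threading.Thread(target=func, args=args, daemon=True)
        t.start()
        return t


    def square(ch, n):
        ch.send(n * n)


    if __name__ == "__main__":
        ch = Channel()
        for i in range(4):
            go(square, ch, i)
        print(sorted(ch.recv() for _ in range(4)))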
On Sun, Jun 21, 2015 at 7:47 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It occurred to me in the context of another conversation that you (or someone else!) may be able to prototype some of the public API ideas for this using Jython and Vert.x: http://vertx.io/
I'll take a look.
That idea and some of the initial feedback in this thread also made me realise that it is going to be essential to keep in mind that there are key goals at two different layers here:
* design a compelling, implementation-independent public API for CSP-style programming in Python
* use subinterpreters to implement that API efficiently in CPython
There's a feedback loop between those two goals where limitations on what's feasible in CPython may constrain the design of the public API, and the design of the API may drive enhancements to the existing subinterpreter capability, but we shouldn't lose sight of the fact that they're *separate* goals.
Yep. I've looked at it that way from the beginning. When I get to the point of writing an actual PEP, I'm thinking it will actually be multiple PEPs covering the different pieces.

I've also been considering how to implement that high-level API in terms of a low-level API (threading vs. _thread), and whether it makes sense to focus less on subinterpreters in that context. At this point it makes sense to me to expose subinterpreters in Python, so for now I was planning on that for the low-level API.

-eric
On Wed, Jun 24, 2015 at 2:01 AM Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Sun, Jun 21, 2015 at 7:47 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It occurred to me in the context of another conversation that you (or someone else!) may be able to prototype some of the public API ideas for this using Jython and Vert.x: http://vertx.io/
I'll take a look.
Note that Vert.x 3 was just released today, which (at least for now) drops support for Python. There is work underway to support it under version 3, but it's using CPython and Py4J, not Jython. You'd need to use Vert.x 2 to get Jython support: http://vertx.io/vertx2
Eric Snow wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
This proposal is meant to be a shot over the bow, so to speak. I plan on putting together a more complete PEP some time in the future, with content that is more refined along with references to the appropriate online resources.
I've heard very little about this since the original announcement (https://lwn.net/Articles/650521/), so I was wondering if this is still an active idea being worked on? Or has it turned out to be too difficult? Sebastian
On Fri, Jun 3, 2016 at 10:06 AM, Sebastian Krause <sebastian@realpath.org> wrote:
Eric Snow wrote:
tl;dr Let's exploit multiple cores by fixing up subinterpreters, exposing them in Python, and adding a mechanism to safely share objects between them.
I've heard very little about this since the original announcement (https://lwn.net/Articles/650521/), so I was wondering if this is still an active idea being worked on? Or has it turned out to be too difficult?
Sorry for the lack of communication. I tabled the project a while back due to lack of time. I'm still planning on writing something up in the near future on where things are at, what's left, and what good ideas may come out of this regardless. -eric
participants (26): Andrew Barnert, Antoine Pitrou, Barry Warsaw, Chris Angelico, Dan O'Reilly, Devin Jeanpierre, Eric Snow, Ethan Furman, Gregory P. Smith, Jonas Wielicki, M.-A. Lemburg, Nathaniel Smith, Nick Coghlan, Oleg Broytman, Oscar Benjamin, random832@fastmail.us, Ron Adam, Rustom Mody, Sebastian Krause, Stefan Behnel, Stephen J. Turnbull, Steven D'Aprano, Sturla Molden, Trent Nelson, Wes Turner, Yury Selivanov