Should we be making so many changes in pursuit of PEP 554?
Hi,
There have been a lot of changes both to the C API and to internal implementations to allow multiple interpreters in a single O/S process.
These changes break backwards compatibility, have a negative performance impact, and cause a lot of churn.
While I'm in favour of PEP 554, or some similar model for parallelism in Python, I am opposed to the changes we are currently making to support it.
What are sub-interpreters?
--------------------------
A sub-interpreter is a logically independent Python process which supports inter-interpreter communication built on shared memory and channels. Passing of Python objects is supported, but only by copying, not by reference. Data can be shared via buffers.
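For concreteness, here is a minimal sketch of what driving one looks like from Python, using the experimental _xxsubinterpreters module that underlies PEP 554 (the exact names and signatures are illustrative and may differ from the final API):

import _xxsubinterpreters as interpreters  # experimental module behind PEP 554

# Run some code in a second interpreter inside this process.
interp = interpreters.create()
interpreters.run_string(interp, "print('hello from another interpreter')")

# Channels pass data by copying; only simple "shareable" objects are allowed.
cid = interpreters.channel_create()
interpreters.channel_send(cid, b"some bytes")
print(interpreters.channel_recv(cid))  # a copy of b'some bytes'

interpreters.destroy(interp)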
How can they be implemented to support parallelism?
---------------------------------------------------
There are two obvious options.

a) Many sub-interpreters in a single O/S process. I will call this the many-to-one model (many interpreters in one O/S process).

b) One sub-interpreter per O/S process. This is what we currently have for multiprocessing. I will call this the one-to-one model (one interpreter in one O/S process).
There seems to be an assumption amongst those working on PEP 554 that the many-to-one model is the only way to support sub-interpreters that can execute in parallel. This isn't true. The one-to-one model has many advantages.
Advantages of the one-to-one model
----------------------------------
1. It's less bug prone. It is much easier to reason about code working in a single address space. Most code assumes
I'm curious where reasoning about address spaces comes into writing Python code? I can't say that address space has ever been a concern to me when coding in Python.
2. It's more secure. Separate O/S processes provide a much stronger boundary between interpreters. This is why some browsers use separate processes for browser tabs.
3. It can be implemented on top of the multiprocessing module, for testing. A more efficient implementation can be developed once sub-interpreters prove useful.
4. The required changes should have no negative performance impact.
5. Third party modules should continue to work as they do now.
6. It takes much less work :)
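As a concrete illustration of point 3, here is a minimal sketch of how the one-to-one model could be prototyped on top of multiprocessing (the Interpreter wrapper, its run() method, and the "result" convention are placeholders I made up, not a proposed API):

import multiprocessing as mp

def _interpreter_main(conn):
    # Each "interpreter" is just a child process; objects only cross by copying.
    ns = {}
    for source in iter(conn.recv, None):
        exec(source, ns)              # run the submitted code in the child process
        conn.send(ns.get("result"))   # send back a copy of the result, if any
    conn.close()

class Interpreter:
    def __init__(self):
        self._conn, child = mp.Pipe()
        self._proc = mp.Process(target=_interpreter_main, args=(child,))
        self._proc.start()

    def run(self, source):
        self._conn.send(source)
        return self._conn.recv()

    def close(self):
        self._conn.send(None)
        self._proc.join()

if __name__ == "__main__":
    interp = Interpreter()
    print(interp.run("result = sum(range(10))"))   # 45, computed in the child process
    interp.close()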
Performance
-----------
Creating O/S processes is usually considered to be slow. Whilst processes are undoubtedly slower to create than threads, the absolute time to create a process is small; well under 1ms on linux.
Creating a new sub-interpreter typically requires importing quite a few modules before any useful work can be done. The time spent doing these imports will dominate the time to create an O/S process or thread. If sub-interpreters are to be used for parallelism, there is no need to have many more sub-interpreters than CPU cores, so the overhead should be small. For additional concurrency, threads or coroutines can be used.
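In the one-to-one model that sizing advice is already easy to follow with the standard library; a minimal sketch (the crunch function and chunk sizes are placeholders for real work):

import os
from concurrent.futures import ProcessPoolExecutor

def crunch(chunk):
    # Placeholder CPU-bound work.
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    chunks = [range(n, n + 1_000_000) for n in range(0, 8_000_000, 1_000_000)]
    # One worker process per core; additional concurrency can come from threads or coroutines.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        total = sum(pool.map(crunch, chunks))
    print(total)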
The one-to-one model is faster as it uses the hardware for interpreter separation, whereas the many-to-one model must use software. Process separation by the hardware virtual memory system has zero cost. Separation done in software needs extra memory reads when doing allocation or deallocation.
Overall, for any interpreter that runs for a second or more, it is likely that the one-to-one model would be faster.
Timings of multiprocessing & threads on my machine (6-core 2019 laptop)
-----------------------------------------------------------------------
from threading import Thread
from multiprocessing import Process

# Threads
def foo():
    pass

def spawn_and_join(count):
    threads = [Thread(target=foo, args=()) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

spawn_and_join(1000)

# Processes
def spawn_and_join(count):
    processes = [Process(target=foo, args=()) for _ in range(count)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

spawn_and_join(1000)
Wall clock time for threads: 86ms. Less than 0.1ms per thread.
Wall clock time for processes: 370ms. Less than 0.4ms per process.
Processes are slower, but plenty fast enough.
Cheers, Mark.
On 6/5/2020 11:11 AM, Edwin Zimmerman wrote:
Advantages of the one-to-one model
----------------------------------
1. It's less bug prone. It is much easier to reason about code working in a single address space. Most code assumes

I'm curious where reasoning about address spaces comes into writing Python code? I can't say that address space has ever been a concern to me when coding in Python.
I don't know enough about Python code with subinterpreters to comment there. But for the C code that makes up much of CPython: it's very difficult to inspect code and know you aren't accidentally sharing objects between interpreters. Eric
On 06/05/2020 07:32 AM, Mark Shannon wrote:
3. It can be implemented on top of the multiprocessing module, for testing. A more efficient implementation can be developed once sub-interpreters prove useful.
Isn't part of the impetus for in-process sub-interpreters the Python-embedded-in-language-X use-case? Isn't multiprocessing a poor solution then? -- ~Ethan~
On 2020-06-05 16:32, Mark Shannon wrote:
Hi,
There have been a lot of changes both to the C API and to internal implementations to allow multiple interpreters in a single O/S process.
These changes cause backwards compatibility changes, have a negative performance impact, and cause a lot of churn.
While I'm in favour of PEP 554, or some similar model for parallelism in Python, I am opposed to the changes we are currently making to support it.
What are sub-interpreters?
--------------------------
A sub-interpreter is a logically independent Python process which supports inter-interpreter communication built on shared memory and channels. Passing of Python objects is supported, but only by copying, not by reference. Data can be shared via buffers.
Here's my biased take on the subject:

Interpreters are contexts in which Python runs. They contain configuration (e.g. the import path) and runtime state (e.g. the set of imported modules). An interpreter is created at Python startup (Py_InitializeEx), and you can create/destroy additional ones with Py_NewInterpreter/Py_EndInterpreter. This is long-standing API that is used, most notably by mod_wsgi.

Many extension modules and some stdlib modules don't play well with the existence of multiple interpreters in a process, mainly because they use process-global state (C static variables) rather than some more granular scope. This tends to result in nasty bugs (C-level crashes) when multiple interpreters are started in parallel (Py_NewInterpreter) or in sequence (several Py_InitializeEx/Py_FinalizeEx cycles). The bugs are similar in both cases.

Whether Python interpreters run sequentially or in parallel, having them work will enable a use case I would like to see: allowing me to call Python code from wherever I want, without thinking about global state. Think calling Python from a utility library that doesn't care about the rest of the application it's used in. I personally call this "the Lua use case", because light-weight, worry-free embedding is an area where Python loses to Lua. (And JS as well—that's a relatively recent development, but much more worrying.)

The part I have been involved in is moving away from process-global state. Process-global state can be made to work, but it is much safer to always default to module-local state (roughly what Python-language's `global` means), and treat process-global state as exceptions one has to think through. The API introduced in PEPs 384, 489, 573 (and future planned ones) aims to make module-local state possible to use, then later easy to use, and the natural default.

Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions. It's probably possible to do similar things with threads or subprocesses, sure, but if these efforts went away, the other issues would remain.

I am not too fond of the term "sub-interpreters", because it implies some kind of hierarchy. Of course, if interpreter creation is exposed to Python, you need some kind of "parent" to start the "child" and get its result when done. Also, due to some practical issues you might (sadly, currently) need some notion of "the main interpreter". But ideally, we can make interpreters entirely independent to allow the "Lua use case". In the end-game of these efforts, I see Py_NewInterpreter transparently calling Py_InitializeEx if global state isn't set up yet, and similarly, Py_EndInterpreter turning the lights off if it's the last one out.
Petr, thanks for clearly stating your interests and goals for subinterpreters. This lays to rest some of my own fears. I am still skeptical that (even after the GIL is separated) they will enable multi-core in ways that multiple processes couldn't handle just as well or better, but your clear statement that *embedding* is the more important use case helps me feel supportive of the concept.

On Tue, Jun 9, 2020 at 6:26 AM Petr Viktorin <encukou@gmail.com> wrote:
On 2020-06-05 16:32, Mark Shannon wrote:
Hi,
There have been a lot of changes both to the C API and to internal implementations to allow multiple interpreters in a single O/S process.
These changes cause backwards compatibility changes, have a negative performance impact, and cause a lot of churn.
While I'm in favour of PEP 554, or some similar model for parallelism in Python, I am opposed to the changes we are currently making to support it.
What are sub-interpreters?
--------------------------
A sub-interpreter is a logically independent Python process which supports inter-interpreter communication built on shared memory and channels. Passing of Python objects is supported, but only by copying, not by reference. Data can be shared via buffers.
Here's my biased take on the subject:
Interpreters are contexts in which Python runs. They contain configuration (e.g. the import path) and runtime state (e.g. the set of imported modules). An interpreter is created at Python startup (Py_InitializeEx), and you can create/destroy additional ones with Py_NewInterpreter/Py_EndInterpreter. This is long-standing API that is used, most notably by mod_wsgi.
Many extension modules and some stdlib modules don't play well with the existence of multiple interpreters in a process, mainly because they use process-global state (C static variables) rather than some more granular scope. This tends to result in nasty bugs (C-level crashes) when multiple interpreters are started in parallel (Py_NewInterpreter) or in sequence (several Py_InitializeEx/Py_FinalizeEx cycles). The bugs are similar in both cases.
Whether Python interpreters run sequentially or in parallel, having them work will enable a use case I would like to see: allowing me to call Python code from wherever I want, without thinking about global state. Think calling Python from an utility library that doesn't care about the rest of the application it's used in. I personally call this "the Lua use case", because light-weight, worry-free embedding is an area where Python loses to Lua. (And JS as well—that's a relatively recent development, but much more worrying.)
The part I have been involved in is moving away from process-global state. Process-global state can be made to work, but it is much safer to always default to module-local state (roughly what Python-language's `global` means), and treat process-global state as exceptions one has to think through. The API introduced in PEPs 384, 489, 573 (and future planned ones) aims to make module-local state possible to use, then later easy to use, and the natural default.
Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions. It's probably possible to do similar things with threads or subprocesses, sure, but if these efforts went away, the other issues would remain.
I am not too fond of the term "sub-interpreters", because it implies some kind of hierarchy. Of course, if interpreter creation is exposed to Python, you need some kind of "parent" to start the "child" and get its result when done. Also, due to some practical issues you might (sadly, currently) need some notion of "the main interpreter". But ideally, we can make interpreters entirely independent to allow the "Lua use case". In the end-game of these efforts, I see Py_NewInterpreter transparently calling Py_InitializeEx if global state isn't set up yet, and similarly, Py_EndInterpreter turning the lights off if it's the last one out.
-- Guido van Rossum (python.org/~guido) Pronouns: he/him (why is my pronoun here?) <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Tue, Jun 9, 2020 at 10:28 PM Petr Viktorin <encukou@gmail.com> wrote:
Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions.
Some changes made for a per-interpreter GIL don't help sub-interpreters much. For example, isolating the memory allocator, including free lists and constants, per sub-interpreter makes each sub-interpreter fatter. I assume Mark is talking about such changes.

Now Victor is proposing to move the dict free list into the per-interpreter state, and the code looks good to me. This is a change for the per-interpreter GIL, but not for sub-interpreters. https://github.com/python/cpython/pull/20645

Should we commit this change to the master branch? Or should we create another branch for such changes?

Regards, -- Inada Naoki <songofacandy@gmail.com>
On 2020-06-10 04:43, Inada Naoki wrote:
On Tue, Jun 9, 2020 at 10:28 PM Petr Viktorin <encukou@gmail.com> wrote:
Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions.
Some changes for per interpreter GIL doesn't help sub interpreters so much. For example, isolating memory allocator including free list and constants between sub interpreter makes sub interpreter fatter. I assume Mark is talking about such changes.
Now Victor proposing move dict free list per interpreter state and the code looks good to me. This is a change for per interpreter GIL, but not for sub interpreters. https://github.com/python/cpython/pull/20645
Should we commit this change to the master branch? Or should we create another branch for such changes?
I think that most of all, the changes aimed at breaking up the GIL need a PEP, so that everyone knows what the changes are actually about -- and especially so that everyone knows the changes are happening. Note that neither PEP 554 (which itself isn't accepted yet) nor PEP 573 is related to breaking up the GIL.
Hi,

I agree that embedding Python is an important use case and that we should try to leak less memory and better isolate multiple interpreters for this use case.

There are multiple projects to enhance code to make it work better with multiple interpreters:

* convert C extension modules to multiphase initialization (PEP 489)
* move C extension module global variables (static ...) into a module state
* convert static types to heap types
* make free lists per interpreter
* etc.

From what I saw, the first side effect is that "suddenly", tests using subinterpreters start to report new reference leaks. Examples of issues and fixes:

* https://github.com/python/cpython/commit/18a90248fdd92b27098cc4db773686a2d10...: reference leak in the init function of the select module
* https://github.com/python/cpython/commit/310e2d25170a88ef03f6fd31efcc899fe06...: reference cycles with encodings and _testcapi misuses PyModule_AddObject()
* https://bugs.python.org/issue40050: _weakref and importlib
* etc.

In fact, none of these bugs is new. I checked a few: the bugs were always there. It's just that previously, nobody paid attention to these leaks. Fixing subinterpreters helps to leak less memory even for the single interpreter (embedded Python) use case. The problem is that Python never tried to clear everything at exit.

One way to see the issue is the number of references at exit using a debug build, on the up-to-date master branch:

$ ./python -X showrefcount -c pass
[18645 refs, 6141 blocks]

Python leaks 18,645 references at exit. Some of the work that I listed is tracked by https://bugs.python.org/issue1635741 which was created in 2007: "Py_Finalize() doesn't clear all Python objects at exit".

Another way to see the issue is:

$ PYTHONMALLOC=malloc valgrind ./python -c pass
(...)
==169747== LEAK SUMMARY:
==169747==    definitely lost: 48 bytes in 2 blocks
==169747==    indirectly lost: 136 bytes in 6 blocks
==169747==      possibly lost: 700,552 bytes in 5,677 blocks
==169747==    still reachable: 5,450 bytes in 48 blocks
==169747==         suppressed: 0 bytes in 0 blocks

Python leaks around 700 KB at exit.

Even if you ignore the "run multiple interpreters in parallel" and PEP 554 use cases, enhancing code to better work with subinterpreters also makes Python a better library to embed in applications and so is useful.

Victor

On Wed, 10 Jun 2020 at 04:46, Inada Naoki <songofacandy@gmail.com> wrote:
On Tue, Jun 9, 2020 at 10:28 PM Petr Viktorin <encukou@gmail.com> wrote:
Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions.
Some changes for per interpreter GIL doesn't help sub interpreters so much. For example, isolating memory allocator including free list and constants between sub interpreter makes sub interpreter fatter. I assume Mark is talking about such changes.
Now Victor proposing move dict free list per interpreter state and the code looks good to me. This is a change for per interpreter GIL, but not for sub interpreters. https://github.com/python/cpython/pull/20645
Should we commit this change to the master branch? Or should we create another branch for such changes?
Regards, -- Inada Naoki <songofacandy@gmail.com>
-- Night gathers, and now my watch begins. It shall not end until my death.
In fairness, if the process is really exiting, the OS should clear that out. Even if it is embedded, the embedding process could just release (or zero out) the entire memory allocation. I personally like plugging those leaks, but it does feel like putting purity over practicality.
Hi Petr, On 09/06/2020 2:24 pm, Petr Viktorin wrote:
On 2020-06-05 16:32, Mark Shannon wrote:
Hi,
There have been a lot of changes both to the C API and to internal implementations to allow multiple interpreters in a single O/S process.
These changes cause backwards compatibility changes, have a negative performance impact, and cause a lot of churn.
While I'm in favour of PEP 554, or some similar model for parallelism in Python, I am opposed to the changes we are currently making to support it.
What are sub-interpreters?
--------------------------
A sub-interpreter is a logically independent Python process which supports inter-interpreter communication built on shared memory and channels. Passing of Python objects is supported, but only by copying, not by reference. Data can be shared via buffers.
Here's my biased take on the subject:
Interpreters are contexts in which Python runs. They contain configuration (e.g. the import path) and runtime state (e.g. the set of imported modules). An interpreter is created at Python startup (Py_InitializeEx), and you can create/destroy additional ones with Py_NewInterpreter/Py_EndInterpreter. This is long-standing API that is used, most notably by mod_wsgi.
Many extension modules and some stdlib modules don't play well with the existence of multiple interpreters in a process, mainly because they use process-global state (C static variables) rather than some more granular scope. This tends to result in nasty bugs (C-level crashes) when multiple interpreters are started in parallel (Py_NewInterpreter) or in sequence (several Py_InitializeEx/Py_FinalizeEx cycles). The bugs are similar in both cases.
Whether Python interpreters run sequentially or in parallel, having them work will enable a use case I would like to see: allowing me to call Python code from wherever I want, without thinking about global state. Think calling Python from an utility library that doesn't care about the rest of the application it's used in. I personally call this "the Lua use case", because light-weight, worry-free embedding is an area where Python loses to Lua. (And JS as well—that's a relatively recent development, but much more worrying.)
This seems like a worthwhile goal. However I don't see why this requires having multiple Python interpreters in a single O/S process.
The part I have been involved in is moving away from process-global state. Process-global state can be made to work, but it is much safer to always default to module-local state (roughly what Python-language's `global` means), and treat process-global state as exceptions one has to think through. The API introduced in PEPs 384, 489, 573 (and future planned ones) aims to make module-local state possible to use, then later easy to use, and the natural default.
I don't agree. Process level state is *much* safer than module-local state.

Suppose two interpreters have both imported the same module. By using O/S processes to keep the interpreters separate, the hardware prevents the two copies of the module from interfering with each other. By sharing an address space the separation is maintained by trust and hoping that third party modules don't have too many bugs.

I don't see how you can claim the latter case is safer.
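To make that concrete, here is a minimal sketch (the module and attribute chosen are arbitrary) showing that with separate processes a module-level global mutated in one interpreter simply cannot leak into another:

from multiprocessing import Process, Queue
import textwrap

def worker(q):
    textwrap.MUTATED = True          # hypothetical attribute; only the child's copy changes
    q.put(hasattr(textwrap, "MUTATED"))

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    p.join()
    print(q.get())                        # True, in the child
    print(hasattr(textwrap, "MUTATED"))   # False, in the parent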
Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions. It's probably possible to do similar things with threads or subprocesses, sure, but if these efforts went away, the other issues would remain.
What other issues? Please be specific.
I am not too fond of the term "sub-interpreters", because it implies some kind of hierarchy. Of course, if interpreter creation is exposed to Python, you need some kind of "parent" to start the "child" and get its result when done. Also, due to some practical issues you might (sadly, currently) need some notion of "the main interpreter". But ideally, we can make interpreters entirely independent to allow the "Lua use case". In the end-game of these efforts, I see Py_NewInterpreter transparently calling Py_InitializeEx if global state isn't set up yet, and similarly, Py_EndInterpreter turning the lights off if it's the last one out.
I'll drop the "sub" from now on :) If each interpreter runs in its own process, then initializing an interpreter and initializing the "global" state are the same thing and wouldn't need a separate step. Cheers, Mark.
On Wed, Jun 10, 2020 at 5:37 AM Mark Shannon <mark@hotpy.org> wrote:
By sharing an address space the separation is maintained by trust and hoping that third party modules don't have too many bugs.
By definition, the use of any third-party module (or even the standard library itself) is by trust and the hope that they don't have too many bugs. Sure, this creates a potential new class of bugs for those who use it, while also offering the chance to find and fix old bugs like Victor found.

Mostly, though, it exposes lots of bad practices that people could mostly get away with as long as the assumption was that everything would always be single-threaded and single-process. The entire software industry is moving away from those assumptions, so it's only logical that Python takes advantage of that shift instead of becoming another legacy language.

In the meantime, modules can explicitly label themselves as single-interpreter only, requiring multiprocessing instead of threading or embedding to work correctly. Modules were more than happy to label themselves as 2.x only for a decade.

-Em
On 6/10/2020 8:33 AM, Mark Shannon wrote:
Hi Petr,
On 09/06/2020 2:24 pm, Petr Viktorin wrote:
On 2020-06-05 16:32, Mark Shannon wrote:
Whether Python interpreters run sequentially or in parallel, having them work will enable a use case I would like to see: allowing me to call Python code from wherever I want, without thinking about global state. Think calling Python from an utility library that doesn't care about the rest of the application it's used in. I personally call this "the Lua use case", because light-weight, worry-free embedding is an area where Python loses to Lua. (And JS as well—that's a relatively recent development, but much more worrying.)
This seems like a worthwhile goal. However I don't see why this requires having multiple Python interpreters in a single O/S process.
I assume it would be so that my code could link with library A, which embeds Python, and library B, which also embeds Python. A and B have no knowledge of each other.
The part I have been involved in is moving away from process-global state. Process-global state can be made to work, but it is much safer to always default to module-local state (roughly what Python-language's `global` means), and treat process-global state as exceptions one has to think through. The API introduced in PEPs 384, 489, 573 (and future planned ones) aims to make module-local state possible to use, then later easy to use, and the natural default.
I don't agree. Process level state is *much* safer than module-local state.
Suppose two interpreters, have both imported the same module. By using O/S processes to keep the interpreters separate, the hardware prevents the two copies of the module from interfering with each other. By sharing an address space the separation is maintained by trust and hoping that third party modules don't have too many bugs.
I don't see how you can claim the later case if safer.
I've always assumed that per-module state meant per-module, per-interpreter. Maybe I've misunderstood, in which case I agree with Mark. If per-module state isn't isolated per interpreter, that sort of kills the multiple interpreter model, in my mind. Eric
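Assuming per-module state is indeed per-module, per-interpreter, the intended isolation can be seen from Python with the experimental _xxsubinterpreters module (a sketch; the FLAG attribute is made up and the module's API may change):

import _xxsubinterpreters as interpreters
import textwrap

interp = interpreters.create()
# The sub-interpreter imports its own copy of textwrap and mutates that copy.
interpreters.run_string(interp, "import textwrap; textwrap.FLAG = True")

print(hasattr(textwrap, "FLAG"))   # False: the main interpreter's module is untouched
interpreters.destroy(interp)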
Eric V. Smith wrote:
On 6/10/2020 8:33 AM, Mark Shannon wrote:

Hi Petr,

On 09/06/2020 2:24 pm, Petr Viktorin wrote:

On 2020-06-05 16:32, Mark Shannon wrote:

Whether Python interpreters run sequentially or in parallel, having them work will enable a use case I would like to see: allowing me to call Python code from wherever I want, without thinking about global state. Think calling Python from a utility library that doesn't care about the rest of the application it's used in. I personally call this "the Lua use case", because light-weight, worry-free embedding is an area where Python loses to Lua. (And JS as well—that's a relatively recent development, but much more worrying.)

This seems like a worthwhile goal. However I don't see why this requires having multiple Python interpreters in a single O/S process.

I assume it would be so that my code could link with library A, which embeds Python, and library B, which also embeds Python. A and B have no knowledge of each other.

The part I have been involved in is moving away from process-global state. Process-global state can be made to work, but it is much safer to always default to module-local state (roughly what Python-language's global means), and treat process-global state as exceptions one has to think through. The API introduced in PEPs 384, 489, 573 (and future planned ones) aims to make module-local state possible to use, then later easy to use, and the natural default.

I don't agree. Process level state is much safer than module-local state. Suppose two interpreters have both imported the same module. By using O/S processes to keep the interpreters separate, the hardware prevents the two copies of the module from interfering with each other. By sharing an address space the separation is maintained by trust and hoping that third party modules don't have too many bugs. I don't see how you can claim the latter case is safer.

I've always assumed that per-module state meant per-module, per-interpreter.
It _can_, but it isn't guaranteed because we are talking about C here and people do "interesting" things when they are handed that much flexibility. 😉 Plus a bunch of work has been done in the last few years to make per-interpreter state for modules supported. -Brett
Maybe I've misunderstood, in which case I agree with Mark. If per-module state isn't isolated per interpreter, that sort of kills the multiple interpreter model, in my mind. Eric
On 10 Jun 2020, at 14:33, Mark Shannon <mark@hotpy.org> wrote:
Hi Petr,
On 09/06/2020 2:24 pm, Petr Viktorin wrote:
On 2020-06-05 16:32, Mark Shannon wrote:
Hi,
There have been a lot of changes both to the C API and to internal implementations to allow multiple interpreters in a single O/S process.
These changes cause backwards compatibility changes, have a negative performance impact, and cause a lot of churn.
While I'm in favour of PEP 554, or some similar model for parallelism in Python, I am opposed to the changes we are currently making to support it.
What are sub-interpreters?
--------------------------
A sub-interpreter is a logically independent Python process which supports inter-interpreter communication built on shared memory and channels. Passing of Python objects is supported, but only by copying, not by reference. Data can be shared via buffers.

Here's my biased take on the subject:

Interpreters are contexts in which Python runs. They contain configuration (e.g. the import path) and runtime state (e.g. the set of imported modules). An interpreter is created at Python startup (Py_InitializeEx), and you can create/destroy additional ones with Py_NewInterpreter/Py_EndInterpreter. This is long-standing API that is used, most notably by mod_wsgi.

Many extension modules and some stdlib modules don't play well with the existence of multiple interpreters in a process, mainly because they use process-global state (C static variables) rather than some more granular scope. This tends to result in nasty bugs (C-level crashes) when multiple interpreters are started in parallel (Py_NewInterpreter) or in sequence (several Py_InitializeEx/Py_FinalizeEx cycles). The bugs are similar in both cases.

Whether Python interpreters run sequentially or in parallel, having them work will enable a use case I would like to see: allowing me to call Python code from wherever I want, without thinking about global state. Think calling Python from a utility library that doesn't care about the rest of the application it's used in. I personally call this "the Lua use case", because light-weight, worry-free embedding is an area where Python loses to Lua. (And JS as well—that's a relatively recent development, but much more worrying.)

The part I have been involved in is moving away from process-global state. Process-global state can be made to work, but it is much safer to always default to module-local state (roughly what Python-language's `global` means), and treat process-global state as exceptions one has to think through. The API introduced in PEPs 384, 489, 573 (and future planned ones) aims to make module-local state possible to use, then later easy to use, and the natural default.

Relatively recently, there is an effort to expose interpreter creation & finalization from Python code, and also to allow communication between them (starting with something rudimentary, sharing buffers). There is also a push to explore making the GIL per-interpreter, which ties in to moving away from process-global state. Both are interesting ideas, but (like banishing global state) not the whole motivation for changes/additions. It's probably possible to do similar things with threads or subprocesses, sure, but if these efforts went away, the other issues would remain.

I am not too fond of the term "sub-interpreters", because it implies some kind of hierarchy. Of course, if interpreter creation is exposed to Python, you need some kind of "parent" to start the "child" and get its result when done. Also, due to some practical issues you might (sadly, currently) need some notion of "the main interpreter". But ideally, we can make interpreters entirely independent to allow the "Lua use case". In the end-game of these efforts, I see Py_NewInterpreter transparently calling Py_InitializeEx if global state isn't set up yet, and similarly, Py_EndInterpreter turning the lights off if it's the last one out.
This seems like a worthwhile goal. However I don't see why this requires having multiple Python interpreters in a single O/S process.
The mod_wsgi use case seems to require this (he writes without having looked at its source code).

I have another possible use case: independent plugins written in Python in native applications written in other languages.

That doesn't mean it is worthwhile to complicate the CPython code base for these. I have no opinion on that, both because I haven't been active for a while and because I haven't looked at the impact the current work has had.

Ronald
—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
Hi, as a user, the "Lua use case" is right what I need at work. I realize that for Python this is a niche case, and most users don't need any of this, but I hope it will be useful to understand why having multiple independent interpreters in a single process can be an essential feature.

The company I work for develops and sells a big C++ financial system with Python embedded, providing critical flexibility to our customers. Python is used as a scripting language, with most cases having C++ calling a Python script itself calling other C++ functions. Most of the time those scripts are in workloads that are I/O bound or where the time spent in Python is negligible. But some workloads are really CPU bound and those tend to become GIL-bound, even with massive use of C++ helpers; some to the point that GIL contention makes up over 80% of running time, instead of 1-5%. And every time our customers upgrade their servers, they buy machines with more cores and the contention problem worsens.

Obviously, our use case calls for per-thread separate interpreters: server processes run continuously and already consume gigabytes of RAM, so startup time or increased memory consumption are not issues. Shared state also is not needed; actually we try to avoid it as much as possible. In the end, removing process-global state is extremely interesting for us.

Thank you, Riccardo
Hi Riccardo, On 10/06/2020 5:51 pm, Riccardo Ghetta wrote:
Hi, as an user, the "lua use case" is right what I need at work. I realize that for python this is a niche case, and most users don't need any of this, but I hope it will useful to understand why having multiple independent interpreters in a single process can be an essential feature. The company I work for develop and sells a big C++ financial system with python embedded, providing critical flexibility to our customers. Python is used as a scripting language, with most cases having C++ calling a python script itself calling other C++ functions. Most of the times those scripts are in workloads I/O bound or where the time spent in python is negligible.

But some workloads are really cpu bound and those tend to become GIL-bound, even with massive use of C++ helpers; some to the point that GIL-contention makes up over 80% of running time, instead of 1-5%. And every time our customers upgrade their server, they buy machines with more cores and the contention problem worsens.
Different interpreters need to operate in their own isolated address space, or there will be horrible race conditions. Regardless of whether that separation is done in software or hardware, it has to be done. Whenever data contained in a Python object is passed to C/C++ code, there are two ways to do it. Either pass the whole object, or a reference to the underlying data. By passing the underlying data, you can release the GIL, and your problem is solved, or at least alleviated. If you can't do that, and must pass the object, then all accesses to that object must be protected by a per-interpreter lock. That's because interpreters need to operate serially, or you'll get horrible race conditions. If you need to share objects across threads, then there will be contention, regardless of how many interpreters there are, or which processes they are in.
Obviously, our use case calls for per-thread separate interpreters: server processes run continuously and already consume gigabytes of RAM, so startup time or increased memory consumption are not issues. Shared state also is not needed, actually we try to avoid it as much as possible. In the end, removing process-global state is extremely interesting for us.
If the additional resource consumption is irrelevant, what's the objection to spinning up new processes?

Cheers, Mark.

P.S. Do try passing the underlying data, not the whole object, and dropping the GIL when calling back into C++. It can be effective. CPython already drops the GIL for some computational workloads implemented in C, like compression.
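For example, because zlib.compress releases the GIL while it works, plain threads already give parallel speedups for that kind of call (a rough sketch; the buffer count and sizes are arbitrary):

import zlib
from concurrent.futures import ThreadPoolExecutor

# Eight 8 MB buffers of dummy data; zlib.compress drops the GIL while compressing,
# so the pool's threads can run on multiple cores at once.
buffers = [bytes(8_000_000) for _ in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(zlib.compress, buffers))

print([len(c) for c in compressed])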
Hello Mark, and thanks for your suggestions. However, I'm afraid I haven't explained our use of python well enough.

On 11/06/2020 12:59, Mark Shannon wrote:

If you need to share objects across threads, then there will be contention, regardless of how many interpreters there are, or which processes they are in.

As a rule, we don't use that many python objects. Most of the time a script calls C++ functions, operating on C++ data. Perhaps with a small snippet I will explain myself better:

hcpi='INFLEUR'
n_months=3
base_infl=hs_base(hcpi, n_months, 0)
im=hs_fs(hcpi,'sia','m',n_months,0)
ip=hs_fs(hcpi,'sia','m',n_months-1,0)
ir=im+(hs_range()[1].day-1)/month_days(hs_range()[1])*(ip-im)
return ir/base_infl  # double

This is a part of an inflation estimation used in pricing an inflation-linked bond. hcpi and n_months are really parameters of the script and the hs_ functions are all implemented in C++. Some are very small and fast like hs_range, others are much more complex and slow (hs_fs), so we wrap them with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS. As you see, here python is used more to direct C++ than manipulate objects. At GUI level things work a bit differently, but here we just tried to avoid building and destroying a lot of ephemeral python objects (unneeded anyway, because all subsequent processing is done by C++). This python script is only a part of a larger processing done in parallel by several threads, each operating on distinct instruments. Evaluating an instrument could involve zero, one, or several of those scripts. During evaluation an instrument is bound to a single thread, so from the point of view of python threads share nothing.

If the additional resource consumption is irrelevant, what's the objection to spinning up new processes?

The additional resource consumption of a new python interpreter is irrelevant, but the process as a whole needs a lot of extra data making a new process rather costly. Plus there are issues of licensing, synchronization and load balancing that are much easier to resolve (for our system, at least) with threads than processes.

Still, we /do/ use multiple processes, but those tend to be across administrative boundaries, or for very specific issues.

Ciao, Riccardo
On 11/06/2020 2:50 pm, Riccardo Ghetta wrote:
Hello Mark, and thanks for your suggestions. However, I'm afraid I haven't explained our use of python well enough.
On 11/06/2020 12:59, Mark Shannon wrote:

If you need to share objects across threads, then there will be contention, regardless of how many interpreters there are, or which processes they are in.

As a rule, we don't use that many python objects. Most of the time a script calls C++ functions, operating on C++ data. Perhaps with a small snippet I will explain myself better:

hcpi='INFLEUR'
n_months=3
base_infl=hs_base(hcpi, n_months, 0)
im=hs_fs(hcpi,'sia','m',n_months,0)
ip=hs_fs(hcpi,'sia','m',n_months-1,0)
ir=im+(hs_range()[1].day-1)/month_days(hs_range()[1])*(ip-im)
return ir/base_infl  # double

This is a part of an inflation estimation used in pricing an inflation-linked bond. hcpi and n_months are really parameters of the script and the hs_ functions are all implemented in C++. Some are very small and fast like hs_range, others are much more complex and slow (hs_fs), so we wrap them with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS. As you see, here python is used more to direct C++ than manipulate objects. At GUI level things work a bit differently, but here we just tried to avoid building and destroying a lot of ephemeral python objects (unneeded anyway, because all subsequent processing is done by C++). This python script is only a part of a larger processing done in parallel by several threads, each operating on distinct instruments. Evaluating an instrument could involve zero, one, or several of those scripts. During evaluation an instrument is bound to a single thread, so from the point of view of python threads share nothing.

If the additional resource consumption is irrelevant, what's the objection to spinning up new processes?

The additional resource consumption of a new python interpreter is irrelevant, but the process as a whole needs a lot of extra data making a new process rather costly.
Starting a new process is cheap. On my machine, starting a new Python process takes under 1ms and uses a few Mbytes. The overhead largely comes from what you do with the process. The additional cost of starting a new interpreter is the same regardless of whether it is in the same process or not. There should be no need to start a new application process for a new Python interpreter.
Plus there are issues of licensing, synchronization and load balancing that are much easier to resolve (for our system, at least) with threads than processes.
Would this prevent CPython starting new processes, or is this just for processes managed by your application?
Still, we /do/ use multiple processes, but those tend to be across administrative boundaries, or for very specific issues.
Ciao, Riccardo
On Fri, 12 Jun 2020 at 09:47, Mark Shannon <mark@hotpy.org> wrote:
Starting a new process is cheap. On my machine, starting a new Python process takes under 1ms and uses a few Mbytes.
Is that on Windows or Unix? Traditionally, process creation has been costly on Windows, which is why threads, and in-process solutions in general, tend to be more common on that platform. I haven't done experiments recently, but I do tend to avoid multiprocess-type solutions on Windows "just in case". I know that evaluating a new feature based on unsubstantiated assumptions informed by "it used to be like this" is ill-advised, but so is assuming that everything will be OK based on experience on a single platform :-)

Personally, I'm in favour of multiple interpreter support mostly for the same reasons as Petr (easy embedding, in the style of Lua). Exposing interpreters to Python, and per-interpreter GILs, strike me as really interesting areas for experimentation, but I'm reserving final judgement on the practical benefits until we have working code and some practical experience. The incremental costs for those are low, though, as the bulk of the work is actually needed for the "easy embedding" use case.

Paul
On 6/12/2020 5:08 AM, Paul Moore wrote:
On Fri, 12 Jun 2020 at 09:47, Mark Shannon <mark@hotpy.org> wrote:
Starting a new process is cheap. On my machine, starting a new Python process takes under 1ms and uses a few Mbytes.

Is that on Windows or Unix? Traditionally, process creation has been costly on Windows, which is why threads, and in-process solutions in general, tend to be more common on that platform. I haven't done experiments recently, but I do tend to avoid multiprocess-type solutions on Windows "just in case". I know that evaluating a new feature based on unsubstantiated assumptions informed by "it used to be like this" is ill-advised, but so is assuming that everything will be OK based on experience on a single platform :-)

Here's a test on Windows 10, 4 logical cpus, 8 GB of ram:
timeit.timeit("""multiprocessing.Process(target=exit).start()""",number=100, globals=globals()) 0.6297528999999997 timeit.timeit("""multiprocessing.Process(target=exit).start()""",number=1000, globals=globals()) 40.281721199999964
Or this way:
timeit.timeit("""os.system('python.exe -c "exit()"')""",number=100, globals=globals()) 17.461259299999995
--Edwin
For comparison, on a single core linux cloud server with 512 mb of ram:
timeit.timeit("""multiprocessing.Process(target=exit).start()""",number=100, globals=globals()) 0.354354709998006 timeit.timeit("""multiprocessing.Process(target=exit).start()""",number=1000, globals=globals()) 3.847851719998289 So yeah, process creation is still rather costly on Windows.
Hi Edwin,

Thanks for providing some concrete numbers. Is it expected that creating 100 processes takes 6.3ms per process, but that creating 1000 processes takes 40ms per process? That's over 6 times as long in the latter case.

Cheers, Mark.
On Fri, Jun 12, 2020 at 7:19 AM Mark Shannon <mark@hotpy.org> wrote:
Hi Edwin,
Thanks for providing some concrete numbers. Is it expected that creating 100 processes takes 6.3ms per process, but that creating 1000 process takes 40ms per process? That's over 6 times as long in the latter case.
Cheers, Mark.
I was wondering that too; some tests show that process creation/destruction starts to seriously bog down after a few hundred in a row. I guess it's hitting some resource limits it has to clean up, though creating hundreds of processes at once sounds like an antipattern that doesn't really deserve too much consideration. It would be rare that fork is more than a negligible part of any workload.

(With A/V on, though, it's _much_ slower out the gate. I'm seeing over 100ms per process with Kaspersky running.)

Em
My previous timings were slightly inaccurate, as they compared spawning processes on Windows to forking on Linux. Also, I changed my timing code to run all processes synchronously, to avoid hitting resource limits.

Updated Windows (Windows 7 this time, on a four core processor):

timeit.timeit('x=multiprocessing.Process(target=exit);x.start();x.join()', number=1000, globals=globals())
84.7111053659259

Updated Linux with spawn (single core processor):

ctx = multiprocessing.get_context('spawn')
timeit.timeit('x=ctx.Process(target=exit);x.start();x.join()', number=1000, globals=globals())
60.01154333699378

Updated Linux with fork:

timeit.timeit('x=multiprocessing.Process(target=exit);x.start();x.join()', number=1000, globals=globals())
4.402019854984246

Compare this to subinterpreters on my linux machine:

timeit.timeit('s=_xxsubinterpreters.create();_xxsubinterpreters.destroy(s)', number=1000, globals=globals())
13.47043095799745

This shows that if speed is all that matters, multiprocessing comes out way ahead of subinterpreters on linux, but way behind on Windows. I still need to time subinterpreters on Windows for the full picture, but that will have to wait until tomorrow.

--Edwin
On Sat, Jun 13, 2020 at 3:50 AM Edwin Zimmerman <edwin@211mainstreet.net> wrote:
My previous timings were slightly inaccurate, as they compared spawning processes on Windows to forking on Linux. Also, I changed my timing code to run all process synchronously, to avoid hitting resource limits.
Updated Windows (Windows 7 this time, on a four core processor):
timeit.timeit('x=multiprocessing.Process(target=exit);x.start();x.join()', number=1000,globals = globals()) 84.7111053659259
Thanks, I was actually going to ask about joining the processes, since you don't really get a good indication of timings from asynchronous operations like that. Another interesting data point is that starting and joining in batches makes a fairly huge difference to performance, at least on my Linux system. Starting with your example and rescaling the number by ten to compensate for performance differences:
timeit.timeit('x=multiprocessing.Process(target=exit);x.start();x.join()', number=10000,globals = globals()) 14.261007152497768
Just for completeness and consistency, confirmed that adding a list comp around it doesn't change the timings:
timeit.timeit('xx=[multiprocessing.Process(target=exit) for _ in range(1)];[x.start() for x in xx];[x.join() for x in xx]', number=10000,globals = globals()) 14.030426062643528
But doing a hundred at a time and then joining them all cuts the time in half!
timeit.timeit('xx=[multiprocessing.Process(target=exit) for _ in range(100)];[x.start() for x in xx];[x.join() for x in xx]', number=100,globals = globals()) 5.470761131495237
The difference is even more drastic with spawn, although since it's slower, I also lowered the number of iterations.
ctx = multiprocessing.get_context('spawn')
timeit.timeit('x=ctx.Process(target=exit);x.start();x.join()', number=1000,globals = globals())
40.82687543518841
timeit.timeit('xx=[ctx.Process(target=exit) for _ in range(100)];[x.start() for x in xx];[x.join() for x in xx]', number=10,globals = globals())
8.566341979429126
Would be curious to know if that's the same on Windows. ChrisA
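For anyone who wants to reproduce that batching effect outside timeit, here is a rough sketch (not from the thread; the noop target, counts and batch size are illustrative) that times sequential start/join against batched start/join:

# Sketch: sequential vs batched Process start/join.
import time
from multiprocessing import Process

def noop():
    pass

def sequential(n):
    # Start and join one process at a time.
    for _ in range(n):
        p = Process(target=noop)
        p.start()
        p.join()

def batched(n, batch=100):
    # Start a whole batch, then join it; startups overlap across cores.
    for _ in range(n // batch):
        procs = [Process(target=noop) for _ in range(batch)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

if __name__ == "__main__":
    for fn in (sequential, batched):
        t0 = time.perf_counter()
        fn(1000)
        print(fn.__name__, round(time.perf_counter() - t0, 3), "seconds")

Edwin's reply below makes the same point: the batched starts run in parallel, so the per-process cost has not actually changed.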
On 6/12/2020 2:17 PM, Chris Angelico wrote:
Would be curious to know if that's the same on Windows.
Yea, it's the same. Watch your cpu utilization, and you will realize that your list comprehension is parallelizing the process startups.
On 12Jun2020 1008, Paul Moore wrote:
On Fri, 12 Jun 2020 at 09:47, Mark Shannon <mark@hotpy.org> wrote:
Starting a new process is cheap. On my machine, starting a new Python process takes under 1ms and uses a few Mbytes.
Is that on Windows or Unix? Traditionally, process creation has been costly on Windows, which is why threads, and in-process solutions in general, tend to be more common on that platform. I haven't done experiments recently, but I do tend to avoid multiprocess-type solutions on Windows "just in case". I know that evaluating a new feature based on unsubstantiated assumptions informed by "it used to be like this" is ill-advised, but so is assuming that everything will be OK based on experience on a single platform :-)
It's still like that, though I'm actively involved in trying to get it improved. However, it's unlikely at this point to ever get to equivalence with Unix - Windows just sets up too many features (security, isolation, etc.) at the process boundary rather than other parts of the lifecycle. It's also *incredibly arrogant* to insist that users rewrite their applications to suit Python, rather than us doing the work to fit their needs. That's not how being a libraries/runtime developer works. Our responsibility is to humbly do the work that will benefit our users, not to find ways to put in the least possible effort and use the rest for blame-shifting. Some of us do much more talking than listening, and it does not pass unnoticed. Cheers, Steve
Hi Steve, On 12/06/2020 12:43 pm, Steve Dower wrote:
> It's also *incredibly arrogant* to insist that users rewrite their applications to suit Python, rather than us doing the work to fit their needs.
I don't think anyone is suggesting that users rewrite their code to work with existing features. Using any new feature is going to take some work on the user's part, and we are talking about a new feature.
Developer time is a finite resource and any time spent helping one set of users is not spent helping others. Likewise, optimizing for one use case may be hurting performance for other use cases. Personally, I have no idea what is important for other people, but I would like any discussion to have sound technical underpinnings. Once we have those, it becomes possible to have a meaningful discussion. Cheers, Mark.
On 11/06/2020 2:50 pm, Riccardo Ghetta wrote:
On 11/06/2020 12:59, Mark Shannon wrote:
If the additional resource consumption is irrelevant, what's the objection to spinning up a new process?
The additional resource consumption of a new python interpreter is irrelevant, but the process as a whole needs a lot of extra data, making a new process rather costly. Plus there are issues of licensing, synchronization and load balancing that are much easier to resolve (for our system, at least) with threads than processes.
On 12/06/2020 10:45, Mark Shannon wrote:
Starting a new process is cheap. On my machine, starting a new Python process takes under 1ms and uses a few Mbytes. Would this prevent CPython starting new processes, or is this just for processes managed by your application?
Sorry, I wasn't clear here. I was talking about starting one of our server processes, /with python embedded/. Since python routines are called by our C++ code and need to call other C++ routines, it cannot work alone and is surrounded by a lot of data needed for the C++ part. A python interpreter by itself would be like a cpu chip for someone needing a server. A critical component, sure, but only a small part of the whole. It is only for application processes, but because python is always embedded there is little practical difference.
I hope not to come across as arrogant or dismissive, but can we take it for granted that multiprocessing is not a viable solution for our application, or at least that it would be impractical and too expensive to rebuild it from scratch to change paradigm? At the same time, I realize that ours is a somewhat niche case and it may not be deemed interesting for python evolution. I just wanted to present a real world example of someone using python today who would benefit immensely if python permitted multiple, separate interpreters in a single process. Or any other solution removing the bottlenecks that currently so limit multithreaded python performance. Ciao, Riccardo
On Fri, Jun 12, 2020 at 2:49 AM Mark Shannon <mark@hotpy.org> wrote:
The overhead largely comes from what you do with the process. The additional cost of starting a new interpreter is the same regardless of whether it is in the same process or not.
FWIW, there's more to it than that:
* there is some overhead to starting the runtime and main interpreter that does not apply to additional in-process interpreters
* I don't see why we shouldn't be able to come up with a strategy for interpreter startup that does not involve copying or sharing a lot of interpreter state, thus reducing startup time and memory consumption
* I'm guessing that re-importing builtin/extension modules is faster than importing them anew in a separate process
-eric
Hi Eric, On 12/06/2020 4:17 pm, Eric Snow wrote:
On Fri, Jun 12, 2020 at 2:49 AM Mark Shannon <mark@hotpy.org> wrote:
The overhead largely comes from what you do with the process. The additional cost of starting a new interpreter is the same regardless of whether it is in the same process or not.
FWIW, there's more to it than that:
* there is some overhead to starting the runtime and main interpreter that does not apply to additional in-process interpreters
You seem to be implying that there would be more overhead for a new interpreter that operates in a different O/S process. What would that be?
* I don't see why we shouldn't be able to come up with a strategy for interpreter startup that does not involve copying or sharing a lot of interpreter state, thus reducing startup time and memory consumption
Indeed, that would be beneficial regardless of which process the interpreter is in.
* I'm guessing that re-importing builtin/extension modules is faster than importing them anew in a separate process
Each new interpreter needs to re-import the modules. The overhead could be reduced by making more of the module immutable, allowing some sharing. For linux, at least, that benefit would apply to multiple processes as well. Cheers, Mark.
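As a rough way to see that per-interpreter import cost, here is a hedged sketch using the provisional _xxsubinterpreters module that the earlier timings already rely on (it is a private API and may change; the particular stdlib modules imported are just examples):

# Sketch: each new interpreter runs its own imports, so the import work is
# paid per interpreter, not once per process.
import time
import _xxsubinterpreters as interpreters

def timed_create(script):
    t0 = time.perf_counter()
    interp = interpreters.create()
    interpreters.run_string(interp, script)   # executes in the new interpreter
    interpreters.destroy(interp)
    return (time.perf_counter() - t0) * 1000

print("create/destroy only:  %.2f ms" % timed_create("pass"))
print("plus stdlib imports:  %.2f ms" % timed_create("import json, decimal, datetime"))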
On 6/11/2020 6:59 AM, Mark Shannon wrote:
Hi Riccardo,
On 10/06/2020 5:51 pm, Riccardo Ghetta wrote:
Hi, as a user, the "lua use case" is exactly what I need at work. I realize that for python this is a niche case, and most users don't need any of this, but I hope it will be useful to understand why having multiple independent interpreters in a single process can be an essential feature. The company I work for develops and sells a big C++ financial system with python embedded, providing critical flexibility to our customers. Python is used as a scripting language, with most cases having C++ calling a python script that itself calls other C++ functions. Most of the time those scripts are in workloads that are I/O bound or where the time spent in python is negligible. But some workloads are really cpu bound and those tend to become GIL-bound, even with massive use of C++ helpers; some to the point that GIL-contention makes up over 80% of running time, instead of 1-5%. And every time our customers upgrade their servers, they buy machines with more cores and the contention problem worsens.
Different interpreters need to operate in their own isolated address space, or there will be horrible race conditions. Regardless of whether that separation is done in software or hardware, it has to be done.
I realize this is true now, but why must it always be true? Can't we fix this? At least one solution has been proposed: passing around a pointer to the current interpreter. I realize there are issues here, like callbacks and signals, that will need to be worked out. But I don't think it's axiomatically true that we'll always have race conditions with multiple interpreters in the same address space. Eric
On 12/06/2020 12:55, Eric V. Smith wrote:
On 6/11/2020 6:59 AM, Mark Shannon wrote:
Different interpreters need to operate in their own isolated address space, or there will be horrible race conditions. Regardless of whether that separation is done in software or hardware, it has to be done.
I realize this is true now, but why must it always be true? Can't we fix this? At least one solution has been proposed: passing around a pointer to the current interpreter. I realize there are issues here, like callbacks and signals, that will need to be worked out. But I don't think it's axiomatically true that we'll always have race conditions with multiple interpreters in the same address space.
Eric
Axiomatically? No, but let me rise to the challenge. If (1) interpreters manage the life-cycle of objects, and (2) a race condition arises when the life-cycle or state of an object is accessed by the interpreter that did not create it, and (3) an object will sometimes be passed to an interpreter that did not create it, and (4) an interpreter with a reference to an object will sometimes access its life-cycle or state, then (5) a race condition will sometimes arise. This seems to be true (as a deduction) if all the premises hold. (1) and (2) are true in CPython as we know it. (3) is prevented (completely?) by the Python API, but not at all by the C API. (4) is implicit in an interpreter having access to an object, the way CPython and its extensions are written, so (5) follows in the case that the C API is used. You could change (1) and/or (2), maybe (4). "Passing around a pointer to the current interpreter" sounds like an attempt to break (2) or maybe (4). But I don't understand "current". What you need at any time is the interpreter (state and life-cycle manager) for the object you're about to handle, so that the receiving interpreter can delegate the action, instead of crashing ahead itself. This suggests a reference to the interpreter must be embedded in each object, but it could be implicit in the memory address. There is then still an issue that the owning interpreter has to be thread-safe (if there are threads) in the sense that it can serialise access to object state or life-cycle. If serialisation is by a GIL, the receiving interpreter must take the GIL of the owning interpreter, and we are somewhat back where we started. Note that the "current interpreter" is not a function of the current thread (or vice-versa). The current thread is running in both interpreters, and by hypothesis, so are the competing threads. Can I just point out that, while most of this argument concerns a particular implementation, we have a reason in Python (the language) for an interpreter construct: it holds the current module context, so that whenever code is executing, we can give definite meaning to the 'import' statement. Here "current interpreter" does have a meaning, and I suggest it needs to be made a property of every function object as it is defined, and picked up when the execution frame is created. This *may* help with the other, internal, use of interpreter, for life-cycle and state management, because it provides a recognisable point (function call) where one may police object ownership, but that isn't why you need it. Jeff Allen
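As a purely illustrative Python-level analogy for premise (2) (this is not the C-level refcount machinery, just the same read-modify-write failure mode): two threads updating shared state without a common lock lose updates, which is what unsynchronized life-cycle accounting across interpreters would amount to.

# Analogy only: a plain int stands in for per-object bookkeeping, threads
# stand in for interpreters touching an object they do not own.
import sys
import threading

sys.setswitchinterval(1e-6)   # switch threads aggressively to expose the race

count = 0

def bump(n):
    global count
    for _ in range(n):
        count += 1            # read-modify-write; not atomic across threads

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(count)                  # usually well below 400000: updates were lost

With one lock serialising the updates (one GIL per address space) the count is exact, which is the sense in which serialising via the owning interpreter's GIL puts us "somewhat back where we started".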
On 6/17/2020 12:07 PM, Jeff Allen wrote:
Axiomatically? No, but let me rise to the challenge.
If (1) interpreters manage the life-cycle of objects, and (2) a race condition arises when the life-cycle or state of an object is accessed by the interpreter that did not create it, and (3) an object will sometimes be passed to an interpreter that did not create it, and (4) an interpreter with a reference to an object will sometimes access its life-cycle or state, then (5) a race condition will sometimes arise. This seems to be true (as a deduction) if all the premises hold.
(1) and (2) are true in CPython as we know it. (3) is prevented (completely?) by the Python API, but not at all by the C API. (4) is implicit in an interpreter having access to an object, the way CPython and its extensions are written, so (5) follows in the case that the C API is used. You could change (1) and/or (2), maybe (4).
I'm assuming that passing an object between interpreters would not be supported. It would require that the object somehow be marshalled between interpreters, so that no object would be operated on outside the interpreter that created it. So 2-5 couldn't happen in valid code.
"Passing around a pointer to the current interpreter" sounds like an attempt to break (2) or maybe (4). But I don't understand "current". What you need at any time is the interpreter (state and life-cycle manager) for the object you're about to handle, so that the receiving interpreter can delegate the action, instead of crashing ahead itself. This suggests a reference to the interpreter must be embedded in each object, but it could be implicit in the memory address.
Sorry for being loose with terms. If I want to create an interpreter and execute it, then I'd allocate and initialize an interpreter state object, then call it, passing the interpreter state object in to whatever Python functions I want to call. They would in turn pass that pointer to whatever they call, or access the state through it directly. That pointer is the "current interpreter".
There is then still an issue that the owning interpreter has to be thread-safe (if there are threads) in the sense that it can serialise access to object state or life-cycle. If serialisation is by a GIL, the receiving interpreter must take the GIL of the owning interpreter, and we are somewhat back where we started. Note that the "current interpreter" is not a function of the current thread (or vice-versa). The current thread is running in both interpreters, and by hypothesis, so are the competing threads.
Agreed that an interpreter shouldn't belong to a thread, but since an interpreter couldn't access objects of another interpreter, there'd be no need for cross-interpreter locking. There would be a GIL per interpreter, protecting access to that interpreter's state.
Can I just point out that, while most of this argument concerns a particular implementation, we have a reason in Python (the language) for an interpreter construct: it holds the current module context, so that whenever code is executing, we can give definite meaning to the 'import' statement. Here "current interpreter" does have a meaning, and I suggest it needs to be made a property of every function object as it is defined, and picked up when the execution frame is created. This *may* help with the other, internal, use of interpreter, for life-cycle and state management, because it provides a recognisable point (function call) where one may police object ownership, but that isn't why you need it.
There's a lot of state per interpreter, including the module state. See "struct _is" in Include/internal/pycore_interp.h. Eric
On 17/06/2020 19:28, Eric V. Smith wrote:
I'm assuming that passing an object between interpreters would not be supported. It would require that the object somehow be marshalled between interpreters, so that no object would be operated on outside the interpreter that created it. So 2-5 couldn't happen in valid code.
The Python level doesn't support it, prevents it I think, and perhaps the implementation doesn't support it, but nothing can stop C actually doing it. I would agree that with sufficient discipline in the code it should be possible to prevent the worlds from colliding. But it is difficult, so I think that is why Mark is arguing for a separate address space. Marshalling the value across is supported, but that's just the value, not a shared object.
Sorry for being loose with terms. If I want to create an interpreter and execute it, then I'd allocate and initialize an interpreter state object, then call it, passing the interpreter state object in to whatever Python functions I want to call. They would in turn pass that pointer to whatever they call, or access the state through it directly. That pointer is the "current interpreter".
I think that can work if you have disciplined separation, which you are assuming. I think you would pass the function to the interpreter, not the other way around. I'm assuming this is described from the perspective of some C code and your Python functions are PyFunction objects, not just text? What, however, prevents you creating that function in one interpreter and giving it to another? The function, and any closure or defaults are owned by the creating interpreter.
There's a lot of state per interpreter, including the module state. See "struct _is" in Include/internal/pycore_interp.h.
So much more than when I last looked! Look back in time and interpreter state mostly contains the module context (in a broad sense that includes shortcuts to sys, builtins, codec state, importlib). Ok, there's some stuff about exit handling and debugging too. The recent huge growth is to shelter previously singleton object allocation mechanisms, a consequence of the implementation choice that gives the interpreter object that responsibility too. I'm not saying this is wrong, just that it's not a concept in Python-the-language, while the module state is. Jeff
On 6/17/2020 6:03 PM, Jeff Allen wrote:
The Python level doesn't support it, prevents it I think, and perhaps the implementation doesn't support it, but nothing can stop C actually doing it. I would agree that with sufficient discipline in the code it should be possible to prevent the worlds from colliding. But it is difficult, so I think that is why Mark is arguing for a separate address space. Marshalling the value across is supported, but that's just the value, not a shared object.
Yes, it's difficult to have the discipline in C, just as multi-threaded is difficult in C. I agree separate address spaces makes isolation much easier, but I think there are use cases that don't align with separate address spaces, and we should support those.
Sorry for being loose with terms. If I want to create an interpreter and execute it, then I'd allocate and initialize an interpreter state object, then call it, passing the interpreter state object in to whatever Python functions I want to call. They would in turn pass that pointer to whatever they call, or access the state through it directly. That pointer is the "current interpreter".
I think that can work if you have disciplined separation, which you are assuming. I think you would pass the function to the interpreter, not the other way around. I'm assuming this is described from the perspective of some C code and your Python functions are PyFunction objects, not just text? What, however, prevents you creating that function in one interpreter and giving it to another? The function, and any closure or defaults are owned by the creating interpreter.
In the C API (which is what I think we're discussing), I think it would be passing the interpreter state to the function. And nothing would prevent you from getting it wrong.
There's a lot of state per interpreter, including the module state. See "struct _is" in Include/internal/pycore_interp.h.
So much more than when I last looked! Look back in time and interpreter state mostly contains the module context (in a broad sense that includes shortcuts to sys, builtins, codec state, importlib). Ok, there's some stuff about exit handling and debugging too. The recent huge growth is to shelter previously singleton object allocation mechanisms, a consequence of the implementation choice that gives the interpreter object that responsibility too. I'm not saying this is wrong, just that it's not a concept in Python-the-language, while the module state is.
I think most of these changes are Victor's, and I think they're a step in the right direction. Since Python globals are really module state, it makes sense that that's the part that's visible to Python. Eric
There are the usual concurrency problems of "read a value, change it, store it back without checking whether it already changed". The only thing special about lifecycle happens at refcount 0, which should not happen when more than one interpreter has a reference. Similarly, C code can mess things up if it does something unsupported -- but that is already the case. C code *could* set the refcount to something random, but that wouldn't be considered a bug in python, because there isn't much python can do to prevent it -- and that doesn't change with a second interpreter.
I don't think that sharing data only by copying is the final plan. Proxied objects seem like a fairly obvious extension. I am also a bit suspicious of that great timing; perhaps latency is also important for startup?
Multiprocessing serialisation overheads are abysmal. With enough OS support you can attempt to mitigate that via shared memory mechanisms (which Davin added to the standard library), but it's impossible to get the overhead of doing that as low as actually using the address space of one OS process. For the rest of the email... multiprocessing isn't going anywhere. Within-process parallelism is just aiming to provide another trade-off point in design space for CPU bound workloads (one roughly comparable to the point where JS web workers sit). Cheers, Nick.
Has anybody brought up the problem yet that if one subinterpreter encounters a hard crash (say, it segfaults due to a bug in a C extension module), all subinterpreters active at that moment in the same process are likely to lose all their outstanding work, without a chance of recovery? (Of course once we have locks in shared memory, a crashed process leaving a lock behind may also screw up everybody else, though perhaps less severely.) -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
That's so, but threads have this problem too. I don't think this discussion is about finding a "perfect" solution or an "ultimate" way of doing things, rather it is about the varying opinions on certain design tradeoffs. If I'm satisfied that subinterpreters are the correct solution to my particular need, why shouldn't I have the privilege of doing so? --Edwin
On Tue, Jun 16, 2020 at 10:52 AM Edwin <edwin@211mainstreet.net> wrote:
That's so, but threads have this problem too. I don't think this discussion is about finding a "perfect" solution or an "ultimate" way of doing things, rather it is about the varying opinions on certain design tradeoffs. If I'm satisfied that subinterpreters are the correct solution to my particular need, why shouldn't I have the privilege of doing so?
Interesting choice of word. This is open source, no feature is free, you are not entitled to anything in particular. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On 2020-06-16 19:20, Guido van Rossum wrote:
Has anybody brought up the problem yet that if one subinterpreter encounters a hard crash (say, it segfaults due to a bug in a C extension module), all subinterpreters active at that moment in the same process are likely to lose all their outstanding work, without a chance of recovery?
(Of course once we have locks in shared memory, a crashed process leaving a lock behind may also screw up everybody else, though perhaps less severely.)
Not really. Asyncio has the same problem; has anyone brought this issue up there? (Granted, asyncio probably didn't uncover too many issues in extension modules, but if it did, I assume they would get fixed.) If you're worried about segfaults, then you should use multiple processes. That will always give you better isolation. But I don't think it's a reason to stop improving interpreter isolation.
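To make that isolation concrete, here is a small sketch (the deliberately crashing helper is illustrative, and the exact exit code is platform-dependent) showing that a hard crash in a worker process is merely reported to the parent, which is the property in-process subinterpreters cannot offer:

# Sketch: a segfault in a child process does not take the parent down.
# ctypes.string_at(0) reads from a NULL pointer; on Linux the child dies
# with SIGSEGV and the parent just observes a negative exit code.
import ctypes
from multiprocessing import Process

def crash():
    ctypes.string_at(0)   # invalid memory read

if __name__ == "__main__":
    p = Process(target=crash)
    p.start()
    p.join()
    print("parent still alive; child exit code:", p.exitcode)   # e.g. -11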
On 16/06/2020 1:24 pm, Nick Coghlan wrote:
Multiprocessing serialisation overheads are abysmal. With enough OS support you can attempt to mitigate that via shared memory mechanisms (which Davin added to the standard library), but it's impossible to get the overhead of doing that as low as actually using the address space of one OS process.
What does "multiprocessing serialisation" even mean? I assume you mean the overhead of serializing objects for communication between processes. The cost of serializing an object has absolutely nothing to do with which process the interpreter is running in. Separate interpreters within a single process will still need to serialize objects for communication. The overhead of passing data through shared memory is the same for threads and processes. It's just memory. Can we please stick to facts and not throw around terms like "abysmal" with no data whatsoever to back it up.
On 2020-06-16 20:28, Mark Shannon wrote:
Can we please stick to facts and not throw around terms like "abysmal" with no data whatsoever to back it up.
I'd like to get back to the facts. Let me quote the original mail from this thread: On 2020-06-05 16:32, Mark Shannon wrote:
While I'm in favour of PEP 554, or some similar model for parallelism in Python, I am opposed to the changes we are currently making to support it.
Which changes? There are several efforts in this general space. Personally, I also don't agree with them all. And I think the reason I wasn't able to formulate too many replies to you is that we don't have a common understanding of what is being discussed, and of the motivations behind the changes. You seem to be trying to convince everyone that multiple processes are better (at isolation, and at performance) than multiple interpreters in one process. And I see the point: if you can live with the restriction of multiple processes, they probably are a better choice! But I don't think PEPs 554, 489, 573, etc. are about choosing between multiprocessing and multiple interpreters; they're about making multiple interpreters better than they currently are.
On Wed., 17 Jun. 2020, 4:28 am Mark Shannon, <mark@hotpy.org> wrote:
The overhead of passing data through shared memory is the same for threads and processes. It's just memory.
No, it's not. With multiple processes, you have to instruct the OS to poke holes in the isolated-by-default behavior in order to give multiple Python interpreters access to a common memory store. When the interpreters are in the same process, that isn't true - to give multiple Python interpreters access, you just give them all a pointer to the common data. This will work most easily when the state being shared is not itself a Python object. PEP 3118 buffers will be one example of that (including when using pickle protocol 5 for data passing between interpreters), but the application embedding use case (where there's no real "main" interpreter, just multiple subinterpreters manipulating the application state) is the other one I expect to be reasonably common. This is the Ceph/mod_wsgi/hexchat plugin use case, which is beneficial enough for people to have pursued it *despite* the significant usability problems with the current state of the subinterpreter support. Doing full blown zero-copy ownership transfer of actual Python objects would be more difficult, since the current plan is to have separate memory allocation pools per interpreter to avoid excessive locking overhead, so I don't currently expect to see that any time soon, even if PEP 554 is accepted. Assuming that remains the case, I'd expect multiprocessing to remain the default choice for CPU bound use cases where all the interesting state is held in Python objects (if you're going to have to mess about with a separate heap of shared objects anyway, you may as well also enjoy the benefits of greater process isolation). Cheers, Nick.
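For a sense of what that "poking holes" looks like in practice, here is a hedged sketch using the multiprocessing.shared_memory module mentioned earlier (Python 3.8+; the segment size and the single-byte write are illustrative): the buffer itself is shared without copying, but it has to be explicitly created, attached to by name, and cleaned up.

# Sketch: explicit OS-level shared memory between two processes.
# The data is zero-copy once mapped, but the sharing is opt-in and by name.
from multiprocessing import Process, shared_memory

def worker(name):
    shm = shared_memory.SharedMemory(name=name)   # attach to an existing segment
    shm.buf[0] = 42                               # visible to the parent, no copy
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        p = Process(target=worker, args=(shm.name,))
        p.start()
        p.join()
        print(shm.buf[0])   # 42
    finally:
        shm.close()
        shm.unlink()        # tell the OS to release the segment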
On Wed, Jun 17, 2020 at 5:56 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
Doing full blown zero-copy ownership transfer of actual Python objects would be more difficult, since the current plan is to have separate memory allocation pools per interpreter to avoid excessive locking overhead, so I don't currently expect to see that any time soon, even if PEP 554 is accepted. Assuming that remains the case, I'd expect multiprocessing to remain the default choice for CPU bound use cases where all the interesting state is held in Python objects (if you're going to have to mess about with a separate heap of shared objects anyway, you may as well also enjoy the benefits of greater process isolation).
So most likely there wouldn't be any way to share something like a bytearray or another buffer interface-compatible type for some time. That's too bad, I was hoping to have shared arrays that I could put a memoryview on in each thread/interpreter and deal with locking if I need to, but I suppose I can work through an extension once the changes stabilize. Packages like NumPy have had their own opaque C types and C-only routines to handle all the big threading outside of Python as a workaround for a long time now.
On Wed, Jun 17, 2020 at 11:42 AM Emily Bowman <silverbacknet@gmail.com> wrote:
So most likely there wouldn't be any way to share something like a bytearray or another buffer interface-compatible type for some time. That's too bad, I was hoping to have shared arrays that I could put a memoryview on in each thread/interpreter and deal with locking if I need to,
Earlier versions of PEP 554 did have a "SendChannel.send_buffer()" method for this but we tabled it in the interest of simplifying. That said, I expect we'll add something like that separately later.
but I suppose I can work through an extension once the changes stabilize.
Yep. This should be totally doable in an extension and hopefully without much effort.
Packages like NumPy have had their own opaque C types and C-only routines to handle all the big threading outside of Python as a workaround for a long time now.
As a workaround for what? This sounds interesting. :) -eric
On Thu., 18 Jun. 2020, 6:06 am Eric Snow, <ericsnowcurrently@gmail.com> wrote:
Earlier versions of PEP 554 did have a "SendChannel.send_buffer()" method for this but we tabled it in the interest of simplifying. That said, I expect we'll add something like that separately later.
Right, buffers are different because the receiving interpreter can set up a memoryview that refers to storage allocated by the source interpreter. So the Python objects aren't shared (avoiding refcounting complications), but the expensive data copying step can still be avoided.
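A hedged sketch of the pickle protocol 5 mechanism mentioned above (shown within a single process purely to illustrate the API; across interpreters the buffers would travel via shared memory or a channel): the large buffer is handed over out-of-band, so only a few bytes of metadata go through the pickle stream.

# Sketch: PEP 574 / pickle protocol 5 out-of-band buffers.
import pickle
from pickle import PickleBuffer

payload = bytearray(10_000_000)          # stand-in for a large data buffer

out_of_band = []
meta = pickle.dumps(PickleBuffer(payload), protocol=5,
                    buffer_callback=out_of_band.append)
print(len(meta))                         # small: the 10 MB never entered the stream

# "Receiving" side: the buffers are supplied separately and re-attached.
restored = pickle.loads(meta, buffers=out_of_band)
print(memoryview(restored).nbytes)       # 10000000, still backed by payload's memory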
Packages like NumPy have had their own opaque C types and C-only routines to handle all the big threading outside of Python as a workaround for a long time now.
As a workaround for what? This sounds interesting. :)
For the GIL - lots of NumPy operations are in pure C or FORTRAN and will happily use as many CPUs as you have available. Cheers, Nick.
I wanted to let people know that the four of us on the SC not driving this work -- i.e. everyone but Victor -- talked about this at our last meeting and we support the work to isolate interpreter state from being global. There are benefits for the situation where you have to integrate CPython with other code which does its own thread management (i.e. the embedded scenario). It also helps from an organizational perspective of the code and thus we believe leads to easier maintainability long-term. We are okay with the performance trade-off required for this work. I will also say that while this work is a prerequisite for PEP 554 as currently proposed, it does not mean the SC believes PEP 554 will ultimately be accepted. We view this work as independently motivated from PEP 554.
participants (20)
- Brett Cannon
- Chris Angelico
- Edwin
- Edwin Zimmerman
- Emily Bowman
- Eric Snow
- Eric V. Smith
- Ethan Furman
- Guido van Rossum
- Inada Naoki
- Jeff Allen
- Jim J. Jewett
- Mark Shannon
- Nick Coghlan
- Paul Moore
- Petr Viktorin
- Riccardo Ghetta
- Ronald Oussoren
- Steve Dower
- Victor Stinner