[capi-sig]How to access the various levels of runtime state (e.g. PyInterpreterState, _PyRuntimeState)?
In https://bugs.python.org/issue36710 Victor wants to move away from the _PyRuntime C global, instead passing the _PyRuntimeState around explicitly. I'm in favor of the general idea, but not on a case-by-case basis like this. I left a comment on that issue (https://bugs.python.org/msg340945) that explains my position in more detail. In retrospect, I should have just posted here. :) So I've copied that comment below, as-is.
FYI, my intention is not to refuse Victor's objective in the issue. Rather, I want to make sure we have consensus on a valid broader objective on which to focus. This seemed like a perfect opportunity to start a discussion about it.
-eric
Status Quo
For simplicity's sake, let's say nearly all the code operates relative to 3 levels of runtime state:
- global - _PyRuntimeState
- interpreter - PyInterpreterState
- thread - PyThreadState
Furthermore, there are 3 groups of functions in the C-API:
- context-sensitive - operate relative to the current Python thread
- runtime-dependent - operate relative to some part of the runtime state, regardless of thread
- runtime-independent - have nothing to do with CPython's runtime state
Most of the C-API is context-sensitive. A small portion is runtime-dependent. A handful of functions are runtime-independent (effectively otherwise stateless helper functions that only happen to be part of the C-API).
Each context-sensitive function relies on there being a "runtime context" it can use relative to the current OS thread. That context consists of the current (i.e. active) PyThreadState, the corresponding PyInterpreterState, and the global _PyRuntimeState. That context is derived from data in TSS (see caveats below). This group includes most of the C-API.
Each runtime-dependent function operates against one or more runtime state targets, regardless of the current thread context (or even if there isn't one). The target state (e.g. PyInterpreterState) is always passed explicitly. Again, this is only a small portion of the C-API.
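The three groups can be sketched in a few lines of C. Everything below is illustrative, not CPython's actual definitions: the struct layouts are mock stand-ins and `current_tstate` merely plays the role of TSS.

```c
#include <stddef.h>

/* Mock stand-ins for the three levels of state (illustration only). */
typedef struct { int ninterps; } _PyRuntimeState;
typedef struct { _PyRuntimeState *runtime; int id; } PyInterpreterState;
typedef struct { PyInterpreterState *interp; } PyThreadState;

/* Stand-in for TSS: where the implicit "current" thread state would live. */
static PyThreadState *current_tstate = NULL;

/* Context-sensitive: finds its state implicitly via the current thread. */
int interp_id_of_current_thread(void) {
    return current_tstate->interp->id;
}

/* Runtime-dependent: the target state is always passed explicitly. */
int interp_count(_PyRuntimeState *runtime) {
    return runtime->ninterps;
}

/* Runtime-independent: a stateless helper that merely lives in the C-API. */
int add_one(int x) {
    return x + 1;
}
```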
Caveats: thread context
- for context-sensitive functions, we get the global runtime state from the global C variable (_PyRuntime) rather than via the implicit runtime context
- for some of the runtime-dependent functions that target _PyRuntimeState, we rely on the global C variable
All of this is the pattern we use currently. Using TSS to identify the implicit runtime context has certain benefits and costs:
benefits:
- sticking with the status quo means no backward incompatibility for existing C-extension code
- easier to distinguish the context-sensitive functions from the runtime-dependent ones
- (debatable) callers don't have to track, nor pass through, an extra argument
costs:
- extra complexity in keeping TSS correct
- makes the C-API bigger (extra macros, etc.)
Alternative
For every context-sensitive function we could add a new first parameter, "context", that provides the runtime context to use. That would be something like this:
typedef struct { PyThreadState *tstate; /* ... */ } PyRuntimeContext;
The interpreter state and global runtime state would still be accessible via the same indirection we have now.
Taking this alternative would eliminate the previous costs. Having a consistent "PyRuntimeContext *context" first parameter would maintain the easy distinction from runtime-dependent functions. Asking callers to pass in the context explicitly is probably better regardless. As to backward compatibility, we could maintain a shim to bridge between the old way and the new.
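A rough sketch of what this alternative plus the compatibility shim might look like. All names and layouts here are hypothetical (mock types, not the real API); the point is the shape of the two call styles.

```c
#include <stddef.h>

/* Mock types standing in for the real state structs (illustration only). */
typedef struct { int id; } PyInterpreterState;
typedef struct { PyInterpreterState *interp; } PyThreadState;

/* The proposed context: starts with tstate, leaving room to grow. */
typedef struct { PyThreadState *tstate; } PyRuntimeContext;

/* New-style context-sensitive function: context is the first parameter. */
int ctx_interp_id(PyRuntimeContext *context) {
    return context->tstate->interp->id;
}

/* Stand-in for TSS, from which the shim derives the implicit context. */
static PyThreadState *tss_current = NULL;

/* Backward-compatibility shim: old signature, builds the context itself. */
int old_interp_id(void) {
    PyRuntimeContext context = { tss_current };
    return ctx_interp_id(&context);
}
```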
About the C-global _PyRuntime
Currently the global runtime state (_PyRuntimeState) is stored in a static global C variable, _PyRuntime. I added it at the time I consolidated many of the existing C globals into a single struct. Having a C global makes it easy to do the wrong thing, so it may be good to do something else.
That would mean allocating a _PyRuntimeState on the heap early in startup and passing it around where needed. I expect that would not have any meaningful performance penalty. It would probably also simplify some of the code we currently use to manage _PyRuntime correctly.
As a bonus, this would be important if we decided that multiple-runtimes-per-process were a desirable thing. That's a neat idea, though I don't see a need currently. So on its own it's not really a justification for dropping a static _PyRuntime. :) However, I think the other reasons are enough.
Conclusions
This issue has a specific objective that I think is premature. We have an existing pattern and we should stick with that until we decide to change to a new pattern. That said, a few things should get corrected and we should investigate alternative patterns for the context-sensitive C-API.
As to getting rid of the _PyRuntime global variable in favor of putting it on the heap, I'm not opposed. However, doing so should probably be handled in a separate issue.
Here are my thoughts on actionable items:
- look for a better pattern for the context-sensitive C-API
- clearly document which of the 3 groups each C-API function belongs to
- add a "runtime" field to the PyInterpreterState pointing to the parent _PyRuntimeState
- (maybe) add a _PyRuntimeState_GET() macro, a la PyThreadState_GET()
- for context-sensitive C-API that uses the global runtime state, get it from the current PyInterpreterState
- for runtime-dependent C-API that targets the global runtime state, ensure the _PyRuntimeState is always an explicit parameter
- (maybe) drop _PyRuntime and create a _PyRuntimeState on the heap during startup to pass around
A response from Jeroen Demeyer (https://bugs.python.org/msg340971):
Changing *every* C API function to include a state parameter looks very cumbersome. Another alternative would be to store the interpreter state in every Python object (or every class, that would be sufficient). That way, you would only need to pass context to C API functions which do not take a Python object as argument.
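Jeroen's idea can be condensed into a compilable sketch. The layouts below are mocks (the real PyObject/PyTypeObject definitions differ); they only illustrate reaching the interpreter through an object's class.

```c
#include <stddef.h>

/* Mock layouts: every class carries a pointer to its owning interpreter. */
typedef struct { int id; } PyInterpreterState;
typedef struct { PyInterpreterState *interp; } PyTypeObject;
typedef struct { PyTypeObject *ob_type; } PyObject;

/* Any API that receives an object can recover the interpreter from it, so
 * only object-less entry points would need an explicit context argument. */
PyInterpreterState *interp_from_object(PyObject *ob) {
    return ob->ob_type->interp;
}
```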
A response from Steve Dower (https://bugs.python.org/msg340986):
Changing every API to take the context parameter would bring us into alignment with the JavaScript VMs.
I'm working on a project that embeds a few of these, as well as Python, and our thread management is much worse than their context parameter. Though I'm of course very sympathetic to the compatibility argument (but then the shims would just load the context from TSS and pass it around, so they're not too bad).
Eric's breakdown of context scopes seems spot on, and it means that we only really need the thread state to be passed around. The few places that would be satisfied by runtime state now (GIL, GC) should become interpreter state, which is most easily found from a thread state anyway.
Runtime state should eventually probably become runtime configuration (those settings we need to create interpreters) and a minimum amount of state to track live interpreters. I see no reason to pass it around anywhere other than interpreter creation, and as a transitional step toward that goal it should be accessible through the active interpreter state.
Oops, I responded to Issue 36710 but I should have posted my comment here. Discussion is better done on this list since we are talking about an entirely new C-API.
===========================================================================
I think there are two questions to answer. First, do we want to support multiple runtimes per process? Second, if we do, what is the best way to do that? Some people would argue that multiple runtimes are not needed or are too hard to do. Maybe they are correct, I'm not sure. We should try to get a consensus on that first question.
If we do decide to do it, then we need to answer the second question. Passing a "context" argument around seems the best solution. That is how the Java JNI does it. It sounds like that's how Javascript VMs do it too. We don't need to get creative. Look at what other VMs do and copy the best idea.
If we do decide to do it, evolving the codebase and all extension modules is going to be a massive task. I would imagine that we can have a backwards compatible API layer that uses TSS. The layer that passes context explicitly would still have to maintain the TSS. There could be a build option that turns that backwards compatibility on or off. If off, you would gain some performance advantage because TSS does not have to be kept up-to-date.
My feeling right now is that even though this is a massive job, it is the correct thing to do. CPUs continue to gain cores. Improving CPython's ability to do multi-threading and multi-processing should be a priority for CPython core developers.
On 2019-04-27, Eric Snow wrote:
Alternative
For every context-sensitive function we could add a new first parameter, "context", that provides the runtime context to use. That would be something like this:
typedef struct { PyThreadState *tstate; /* ... */ } PyRuntimeContext;
The interpreter state and global runtime state would still be accessible via the same indirection we have now.
This should be an opaque structure, IMHO. Users of the API should not know what's inside of it. We could have inline functions to get various things out of it, e.g.
PyThreadState *PyRuntime_GetThreadState(PyRuntimeContext *ctx);
_PyRuntimeState *PyRuntime_GetRuntimeState(PyRuntimeContext *ctx);
Can we just add a _PyRuntimeState pointer to PyThreadState and then PyThreadState is basically the same as PyRuntimeContext? I don't see why _PyRuntimeState needs to be heap allocated at this point. Set the pointer to _PyRuntimeState when we create a new PyThreadState.
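Neil's suggestion amounts to something like the following sketch (mock types again): set the runtime pointer once, when the thread state is created, and the thread state itself becomes the context.

```c
#include <stddef.h>

/* Mock stand-ins; only the extra "runtime" field is the point here. */
typedef struct { int ninterps; } _PyRuntimeState;
typedef struct { _PyRuntimeState *runtime; } PyInterpreterState;
typedef struct {
    PyInterpreterState *interp;
    _PyRuntimeState *runtime;   /* set once, when the tstate is created */
} PyThreadState;

/* Creating a thread state wires up the runtime pointer immediately, so no
 * heap-allocated _PyRuntimeState is required for this to work. */
PyThreadState new_threadstate(PyInterpreterState *interp) {
    PyThreadState tstate;
    tstate.interp = interp;
    tstate.runtime = interp->runtime;
    return tstate;
}
```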
Regards,
Neil
On Fri, May 3, 2019 at 5:14 AM Neil Schemenauer <nas-python@arctrix.com> wrote:
If we do decide to do it, then we need to answer the second question. Passing a "context" argument around seems the best solution. That is how the Java JNI does it. It sounds like that's how Javascript VMs do it too. We don't need to get creative. Look at what other VMs do and copy the best idea.
How often does an extension / native method need to know what the current context is? I'd be happy to see global variables changed to accessor function calls. Not so happy if there's a context arg that has to be passed everywhere, even if I don't care about it.
(Old timers might remember the Xlib API. The first argument to every function was an XDisplay * dpy. 99.99% of applications only ran on a single display (1+ screens, keyboard, and mouse) but we had to store the display and pass it everywhere.)
Functions have arguments because the value might change. So if my extension / native method takes a context argument, is it OK for me to pass a different one internally? Something like
PyObject *Py_foo_function(PyContext *context, ...)
{
    PyObject *bar = Py_different_function(some_other_context, ...);
    ...
}
This could happen deliberately or by accident. What kind of expectations will there be on changing the context argument?
--
cheers,
Hugh Fisher
[Resending from the right address, sorry for dupes]
On May 3, 2019, at 5:16 AM, Hugh Fisher <hugo.fisher@gmail.com> wrote:
On Fri, May 3, 2019 at 5:14 AM Neil Schemenauer <nas-python@arctrix.com> wrote:
If we do decide to do it, then we need to answer the second question. Passing a "context" argument around seems the best solution. That is how the Java JNI does it. It sounds like that's how Javascript VMs do it too. We don't need to get creative. Look at what other VMs do and copy the best idea.
How often does an extension / native method need to know what the current context is? I'd be happy to see global variables changed to accessor function calls. Not so happy if there's a context arg that has to be passed everywhere, even if I don't care about it.
The problem is: where would the accessor function get the state? A global variable or TLS. So you haven’t solved the problem.
(Old timers might remember the Xlib API. The first argument to every function was an XDisplay * dpy. 99.99% of applications only ran on a single display (1+ screens, keyboard, and mouse) but we had to store the display and pass it everywhere.)
Functions have arguments because the value might change. So if my extension / native method takes a context argument, is it OK for me to pass a different one internally?
I don’t see why not.
And we can’t stop extensions from doing the wrong thing, like storing the context in a global variable. But such extensions will fail under certain scenarios involving multiple interpreters in a process.
Eric
Something like
PyObject *Py_foo_function(PyContext *context, ...)
{
    PyObject *bar = Py_different_function(some_other_context, ...);
    ...
}
This could happen deliberately or by accident. What kind of expectations will there be on changing the context argument?
--
cheers, Hugh Fisher
On 2019-05-03, Eric V. Smith wrote:
The problem is: where would the accessor function get the state? A global variable or TLS. So you haven’t solved the problem.
What specifically is the problem with using TLS? Is it only due to performance? I recall that Dino Viehland mentioned that things can be done to make using TLS faster. This is out of my depth but it seems like the overhead should be fairly small. For a language like Python, maybe that overhead is small enough that we don't want to go through the pain of explicitly passing context.
TLS mechanisms are platform specific. Based on a tiny bit of reading, on Linux x86-64 thread-local data is addressed through the FS segment register. That would be really low overhead.
Looking at current CPython, I see this:
#define _PyThreadState_GET() \
    ((PyThreadState*)_Py_atomic_load_relaxed(&_PyRuntime.gilstate.tstate_current))
So it would seem that Python doesn't use the optimized TLS as provided by the platform ABI.
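For comparison, C11's `_Thread_local` is the compiler-supported mechanism that typically compiles down to a single segment-relative load on mainstream ABIs. A minimal sketch, with a mock thread-state type:

```c
#include <stddef.h>

typedef struct { int id; } PyThreadState;  /* mock stand-in */

/* Compiler-supported TLS: on mainstream ABIs a read of this variable is
 * roughly one segment-relative memory access, much cheaper than a
 * pthread_getspecific() call and with no atomic-global bookkeeping. */
static _Thread_local PyThreadState *tls_tstate = NULL;

void set_tstate(PyThreadState *ts) { tls_tstate = ts; }
PyThreadState *get_tstate(void)    { return tls_tstate; }
```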
Regards,
Neil
[Again apologies for dupes. I really need to fix that]
On 5/3/19 2:53 PM, Neil Schemenauer wrote:
On 2019-05-03, Eric V. Smith wrote:
The problem is: where would the accessor function get the state? A global variable or TLS. So you haven’t solved the problem.
What specifically is the problem with using TLS? Is it only due to performance? I recall that Dino Viehland mentioned that things can be done to make using TLS faster. This is out of my depth but it seems like the overhead should be fairly small. For a language like Python, maybe that overhead is small enough that we don't want to go through the pain of explicitly passing context.
I'm just saying that if passing around a parameter is the solution to not using global state or TLS, then having a parameter-less accessor function to that state doesn't solve the problem: you're still using global state or TLS.
I think global state is a non-starter, of course. I could live with TLS. I don't know its performance relative to passing a parameter to all functions.
Eric
TLS mechanisms are platform specific. Based on a tiny bit of reading, on Linux x86-64 thread-local data is addressed through the FS segment register. That would be really low overhead.
Looking at current CPython, I see this:
#define _PyThreadState_GET() \
    ((PyThreadState*)_Py_atomic_load_relaxed(&_PyRuntime.gilstate.tstate_current))
So it would seem that Python doesn't use the optimized TLS as provided by the platform ABI.
Regards,
Neil
On 2019-05-03, Eric V. Smith wrote:
I'm just saying that if passing around a parameter is the solution to not using global state or TLS, then having a parameter-less accessor function to that state doesn't solve the problem: you're still using global state or TLS.
Thanks for the clarity. So the options seem to be:
A) don't support multiple interpreters per process
B) use TLS or some similar mechanism
C) explicitly pass a context pointer to every single CPython API
It's likely we all agree that A is not what we want. We are already trying to support multiple interpreters, even though the current implementation has a number of issues. I guess we could decide "forget it" and rip that all out. Eric Snow would be sad. ;-P
Regarding B vs C, is the only reason to prefer C due to performance? If so, I think it is not worth the pain of it. Based on my small amount of knowledge, TLS can be really cheap. I found this article that has some benchmarks:
https://david-grs.github.io/tls_performance_overhead_cost_linux/
Again, TLS is a platform specific mechanism so maybe it is worse on Windows, for example. However, it seems to me that there should be some way to do it with relatively small overhead.
Passing context everywhere doesn't come for free either. You are probably using up another register (or pushing something on the stack, depending on the ABI). If you don't need the context often and you don't switch threads often, option B is likely faster.
That's assuming that we can make the platform-optimized TLS work for how CPython uses PyThreadState. In current CPython, we don't use it and instead do atomic writes to memory. E.g. _PyThreadState_Swap().
On Sat, 4 May 2019 at 06:54, Neil Schemenauer <nas-python@arctrix.com> wrote:
Thanks for the clarity. So the options seem to be:
A) don't support multiple interpreters per process
B) use TLS or some similar mechanism
C) explicitly pass a context pointer to every single CPython API
It's likely we all agree that A is not what we want. We are already trying to support multiple interpreters, even though the current implementation has a number of issues. I guess we could decide "forget it" and rip that all out. Eric Snow would be sad. ;-P
Regarding B vs C, is the only reason to prefer C due to performance? If so, I think it is not worth the pain of it. Based on my small amount of knowledge, TLS can be really cheap. I found this article that has some benchmarks:
https://david-grs.github.io/tls_performance_overhead_cost_linux/
Again, TLS is a platform specific mechanism so maybe it is worse on Windows, for example. However, it seems to me that there should be some way to do it with relatively small overhead.
Passing context everywhere doesn't come for free either. You are probably using up another register (or pushing something on the stack, depending on the ABI). If you don't need the context often and you don't switch threads often, option B is likely faster.
I've discussed this with Eric Snow a bit here at PyCon, and I take the view that trying to migrate away from relying on TLS at this late date isn't going to be worth the hassle.
Firstly, extension modules need an implicit context no matter what, and if an implicit context is available, it's far more user friendly if CPython just uses it rather than requiring the extension module to retrieve it from wherever it is stored and then pass it in.
Within CPython, we'd also need to change every API that runs Python code, and every API that may implicitly call back in to Python code, to accept a context parameter.
And for embedding applications, having two ways to do it (ThreadState_Swap vs whatever the new way looks like) will mostly be annoying churn between two functionally equivalent ways of doing things rather than helpful enablement of any actually new functionality.
There may be some new APIs where it makes sense for us to define them as working on a supplied thread state rather than being dependent on the active one, but overall we shouldn't be attempting to change the way we handle the active context.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 04May2019 1353, Nick Coghlan wrote:
On Sat, 4 May 2019 at 06:54, Neil Schemenauer <nas-python@arctrix.com> wrote:
Regarding B vs C, is the only reason to prefer C due to performance? If so, I think it is not worth the pain of it. Based on my small amount of knowledge, TLS can be really cheap. I found this article that has some benchmarks:
https://david-grs.github.io/tls_performance_overhead_cost_linux/
Again, TLS is a platform specific mechanism so maybe it is worse on Windows, for example. However, it seems to me that there should be some way to do it with relatively small overhead.
Passing context everywhere doesn't come for free either. You are probably using up another register (or pushing something on the stack, depending on the ABI). If you don't need the context often and you don't switch threads often, option B is likely faster.
In embedding situations the programming model using parameters will be significantly simpler than using TLS, particularly when embedding Python into an existing app that already has a thread model. Assuming you can trivially stash the context pointer in your existing non-Python thread state, it is very easy to handle callbacks into Python. Without this, you end up storing the Python thread state itself and trying to carefully switch it back in by mutating state *outside* of your own managed state, and you run into consistency issues very quickly (I was working on doing this the week before PyCon).
It's a bit like the difference between threading.local and contextvars, and how the former carries across async call stacks while the latter are far more like "call-stack locals".
But...
I've discussed this with Eric Snow a bit here at PyCon, and I take the view that trying to migrate away from relying on TLS at this late date isn't going to be worth the hassle.
I agree, but just want to point out (for when people ask in the future) that we agreed to stick with TLS for compatibility reasons and not because it's inherently better :)
Cheers, Steve
On Sat, 11 May 2019 at 00:27, Steve Dower <steve.dower@python.org> wrote:
On 04May2019 1353, Nick Coghlan wrote:
I've discussed this with Eric Snow a bit here at PyCon, and I take the view that trying to migrate away from relying on TLS at this late date isn't going to be worth the hassle.
I agree, but just want to point out (for when people ask in the future) that we agreed to stick with TLS for compatibility reasons and not because it's inherently better :)
Yeah, if we were starting from scratch today, a ubiquitous explicit context arg would potentially be preferable for a whole host of reasons. It's only the fact that we have to make the TLS approach work *anyway* for the sake of the existing API that means it isn't worthwhile to also pursue the explicit parameter approach at the level of the public API.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Hi,
Sorry, I was super busy this month and so I wasn't able to reply to this thread earlier :-( I wrote up notes from the quick chat we had with Eric Snow, Nick Coghlan and Pablo Galindo Salgado during the PyCon sprints:
https://pythoncapi.readthedocs.io/runtime.html
- We want to be able to run multiple interpreters at the same time: reduce shared state, have one "interpreter lock" per interpreter (remove the *Global* Interpreter Lock, GIL)
- Currently, the _PyRuntime variable is the root used to get the current Python thread state. IMHO that has to change... but I'm not sure how :-) At least, each interpreter must have its own private "gilstate".
- IMHO the Python thread state "tstate" must be the "root" to access basically everything: get the interpreter state (tstate->interp), get a module state, get the current exception, etc.
- The **public** C API must not be modified
- We can add new internal C API where we can pass as much "context" as we want
I started to modify Python **internals** to pass "runtime" and "tstate" in core files: pystate.c, ceval.c, pyerrors.c, signalmodule.c, pylifecycle.c, coreconfig.c, gcmodule.c, ...
=> https://bugs.python.org/issue36710
IMHO the "runtime" must be replaced with "tstate" in the long term. That's why I started passing pointers to fields inside runtime directly, rather than "runtime" itself, so those functions will not have to be modified later. Example:
void _PyEval_SignalReceived(struct _ceval_runtime_state *ceval);
The function is called in signalmodule.c using:
_PyRuntimeState *runtime = &_PyRuntime;
_PyEval_SignalReceived(&runtime->ceval);
If tomorrow ceval is moved to PyInterpreterState, only signalmodule.c will have to be modified: not ceval.c.
... Maybe passing "ceval" directly is too specific, and passing "tstate" everywhere will be fine in the future.
_PyRuntime should only be used to store the list of interpreters and anything to communicate between interpreters.
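Victor's pattern can be condensed into a compilable sketch (mock structs; the point is that the callee's signature names only the sub-state it needs, so the caller is the only place that knows where that sub-state lives):

```c
#include <stddef.h>

/* Mock sub-state and runtime layout (illustration only). */
struct _ceval_runtime_state { int signals_pending; };
typedef struct { struct _ceval_runtime_state ceval; } _PyRuntimeState;

static _PyRuntimeState _PyRuntime;

/* The callee only knows about ceval state, not where that state lives, so
 * moving ceval into PyInterpreterState would change only the call sites. */
void _PyEval_SignalReceived(struct _ceval_runtime_state *ceval) {
    ceval->signals_pending = 1;
}

/* Today's call site, as in signalmodule.c. */
void signal_handler(void) {
    _PyRuntimeState *runtime = &_PyRuntime;
    _PyEval_SignalReceived(&runtime->ceval);
}
```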
Victor
On 24May2019 0825, Victor Stinner wrote:
I started to modify Python **internals** to pass "runtime" and "tstate" in core files: pystate.c, ceval.c, pyerrors.c, signalmodule.c, pylifecycle.c, coreconfig.c, gcmodule.c, ...
This is why some of us have been advocating for a new structure - "PyContext" or similar - that can contain the information needed.
Though I see you've gone ahead and modified the APIs already without waiting for discussion or review, so I guess this is how CPython works now *shrug*
Cheers, Steve
Le ven. 24 mai 2019 à 18:14, Steve Dower <steve.dower@python.org> a écrit :
On 24May2019 0825, Victor Stinner wrote:
I started to modify Python **internals** to pass "runtime" and "tstate" in core files: pystate.c, ceval.c, pyerrors.c, signalmodule.c, pylifecycle.c, coreconfig.c, gcmodule.c, ...
This is why some of us have been advocating for a new structure - "PyContext" or similar - that can contain the information needed.
What do you plan to put in such PyContext?
Though I see you've gone ahead and modified the APIs already without waiting for discussion or review, so I guess this is how CPython works now *shrug*
Well, we had a discussion at PyCon and we agreed to add new parameters to pass a "context". But we didn't define what the context would look like. We also agreed not to touch the public API, only internals. As long as it's internal, we are free to modify it whenever we like.
I went ahead to "discover" through the code what we need. So far, it seems like "it depends" :-) Each function has different needs.
Victor
Night gathers, and now my watch begins. It shall not end until my death.
On 24May2019 1018, Victor Stinner wrote:
Le ven. 24 mai 2019 à 18:14, Steve Dower <steve.dower@python.org> a écrit :
On 24May2019 0825, Victor Stinner wrote:
I started to modify Python **internals** to pass "runtime" and "tstate" in core files: pystate.c, ceval.c, pyerrors.c, signalmodule.c, pylifecycle.c, coreconfig.c, gcmodule.c, ...
This is why some of us have been advocating for a new structure - "PyContext" or similar - that can contain the information needed.
What do you plan to put in such PyContext?
We hadn't figured that out yet, but given that Yury found we needed contextvars because thread locals weren't sufficient after many years, I don't want to assume that the thread state is sufficient.
If we always pass the context struct by pointer, we can even expand its contents without breaking existing code. It just seems like a good engineering design to allow for future growth here.
Though I see you've gone ahead and modified the APIs already without waiting for discussion or review, so I guess this is how CPython works now *shrug*
Well, we had a discussion at PyCon and we agreed to add new parameters to pass a "context". But we didn't define what the context would look like. We also agreed not to touch the public API, only internals. As long as it's internal, we are free to modify it whenever we like.
Only from a forwards/backwards compatibility point of view, not for checking in code without review or discussion. Especially when you know there's a group of people interested and actively trying to participate in this area.
I went ahead to "discover" through the code what we need. So far, it seems like "it depends" :-) Each function has different needs.
Right, but for the most part, it's all going to come via the thread state, since that's our compatibility restriction. So probably best to pass that around consistently than try and select the minimal part of the interface for each function.
Or start with the PyContext struct that only contains the current thread state, and if we find reasons to move some things to the context level rather than the thread level, we have the ability to do that without breaking even ourselves.
Cheers, Steve
Le ven. 24 mai 2019 à 19:29, Steve Dower <steve.dower@python.org> a écrit :
What do you plan to put in such PyContext?
We hadn't figured that out yet, but given that Yury found we needed contextvars because thread locals weren't sufficient after many years, I don't want to assume that the thread state is sufficient.
If we always pass the context struct by pointer, we can even expand its contents without breaking existing code. It just seems like a good engineering design to allow for future growth here.
If we pass a structure by copy, the structure would be copied on every call. That might have a negative impact on performance, especially if the structure contains multiple fields.
If we pass the structure by reference (pointer), it would add yet another indirection to every memory access to the context. That can also have an impact on performance. By the way, my first implementation of _PyBytesWriter put everything in a single structure, OOP-style. But it was way less efficient. I had a hard time understanding why simply moving pointers into registers made the code so much more efficient. It's something about aliasing... I took some notes there:
https://vstinner.github.io/pybyteswriter.html https://bugs.python.org/issue17742 https://gcc.gnu.org/ml/gcc-help/2013-04/msg00192.html
Since this bad experience, I'm less excited by structures for hot code path :-(
--
In Python 3.7, PyThreadState_GET() is implemented as: _Py_atomic_load_relaxed(&_PyThreadState_Current).
In Python 3.8, it's implemented as _Py_atomic_load_relaxed(&_PyRuntime.gilstate.tstate_current).
_Py_atomic_load_relaxed() should be efficient, but it's called very frequently and it requires memory fences. I expect that passing "tstate" directly, rather than reading an atomic variable, is more efficient.
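The two access patterns being compared, side by side (mock type only; the atomic load mirrors the shape of _PyThreadState_GET(), while the argument version is what explicit tstate-passing enables):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct { int id; } PyThreadState;  /* mock stand-in */

/* Today's pattern: a relaxed atomic load of a global on every access. */
static _Atomic(PyThreadState *) tstate_current = NULL;

PyThreadState *get_tstate_atomic(void) {
    return atomic_load_explicit(&tstate_current, memory_order_relaxed);
}

/* The proposed pattern: tstate arrives as an argument, so the compiler can
 * keep it in a register across calls instead of reloading it each time. */
int interp_work(PyThreadState *tstate) {
    return tstate->id;
}
```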
Victor
participants (7)
- Eric Snow
- Eric V. Smith
- Hugh Fisher
- Neil Schemenauer
- Nick Coghlan
- Steve Dower
- Victor Stinner