Pass the Python thread state to internal C functions

Hi,

Are you ok to modify internal C functions to pass explicitly tstate?

--

I started to modify internal C functions to pass explicitly "tstate" when calling C functions: the Python thread state (PyThreadState). Example of C code (after my changes):

    if (_Py_EnterRecursiveCall(tstate, " while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    _Py_LeaveRecursiveCall(tstate);
    return _Py_CheckFunctionResult(tstate, callable, result, NULL);

In Python 3.8, the tstate is implicit:

    if (Py_EnterRecursiveCall(" while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    Py_LeaveRecursiveCall();
    return _Py_CheckFunctionResult(callable, result, NULL);

There are different reasons to pass tstate explicitly, but my main motivation is to rework the Python code base to move away from implicit global states to states passed explicitly, in order to implement PEP 554 "Multiple Interpreters in the Stdlib". In short, the final goal is to run multiple isolated Python interpreters in the same process: run pure Python code on multiple CPUs in parallel within a single process (whereas multiprocessing runs multiple processes).

Currently, subinterpreters are a hack: they still share a lot of things, and the code base is not ready to implement isolated interpreters with one "GIL" (interpreter lock) per interpreter and to run multiple interpreters in parallel. Many _PyRuntimeState fields (the global _PyRuntime variable) should be moved to PyInterpreterState (or maybe PyThreadState): per interpreter.

Another simpler but more annoying example is the Py_None and Py_True singletons, which are globals. We cannot share these singletons between interpreters because updating their reference counters would be a performance bottleneck. If we put a "superglobal GIL" around them to keep the Py_None reference counter consistent, it would basically "serialize" all threads, rather than running them in parallel.

The idea of passing tstate to internal C functions is to prepare the code to get the per-interpreter None from tstate.

tstate is basically the "root" from which all per-interpreter states can be reached. For example, PyInterpreterState can be read from tstate->interp.

Right now, tstate is only passed to a few functions, but you should expect to see it passed to way more functions later, once more structures have been moved to PyInterpreterState.

--

On my latest merged PR 17052 ("Add _PyObject_VectorcallTstate()"), Mark Shannon wrote: "I don't see how this could ever be faster, nor do I see how it is more correct." https://github.com/python/cpython/pull/17052#issuecomment-552538438

Currently, tstate is obtained using these internal APIs:

    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)

or using public APIs:

    PyAPI_FUNC(PyThreadState *) PyThreadState_Get(void);
    #define PyThreadState_GET() PyThreadState_Get()

I dislike _PyThreadState_GET() for 2 reasons:

* it relies on the _PyRuntime global variable: I would prefer to avoid global variables
* it uses an atomic operation, which can become a performance issue when more and more code requires tstate

--

An alternative would be to use PyGILState_GetThisThreadState(), which uses a thread-local storage (TLS) variable to get the Python thread state ("tstate"), rather than the _PyRuntime atomic variable. Except that the PyGILState API doesn't support subinterpreters yet :-(

https://bugs.python.org/issue15751 "Support subinterpreters in the GIL state API" has been open since 2012.

Note: While the GIL is released, _PyThreadState_GET() is NULL, whereas PyGILState_GetThisThreadState() is non-NULL.

--

Links:

* https://pythoncapi.readthedocs.io/runtime.html : my notes on moving globals to per-interpreter states
* https://bugs.python.org/issue36710
* https://bugs.python.org/issue38644

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
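A minimal sketch of what "get the per-interpreter None from tstate" could eventually look like; the interp->none field and the _Py_NONE() helper below are purely hypothetical and do not exist in CPython today:

    /* Hypothetical sketch: neither the "none" field nor _Py_NONE() exist in
     * CPython; they only illustrate how a per-interpreter singleton could be
     * reached once tstate is passed explicitly. */
    static inline PyObject *
    _Py_NONE(PyThreadState *tstate)
    {
        /* tstate->interp is the per-interpreter state (PyInterpreterState) */
        return tstate->interp->none;
    }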

On 2019-11-12 23:03, Victor Stinner wrote:
Hi,
Are you ok to modify internal C functions to pass explicitly tstate?
In short, yes, but:
- don't make things slower :)
- don't break the public API or the stable ABI

I'm a fan of explicitly passing state everywhere, rather than keeping it in "global" variables. Currently, surprisingly many internal functions do a PyThreadState_GET for themselves, then call another function that does the same. That's wasteful, but impossible to change in the public API.

Your changes (of which I only saw a very limited subset) seem to follow a simple rule: public API functions call PyThreadState_GET, and then call internal functions that pass it around. That sounds beautifully easy to explain! Later, we'll just need to find a way to make the tstate API public (and opt-in).

The "per-interpreter None", however, is a different issue. I don't see how that can be done without breaking the stable ABI. I still think immortal immutable objects could be shared across interpreters.
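A rough sketch of that rule, with illustrative function names (they are not actual CPython APIs): the public entry point looks the thread state up once, and the internal variant takes it explicitly.

    /* Sketch of the pattern: the public function keeps its signature and does
     * the lookup; the internal _Py* variant receives tstate explicitly.
     * Function names are illustrative only. */
    static PyObject *
    _PySomething_DoIt(PyThreadState *tstate, PyObject *obj)
    {
        if (_Py_EnterRecursiveCall(tstate, " in something")) {
            return NULL;
        }
        /* ... real work, using tstate->interp for per-interpreter state ... */
        _Py_LeaveRecursiveCall(tstate);
        Py_RETURN_NONE;
    }

    PyObject *
    PySomething_DoIt(PyObject *obj)        /* public API: signature unchanged */
    {
        PyThreadState *tstate = PyThreadState_Get();
        return _PySomething_DoIt(tstate, obj);
    }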
--
I started to modify internal C functions to pass explicitly "tstate" when calling C functions: the Python thread state (PyThreadState). Example of C code (after my changes):
    if (_Py_EnterRecursiveCall(tstate, " while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    _Py_LeaveRecursiveCall(tstate);
    return _Py_CheckFunctionResult(tstate, callable, result, NULL);
In Python 3.8, the tstate is implicit:
    if (Py_EnterRecursiveCall(" while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    Py_LeaveRecursiveCall();
    return _Py_CheckFunctionResult(callable, result, NULL);
There are different reasons to pass explicitly tstate, but my main motivation is to rework Python code base to move away from implicit global states to states passed explicitly, to implement the PEP 554 "Multiple Interpreters in the Stdlib". In short, the final goal is to run multiple isolated Python interpreters in the same process: run pure Python code on multiple CPUs in parallel with a single process (whereas multiprocessing runs multiple processes).
Currently, subinterpreters are a hack: they still share a lot of things, the code base is not ready to implement isolated interpreters with one "GIL" (interpreter lock) per interpreter, and to run multiple interpreters in parallel. Many _PyRuntimeState fields (the global _PyRuntime variable) should be moved to PyInterpreterState (or maybe PyThreadState): per interpreter.
Another simpler but more annoying example are Py_None and Py_True singletons which are globals. We cannot share these singletons between interpreters because updating their reference counter would be a performance bottleneck. If we put a "superglobal-GIL" to ensure that Py_None reference counter remains consistent, it would basically "serialize" all threads, rather than running them in parallel.
The idea of passing tstate to internal C functions is to prepare code to get the per-interpreter None from tstate.
tstate is basically the "root" to access all states which are per interpreter. For example, PyInterpreterState can be read from tstate->interp.
Right now, tstate is only passed to a few functions, but you should expect to see it passed to way more functions later, once more structures will be moved to PyInterpreterState.
--
On my latest merged PR 17052 ("Add _PyObject_VectorcallTstate()"), Mark Shannon wrote: "I don't see how this could ever be faster, nor do I see how it is more correct." https://github.com/python/cpython/pull/17052#issuecomment-552538438
Currently, tstate is obtained using these internal APIs:
    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
or using public APIs:
    PyAPI_FUNC(PyThreadState *) PyThreadState_Get(void);
    #define PyThreadState_GET() PyThreadState_Get()
I dislike _PyThreadState_GET() for 2 reasons:
* it relies on the _PyRuntime global variable: I would prefer to avoid global variables
* it uses an atomic operation, which can become a performance issue when more and more code requires tstate
--
An alternative would be to use PyGILState_GetThisThreadState(), which uses a thread-local storage (TLS) variable to get the Python thread state ("tstate"), rather than the _PyRuntime atomic variable. Except that the PyGILState API doesn't support subinterpreters yet :-(
https://bugs.python.org/issue15751 "Support subinterpreters in the GIL state API" has been open since 2012.
Note: While the GIL is released, _PyThreadState_GET() is NULL, whereas PyGILState_GetThisThreadState() is non-NULL.
--
Links:
* https://pythoncapi.readthedocs.io/runtime.html : my notes on moving globals to per-interpreter states
* https://bugs.python.org/issue36710
* https://bugs.python.org/issue38644
Victor

On 11/12/19 2:03 PM, Victor Stinner wrote:
Hi,
Are you ok to modify internal C functions to pass explicitly tstate?
I did exactly that in the Gilectomy prototype. Pulling it out of TLS was too slow, and storing it in a global wouldn't work with multiple actually-concurrent threads. //arry/

On Wed, Nov 13, 2019 at 14:28, Larry Hastings <larry@hastings.org> wrote:
I did exactly that in the Gilectomy prototype. Pulling it out of TLS was too slow,
What do you mean? Getting tstate from a TLS was a performance bottleneck by itself? Reading a TLS variable seems to be quite efficient. Mark Shannon wrote: "The current means of accessing the thread state does seem rather convoluted, whereas accessing from a thread local is quite efficient (at least with GCC) https://godbolt.org/z/z-vNPN"
https://github.com/python/cpython/pull/17052#issuecomment-552538438

Copy of his C code:
"""
extern __thread int extern_tl;
int get_extern_thread_local(void) {
    return extern_tl;
}

__thread int tl;
int get_thread_local(void) {
    return tl;
}
"""

And the generated assembly (by the godbolt.org service):
"""
get_extern_thread_local():
    mov rax, QWORD PTR extern_tl@gottpoff[rip]
    mov eax, DWORD PTR fs:[rax]
    ret
get_thread_local():
    mov eax, DWORD PTR fs:tl@tpoff
    ret
tl:
    .zero 4
"""

A TLS variable read is basically one or two MOV instructions in Intel x86 assembly (using GCC 9.2).

With a friend, I looked at the assembly to read and write atomic variables. In short, only the write requires a memory fence, whereas the read is basically just a MOV (again, on Intel x86).

    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)

_PyThreadState_GET() uses "_Py_atomic_load_relaxed". I'm not used to C99 atomic conventions. The "memory_order_relaxed" documentation says: "Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)"

Note: I'm not even sure why Python currently uses an atomic operation here. Why not just a regular global variable? But if we change something, I would prefer to move to a TLS variable instead, to support subinterpreters.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
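For comparison, a minimal sketch (not CPython code; the names are made up) of what a TLS-based current thread state would look like with the GCC/Clang __thread extension:

    /* Sketch only: a __thread-based "current thread state".  With GCC/Clang on
     * x86-64 the read below compiles down to one or two MOV instructions, with
     * no atomic instruction and no memory fence. */
    static __thread PyThreadState *_py_tstate_current_tls;

    static inline PyThreadState *
    _py_tstate_get_tls(void)
    {
        return _py_tstate_current_tls;
    }

    static inline void
    _py_tstate_set_tls(PyThreadState *tstate)
    {
        _py_tstate_current_tls = tstate;
    }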

On 11/13/19 5:52 AM, Victor Stinner wrote:
On Wed, Nov 13, 2019 at 14:28, Larry Hastings <larry@hastings.org> wrote:
I did exactly that in the Gilectomy prototype. Pulling it out of TLS was too slow,
What do you mean? Getting tstate from a TLS was a performance bottleneck by itself? Reading a TLS variable seems to be quite efficient.
I'm pretty sure you understand the sentence "Pulling it out of TLS was too slow". At the time CPython used the POSIX APIs for accessing thread local storage, and I didn't know about and therefore did not try this "__thread" GCC extension. I do remember trying some other API that was purported to be faster--maybe a GCC library function for faster TLS access?--but I didn't get that to work either before I gave up on it out of frustration.

Also, I dimly recall that I moved several things from globals into the ThreadState structure, and probably added one or two of my own. So nearly every function call was referencing ThreadState at one point or another. Passing it as a parameter was a definite win over calling the POSIX TLS APIs.

I also took the opportunity to pass my "reference count manager" data as a separate parameter, which again was per-thread and again was a major win at the time.

//arry/

On Thu, Nov 14, 2019 at 04:55, Larry Hastings <larry@hastings.org> wrote:
I'm pretty sure you understand the sentence "Pulling it out of TLS was too slow". At the time CPython used the POSIX APIs for accessing thread local storage, and I didn't know about and therefore did not try this "__thread" GCC extension. I do remember trying some other API that was purported to be faster--maybe a GCC library function for faster TLS access?--but I didn't get that to work either before I gave up on it out of frustration.
I asked for confirmation because I was surprised. But when I looked at the assembly with my friend, we played with __thread, not with pthread_getspecific(). So thanks for confirming that "getting tstate" can be a performance bottleneck: that's a very good reason to pass it explicitly.
I also took the opportunity to pass my "reference count manager" data as a separate parameter, which again was per-thread and again was a major win at the time.
Another approach would be to pass a "PyContext*" pointer which contains tstate, but also additional fields. But I chose to stay with a direct "PyThreadState* tstate" to avoid one indirection on every tstate access. So far, tstate seems to be enough for the current code base.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
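For illustration, a hedged sketch of that "context" alternative; the struct and names below are hypothetical (and note that CPython already uses the name PyContext for contextvars, which is one reason a different name is suggested later in this thread):

    /* Hypothetical only: a wrapper "context" passed instead of tstate directly.
     * Every tstate access then goes through ctx->tstate (one extra load), which
     * is the indirection mentioned above. */
    typedef struct {
        PyThreadState *tstate;
        /* room for other fields later */
    } _PyCallContext;

    static inline PyThreadState *
    _PyCallContext_GetThreadState(_PyCallContext *ctx)
    {
        return ctx->tstate;
    }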

On 13Nov2019 1954, Larry Hastings wrote:
On 11/13/19 5:52 AM, Victor Stinner wrote:
On Wed, Nov 13, 2019 at 14:28, Larry Hastings <larry@hastings.org> wrote:
I did exactly that in the Gilectomy prototype. Pulling it out of TLS was too slow,
What do you mean? Getting tstate from a TLS was a performance bottleneck by itself? Reading a TLS variable seems to be quite efficient.
I'm pretty sure you understand the sentence "Pulling it out of TLS was too slow". At the time CPython used the POSIX APIs for accessing thread local storage, and I didn't know about and therefore did not try this "__thread" GCC extension. I do remember trying some other API that was purported to be faster--maybe a GCC library function for faster TLS access?--but I didn't get that to work either before I gave up on it out of frustration.
Also, I dimly recall that I moved several things from globals into the ThreadState structure, and probably added one or two of my own. So nearly every function call was referencing ThreadState at one point or another. Passing it as a parameter was a definite win over calling the POSIX TLS APIs.
Passing it as a parameter is also a huge win for embedders, as it gets very complicated to merge locking/threading models when the host application has its own requirements. Overall, I'm very supportive of passing context through parameters rather than implicitly through TLS. (Though we've got a long way to go before it'll be possible for embedders to not be held hostage by CPython's threading model... one step at a time! :) ) Cheers, Steve

On Wed, 13 Nov 2019 14:52:32 +0100 Victor Stinner <vstinner@python.org> wrote:
    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
_PyThreadState_GET() uses "_Py_atomic_load_relaxed". I'm not used to C99 atomic conventions. The "memory_order_relaxed" documentation says:
"Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)"
Note: I'm not even sure why Python currently uses an atomic operation.
Is it protected by a lock? If not, you need to use an atomic. Since it's theoretically possible to read the current thread state without the GIL held (though not very useful), then an atomic is required. Regards Antoine.
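For reference, a standalone illustration of a relaxed atomic load/store pair in C11 (not CPython's actual implementation, which uses its own atomic shims):

    #include <stdatomic.h>

    /* The C11 equivalent of what _Py_atomic_load_relaxed()/_store_relaxed()
     * provide: the single access is atomic, but no ordering is imposed on
     * surrounding reads and writes.  On x86 the load is a plain MOV. */
    static _Atomic(void *) tstate_current;

    void *
    load_current(void)
    {
        return atomic_load_explicit(&tstate_current, memory_order_relaxed);
    }

    void
    store_current(void *tstate)
    {
        atomic_store_explicit(&tstate_current, tstate, memory_order_relaxed);
    }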

On Thu, Nov 14, 2019, at 07:43, Antoine Pitrou wrote:
On Wed, 13 Nov 2019 14:52:32 +0100 Victor Stinner <vstinner@python.org> wrote:
    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
_PyThreadState_GET() uses "_Py_atomic_load_relaxed". I'm not used to C99 atomic conventions. The "memory_order_relaxed" documentation says:
"Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)"
Note: I'm not even sure why Python currently uses an atomic operation.
Is it protected by a lock? If not, you need to use an atomic. Since it's theoretically possible to read the current thread state without the GIL held (though not very useful), then an atomic is required.
It sounds like you are saying _PyRuntimeState_GetThreadState has two duties, then: "get this thread's thread state" (from the GIL-holding thread - how do other threads get their own thread state?), and "get the GIL-holding thread's thread state" (from a non-GIL-holding thread). The former shouldn't need atomic or locking overhead (unless the thread state can be written from other threads), even if the latter does.
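A hypothetical sketch of that split; neither function exists in CPython, and the TLS variable is assumed:

    /* Hypothetical split of the two duties described above. */
    static __thread PyThreadState *_py_tstate_tls;   /* set when this thread binds a tstate */

    /* Duty 1: "my own thread state" -- only ever written by this thread,
     * so no atomic load is needed. */
    static inline PyThreadState *
    get_my_thread_state(void)
    {
        return _py_tstate_tls;
    }

    /* Duty 2: "the GIL-holding thread's state" -- readable from any thread,
     * so the shared slot still wants an atomic (relaxed) load. */
    static inline PyThreadState *
    get_gil_holder_thread_state(_PyRuntimeState *runtime)
    {
        return (PyThreadState *)_Py_atomic_load_relaxed(
            &runtime->gilstate.tstate_current);
    }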

On 11/12/2019 5:03 PM, Victor Stinner wrote:
Hi,
Are you ok to modify internal C functions to pass explicitly tstate?
The last time we discussed this, there was pushback due to performance concerns. I don't recall if that was actually measured, or just a vague unease. I've long advocated (mostly to myself, and Larry when he would listen!) that we should do this. I agree with Petr that not breaking existing APIs is of course critical. A parallel set of APIs is needed. But the existing APIs should become thin wrappers, until Python 5000 (aka never) when they can go away. And this not only helps with being explicit, it should help with testing. No more depending on some hidden global state. Eric

Petr, Eric: sure, my question is only about the internal C functions. I have no plan to change the existing C API.

On Wed, Nov 13, 2019 at 14:52, Eric V. Smith <eric@trueblade.com> wrote:
The last time we discussed this, there was pushback due to performance concerns. I don't recall if that was actually measured, or just a vague unease.
Maybe I was the one who raised a concern about the atomic variable performance. But I never ran a benchmark on that.
I agree with Petr that not breaking existing APIs is of course critical. A parallel set of APIs is needed. But the existing APIs should become thin wrappers, until Python 5000 (aka never) when they can go away.
There is a project for a new C API for Python: https://github.com/pyhandle/hpy

I suggested adding a mandatory "context" parameter from day 1. See the current API draft; it has a "ctx" argument: https://github.com/pyhandle/hpy/blob/3266dc295b0be20b41c99f4f4e944d117b3fc87...

Example: "HPy v = HPy_Something(ctx);"

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.

I wouldn't worry too much about the singletons in this issue; they could be solved in any of several ways, all of which would be improvements conceptually -- if performance and backwards compatibility were resolved. In theory, the incr/decr pair should be delegated to the memory store, with Petr's suggestion of immortal immutables being one example. The catch is that the current scheme is really fast in the normal case; even hardcoding just True/False/None to magic addresses might be slower. You don't have to solve that just to speed up access to state variables that are not exposed directly to Python code.
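For context, a sketch of the general "immortal object" technique (not CPython's implementation, which had no immortal objects at the time): a sentinel reference count that the incref/decref fast path checks, which is exactly the extra branch whose cost is being worried about above.

    /* Sketch of the technique: objects marked with a sentinel refcount are
     * never incremented, decremented, or freed, so they could be shared
     * between interpreters without refcount contention.  Names and the
     * sentinel value are illustrative only. */
    #define SKETCH_IMMORTAL_REFCNT  ((Py_ssize_t)0x3fffffff)

    static inline void
    sketch_incref(PyObject *op)
    {
        if (op->ob_refcnt != SKETCH_IMMORTAL_REFCNT) {
            op->ob_refcnt++;
        }
    }

    static inline void
    sketch_decref(PyObject *op)
    {
        if (op->ob_refcnt != SKETCH_IMMORTAL_REFCNT && --op->ob_refcnt == 0) {
            _Py_Dealloc(op);
        }
    }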

On Wed., 13 Nov. 2019, 8:06 am Victor Stinner, <vstinner@python.org> wrote:
Hi,
Are you ok to modify internal C functions to pass explicitly tstate?
I'll join the chorus of +1's. With the work you've already done to clearly separate the public APIs from the internal ones, it's now much clearer which functions should be accepting an explicit thread state, and which ones should be looking it up implicitly. Cheers, Nick.

Victor Stinner schrieb am 12.11.19 um 23:03:
Are you ok to modify internal C functions to pass explicitly tstate?
FWIW, I started doing the same internally in Cython a while back, because like others, I also considered it wasteful to look it up all over the place, often multiple times inside of one function (usually related to try-finally and exception handling). I think it similarly makes sense inside of CPython. I would also find it reasonable to make it part of a new C-API. Stefan

On Tue, Nov 12, 2019 at 3:11 PM Victor Stinner <vstinner@python.org> wrote:
Are you ok to modify internal C functions to pass explicitly tstate?
I'm also in favor (strongly)! (no surprises there) The only concern I've heard is that on some platforms there is a measurable overhead once you hit a threshold of a specific small number of parameters. Adding this extra parameter will put some functions over that threshold. I don't have any more information than that.
There are different reasons to pass explicitly tstate, but my main motivation is to rework Python code base to move away from implicit global states to states passed explicitly, to implement the PEP 554 "Multiple Interpreters in the Stdlib". In short, the final goal is to run multiple isolated Python interpreters in the same process: run pure Python code on multiple CPUs in parallel with a single process (whereas multiprocessing runs multiple processes).
FTR, PEP 554 is explicitly independent of efforts to stop sharing the GIL between interpreters. I argue there that it is a good idea regardless. The existing functionality the PEP exposes, though, clearly benefits from better isolation between interpreters (including not sharing the GIL). :) On Thu, Nov 14, 2019 at 4:12 AM Victor Stinner <vstinner@python.org> wrote:
Another approach would be to pass a "PyContext*" pointer which contains tstate, but also additional fields. But I chose to stay with a direct "PyThreadState* tstate" to avoid one indirection on every tstate access. So far, tstate seems to be enough for the current code base.
FWIW, I favor this approach as well. As long as it is an opaque type, a PyContext allows us to be more flexible in adapting to the future. For now it could even be a simple alias for PyThreadState. Regardless, I'm not convinced that using a PyContext will have a real impact on runtime performance.

Also, we already use "context" in a number of ways in Python. So "PyContext" might not be the best name. It probably needs to be a name without "context" in it, or one with a concrete clue (e.g. "PyRuntimeContext").

Anyway, thanks for driving this discussion, Victor!

-eric

On Sat., 16 Nov. 2019, 7:29 am Eric Snow, <ericsnowcurrently@gmail.com> wrote:
On Thu, Nov 14, 2019 at 4:12 AM Victor Stinner <vstinner@python.org> wrote:
Another approach would be to pass a "PyContext*" pointer which contains tstate, but also additional fields. But I chose to stay with a direct "PyThreadState* tstate" to avoid one indirection on every tstate access. So far, tstate seems to be enough for the current code base.
FWIW, I favor this approach as well. As long as it is an opaque type, a PyContext allows us to be more flexible in adapting to the future. For now it could even be a simple alias for PyThreadState. Regardless, I'm not convinced that using a PyContext will have a real impact on runtime performance.
Also, we already use "context" in a number of ways in Python. So "PyContext" might not be the best name. It probably needs to be a name without "context" in it, or one with a concrete clue (e.g. "PyRuntimeContext").
I think we should just stick with "PyThreadState", as that makes it clear that in normal circumstances, it means "the Python State for the currently running Thread". If a function accepting this parameter needs to call back into Python code, or invokes a function pointer that might call back into the public C API, it's going to need to enforce that assumption by switching the active thread state if necessary.

You can already navigate from the thread state to the interpreter state and runtime state, so it should cover everything that we need.

Cheers, Nick.

As you know, I'm skeptical that PEP 554 will produce benefits that are worth the effort, but let's assume for the moment that it is, and we're all 100% committed to moving all globals into the threadstate. Even given that, the motivation for this change seems a bit unclear to me.

I guess the possible goals are:
- Get rid of the "ambient" threadstate entirely
- Make accessing the threadstate faster

For the first goal, I don't think this is possible, or desirable. Obviously if we remove the GIL somehow then at a minimum we'll need to make the global threadstate a thread-local. But I think we'll always have to keep it around as a thread-local, at least, because there are situations where you simply cannot pass in the threadstate as an argument.

One example comes up when doing FFI: there are C libraries that take callbacks, and will run them later in some arbitrary thread. When wrapping these in Python, we need a way to bundle up a Python function into a C function that can be called from any thread. So, ctypes and cffi and cython all have ways to do this bundling, and they all start with some delicate dance to figure out whether or not the current thread holds the GIL, acquiring the GIL if not, then checking whether or not this thread has a Python threadstate assigned, creating it if not, etc. This is completely dependent on having the threadstate available in ambient context. If threadstates were always passed as arguments, then it would become impossible to wrap these C libraries. So we can't do that.

That said, it's fine – even if we do remove the GIL, we still won't have a *single OS thread* executing code from two different interpreters at the same time! So storing the threadstate in a thread-local is fine, and we can keep the ability to grab the threadstate at any moment, regardless of whether it was passed as an argument.

But that means the only reason for passing the threadstate around as an argument is if it's faster than looking it up. And AFAICT, no-one in this thread actually knows if that's true? You mentioned that there's an "atomic operation" there currently, but I think on x86 at least _Py_atomic_load_relaxed is literally a no-op. Larry did some experiments with the old pthreads thread-local storage API, but no-one seems to have done any measurements on the new, much-faster thread-local storage API, and no-one's done any measurements of the cost of passing around threadstates explicitly. For all we know, passing the threadstate around is actually slower than looking it up every time. And we don't even know yet whether the threadstate even will move into thread-local storage.

It seems a bit weird to start doing massive internal refactoring before measuring those things.

-n

On Tue, Nov 12, 2019 at 2:03 PM Victor Stinner <vstinner@python.org> wrote:
Hi,
Are you ok to modify internal C functions to pass explicitly tstate?
--
I started to modify internal C functions to pass explicitly "tstate" when calling C functions: the Python thread state (PyThreadState). Example of C code (after my changes):
    if (_Py_EnterRecursiveCall(tstate, " while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    _Py_LeaveRecursiveCall(tstate);
    return _Py_CheckFunctionResult(tstate, callable, result, NULL);
In Python 3.8, the tstate is implicit:
    if (Py_EnterRecursiveCall(" while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    Py_LeaveRecursiveCall();
    return _Py_CheckFunctionResult(callable, result, NULL);
There are different reasons to pass explicitly tstate, but my main motivation is to rework Python code base to move away from implicit global states to states passed explicitly, to implement the PEP 554 "Multiple Interpreters in the Stdlib". In short, the final goal is to run multiple isolated Python interpreters in the same process: run pure Python code on multiple CPUs in parallel with a single process (whereas multiprocessing runs multiple processes).
Currently, subinterpreters are a hack: they still share a lot of things, the code base is not ready to implement isolated interpreters with one "GIL" (interpreter lock) per interpreter, and to run multiple interpreters in parallel. Many _PyRuntimeState fields (the global _PyRuntime variable) should be moved to PyInterpreterState (or maybe PyThreadState): per interpreter.
Another simpler but more annoying example are Py_None and Py_True singletons which are globals. We cannot share these singletons between interpreters because updating their reference counter would be a performance bottleneck. If we put a "superglobal-GIL" to ensure that Py_None reference counter remains consistent, it would basically "serialize" all threads, rather than running them in parallel.
The idea of passing tstate to internal C functions is to prepare code to get the per-interpreter None from tstate.
tstate is basically the "root" to access all states which are per interpreter. For example, PyInterpreterState can be read from tstate->interp.
Right now, tstate is only passed to a few functions, but you should expect to see it passed to way more functions later, once more structures will be moved to PyInterpreterState.
--
On my latest merged PR 17052 ("Add _PyObject_VectorcallTstate()"), Mark Shannon wrote: "I don't see how this could ever be faster, nor do I see how it is more correct." https://github.com/python/cpython/pull/17052#issuecomment-552538438
Currently, tstate is obtained using these internal APIs:
    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
or using public APIs:
    PyAPI_FUNC(PyThreadState *) PyThreadState_Get(void);
    #define PyThreadState_GET() PyThreadState_Get()
I dislike _PyThreadState_GET() for 2 reasons:
* it relies on the _PyRuntime global variable: I would prefer to avoid global variables
* it uses an atomic operation, which can become a performance issue when more and more code requires tstate
--
An alternative would be to use PyGILState_GetThisThreadState(), which uses a thread-local storage (TLS) variable to get the Python thread state ("tstate"), rather than the _PyRuntime atomic variable. Except that the PyGILState API doesn't support subinterpreters yet :-(
https://bugs.python.org/issue15751 "Support subinterpreters in the GIL state API" has been open since 2012.
Note: While the GIL is released, _PyThreadState_GET() is NULL, whereas PyGILState_GetThisThreadState() is non-NULL.
--
Links:
* https://pythoncapi.readthedocs.io/runtime.html : my notes on moving globals to per-interpreter states
* https://bugs.python.org/issue36710
* https://bugs.python.org/issue38644
Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
-- Nathaniel J. Smith -- https://vorpus.org
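For concreteness, the "delicate dance" described above typically looks something like this generic sketch built on the public PyGILState API (not ctypes/cffi/Cython's actual code):

    /* Generic sketch: wrap a Python callable so a C library can invoke it from
     * an arbitrary thread.  It depends entirely on ambient state: the GIL
     * state API finds (or creates) a thread state for whatever OS thread the
     * callback happens to run on. */
    static void
    c_callback_trampoline(void *user_data)
    {
        PyObject *py_func = (PyObject *)user_data;

        /* Acquire the GIL and ensure this OS thread has a Python thread
         * state, creating one if the thread has never run Python code. */
        PyGILState_STATE gstate = PyGILState_Ensure();

        PyObject *result = PyObject_CallObject(py_func, NULL);
        if (result == NULL) {
            PyErr_WriteUnraisable(py_func);  /* can't propagate into the C library */
        }
        Py_XDECREF(result);

        /* Restore the previous GIL / thread state situation. */
        PyGILState_Release(gstate);
    }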

On AMD64 Linux, the thread-local data seems to be accessed through a dedicated segment register (FS, as in the generated assembly shown earlier in the thread)[1]. It seems likely other platforms and other operating systems could do something similar.

Passing threadstate as an explicit argument could be either faster or slower depending on how often you use it. If you use threadstate often, passing it explicitly (which likely uses a CPU register) could be a win. If you use it rarely, that CPU register would be better utilized for passing function arguments you actually use.

Doing some experiments with optimized (i.e. using platform specific) TLS would seem a useful step before undertaking a major refactoring. Explicit passing could be a lot of code churn for no practical gain.

1. https://stackoverflow.com/questions/6611346/how-are-the-fs-gs-registers-used...

On Sat, Nov 16, 2019 at 20:55, Neil Schemenauer <nas-python@arctrix.com> wrote:
If you use threadstate often, passing it explicitly (which likely uses a CPU register) could be a win. If you use it rarely, that CPU register would be better utilized for passing function arguments you actually use.
Currently, I would say that it's used "rarely". But if we want to implement subinterpreters, we will have to use it way more often. Since each interpreter must have its isolated namespace, I expect that even 1+1 will need tstate to get the "2" singleton from its private namespace, rather than using a "global" singleton. Basically, all builtin types and all builtin modules should be modified to have one namespace per interpreter.

For C extensions, it's an old project to have a "state" passed to module functions, and thus be able to have two separate instances of the same C extension, rather than a single global namespace. Examples:
https://www.python.org/dev/peps/pep-0489/
https://www.python.org/dev/peps/pep-0573/

I would like to implement subinterpreters. IMHO the project is feasible and, if it works, it would make Python more competitive with other programming languages! IMHO fixing the C API (or writing a new one) and subinterpreters are the two most feasible and most realistic projects to optimize CPython right now.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
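A small, simplified sketch of the per-module-state direction those PEPs describe (multi-phase init plus a module state struct); a real module would also define m_traverse/m_clear/m_free for the state:

    /* Simplified sketch of a C extension carrying per-module state (PEP 489
     * multi-phase initialization), so two interpreters can each get an
     * independent instance of the module instead of sharing C-level globals. */
    #include <Python.h>

    typedef struct {
        PyObject *cached_value;   /* example of formerly-global state */
    } demo_state;

    static int
    demo_exec(PyObject *module)
    {
        demo_state *state = (demo_state *)PyModule_GetState(module);
        state->cached_value = PyLong_FromLong(0);
        return (state->cached_value == NULL) ? -1 : 0;
    }

    static PyModuleDef_Slot demo_slots[] = {
        {Py_mod_exec, demo_exec},
        {0, NULL},
    };

    static struct PyModuleDef demo_def = {
        PyModuleDef_HEAD_INIT,
        .m_name = "demo",
        .m_size = sizeof(demo_state),   /* non-negative size enables per-module state */
        .m_slots = demo_slots,
    };

    PyMODINIT_FUNC
    PyInit_demo(void)
    {
        return PyModuleDef_Init(&demo_def);
    }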

On Sat., 16 Nov. 2019, 8:26 am Nathaniel Smith, <njs@pobox.com> wrote:
As you know, I'm skeptical that PEP 554 will produce benefits that are worth the effort, but let's assume for the moment that it is, and we're all 100% committed to moving all globals into the threadstate. Even given that, the motivation for this change seems a bit unclear to me.
I guess the possible goals are:
- Get rid of the "ambient" threadstate entirely - Make accessing the threadstate faster
- Eventually make it easier for CPython maintainers to know which functions require access to a live thread state, and which are stateless helper functions
- Eventually make it easier for embedding applications to control which Python code runs in which thread state by moving the thread state activation dance out of the application and into the CPython shared library

(We actually broke the thread state activation in hexchat not that long ago - there was a subtle latent defect in how they were handling it, and the changes to interpreter cleanup escalated it to a full blown crash)

The need for the implicit thread state is never going to go away, but there are definitely opportunities to make the way we manage it less bug prone. (e.g. In the HPy work, I would expect each handle to be at least bound to an interpreter, and there could even be a higher level construct to associate callbacks with a specific thread state)

Cheers, Nick.

On Sun, Nov 17, 2019 at 1:58 PM Nick Coghlan <ncoghlan@gmail.com> wrote:
On Sat., 16 Nov. 2019, 8:26 am Nathaniel Smith, <njs@pobox.com> wrote:
As you know, I'm skeptical that PEP 554 will produce benefits that are worth the effort, but let's assume for the moment that it is, and we're all 100% committed to moving all globals into the threadstate. Even given that, the motivation for this change seems a bit unclear to me.
I guess the possible goals are:
- Get rid of the "ambient" threadstate entirely - Make accessing the threadstate faster
- Eventually make it easier for CPython maintainers to know which functions require access to a live thread state, and which are stateless helper functions
So the idea would be that eventually we'd remove all uses of implicit state lookup inside CPython, and add some kind of CI check to make sure that they're never used?
- Eventually make it easier for embedding applications to control which Python code runs in which thread state by moving the thread state activation dance out of the application and into the CPython shared library
That seems like a good goal, but I don't understand how it's related to passing threadstate explicitly as a function argument. If the plan is to move towards passing threadstates both implicitly AND explicitly everywhere, that seems like it would make things more error-prone, not less, because the two states could get out of sync. Could you elaborate? -n -- Nathaniel J. Smith -- https://vorpus.org

On Mon., 18 Nov. 2019, 8:19 am Nathaniel Smith, <njs@pobox.com> wrote:
- Eventually make it easier for embedding applications to control which Python code runs in which thread state by moving the thread state activation dance out of the application and into the CPython shared library
That seems like a good goal, but I don't understand how it's related to passing threadstate explicitly as a function argument. If the plan is to move towards passing threadstates both implicitly AND explicitly everywhere, that seems like it would make things more error-prone, not less, because the two states could get out of sync. Could you elaborate?
What I said in my original reply: if an API that accepts an explicit thread state ever calls an API that expects an implicit one, we'll need to internally implement the dance to activate the supplied thread state before making that call.

At the moment, we expect callers of the public API to do that dance, and it's tricky to get it right in all cases. My hope (and it's a subjective hope, not an objective fact) is that implementing the dance more often ourselves will help us identify future abstractions that will make the public API easier to use correctly in multi-threaded applications.

Cheers, Nick.
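A rough sketch of that internal dance using the public thread state swap API (illustrative only, not a proposal for specific CPython code):

    /* Illustrative only: activate a caller-supplied tstate before calling into
     * an API that relies on the implicit (ambient) thread state, then restore
     * the previous one.  Real code would also have to manage the GIL. */
    static PyObject *
    call_with_explicit_tstate(PyThreadState *tstate, PyObject *func)
    {
        PyThreadState *saved = PyThreadState_Swap(tstate);   /* make it current */
        PyObject *result = PyObject_CallObject(func, NULL);  /* uses implicit tstate */
        PyThreadState_Swap(saved);                           /* restore */
        return result;
    }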

On Fri, 15 Nov 2019 14:21:53 -0800 Nathaniel Smith <njs@pobox.com> wrote:
As you know, I'm skeptical that PEP 554 will produce benefits that are worth the effort, but let's assume for the moment that it is, and we're all 100% committed to moving all globals into the threadstate. Even given that, the motivation for this change seems a bit unclear to me.
I guess the possible goals are:
- Get rid of the "ambient" threadstate entirely - Make accessing the threadstate faster
For the first goal, I don't think this is possible, or desirable. Obviously if we remove the GIL somehow then at a minimum we'll need to make the global threadstate a thread-local. But I think we'll always have to keep it around as a thread-local, at least, because there are situations where you simply cannot pass in the threadstate as an argument. One example comes up when doing FFI: there are C libraries that take callbacks, and will run them later in some arbitrary thread. When wrapping these in Python, we need a way to bundle up a Python function into a C function that can be called from any thread. So, ctypes and cffi and cython all have ways to do this bundling, and they all start with some delicate dance to figure out whether or not the current thread holds the GIL, acquiring the GIL if not, then checking whether or not this thread has a Python threadstate assigned, creating it if not, etc. This is completely dependent on having the threadstate available in ambient context. If threadstates were always passed as arguments, then it would become impossible to wrap these C libraries.
Most well-designed C libraries let you pass an additional "void*" parameter for user callbacks to be called with. A couple of them don't, unfortunately (OpenSSL perhaps? I don't remember). Regards Antoine.

On Mon, Nov 18, 2019, at 05:26, Antoine Pitrou wrote:
For the first goal, I don't think this is possible, or desirable. Obviously if we remove the GIL somehow then at a minimum we'll need to make the global threadstate a thread-local. But I think we'll always have to keep it around as a thread-local, at least, because there are situations where you simply cannot pass in the threadstate as an argument. One example comes up when doing FFI: there are C libraries that take callbacks, and will run them later in some arbitrary thread. When wrapping these in Python, we need a way to bundle up a Python function into a C function that can be called from any thread. So, ctypes and cffi and cython all have ways to do this bundling, and they all start with some delicate dance to figure out whether or not the current thread holds the GIL, acquiring the GIL if not, then checking whether or not this thread has a Python threadstate assigned, creating it if not, etc. This is completely dependent on having the threadstate available in ambient context. If threadstates were always passed as arguments, then it would become impossible to wrap these C libraries.
Most well-designed C libraries let you pass an additional "void*" parameter for user callbacks to be called with. A couple of them don't, unfortunately (OpenSSL perhaps? I don't remember).
I think you've missed the fact that the C library runs the callback on an arbitrary thread. The threadstate associated with the thread that made the original call is therefore *not the one you want*; you want a threadstate associated with the thread the callback is run on. Alternately, if a thread state is not in any sense associated with a thread (would these situations then mean you simply always create a brand-new interpreter state?), maybe it shouldn't be called a thread state at all.

On Mon, 18 Nov 2019 12:39:00 -0500 Random832 <random832@fastmail.com> wrote:
On Mon, Nov 18, 2019, at 05:26, Antoine Pitrou wrote:
For the first goal, I don't think this is possible, or desirable. Obviously if we remove the GIL somehow then at a minimum we'll need to make the global threadstate a thread-local. But I think we'll always have to keep it around as a thread-local, at least, because there are situations where you simply cannot pass in the threadstate as an argument. One example comes up when doing FFI: there are C libraries that take callbacks, and will run them later in some arbitrary thread. When wrapping these in Python, we need a way to bundle up a Python function into a C function that can be called from any thread. So, ctypes and cffi and cython all have ways to do this bundling, and they all start with some delicate dance to figure out whether or not the current thread holds the GIL, acquiring the GIL if not, then checking whether or not this thread has a Python threadstate assigned, creating it if not, etc. This is completely dependent on having the threadstate available in ambient context. If threadstates were always passed as arguments, then it would become impossible to wrap these C libraries.
Most well-designed C libraries let you pass an additional "void*" parameter for user callbacks to be called with. A couple of them don't, unfortunately (OpenSSL perhaps? I don't remember).
I think you've missed the fact that the C library runs the callback on an arbitrary thread. The threadstate associated with the thread that made the original call is therefore *not the one you want*; you want a threadstate associated with the thread the callback is run on.
Ah, right, I had overlooked that mention. This does complicate things a bit. In that case you would want to pass the interpreter state and then use this particular interpreter's mapping of OS thread to threadstate. (assuming that per-interpreter mapping exists, which is another question; but it will have to exist at some point for PEP 554) Regards Antoine.
participants (13)
- Antoine Pitrou
- Eric Snow
- Eric V. Smith
- Jim J. Jewett
- Larry Hastings
- Nathaniel Smith
- Neil Schemenauer
- Nick Coghlan
- Petr Viktorin
- Random832
- Stefan Behnel
- Steve Dower
- Victor Stinner