Should we pass tstate in the VECTORCALL calling convention before making it public?

Hi,

I started to modify Python internals to pass the Python thread state ("tstate") explicitly to internal C functions: https://vstinner.github.io/cpython-pass-tstate.html

Until subinterpreters are fully implemented, it's unclear whether getting the current Python thread state using the _PyThreadState_GET() macro (which uses an atomic read) will remain efficient. For example, right now the "GIL state" API doesn't support subinterpreters: fixing this may require adding a lock somewhere, which could make _PyThreadState_GET() less efficient. Sorry, I don't have numbers, since I haven't experimented with these changes yet: I was blocked by other issues. We can only guess at this point.

To me, it sounds like good practice to pass the current Python thread state to internal C functions. Most core developers seemed to agree in my previous python-dev thread "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV...

The question is now whether we should "propagate" tstate to function calls in the latest VECTORCALL calling convention (which is currently private). Petr Viktorin plans to make the VECTORCALL APIs public in Python 3.9, as planned in PEP 590: https://bugs.python.org/issue39245

I explicitly added Stefan Behnel in copy, since Cython would be directly impacted by such a change. Cython is the kind of project which may benefit from having tstate accessible directly.

I started to move more and more things from "globals" to "per interpreter". For example, the garbage collector is now "per interpreter" (it lives in PyInterpreterState). Small integer singletons are now also "per interpreter": int("1") now gives a different object in each interpreter, whereas these objects were shared previously. Later, even the "None" singleton (and all other singletons) should be made "per interpreter". Getting a "per interpreter" object requires starting from the Python thread state: call _PyThreadState_GET().
Avoiding _PyThreadState_GET() calls reduces the risk of making Python slower with the incoming subinterpreter changes.

For the long term, the goal is for each subinterpreter to have its own isolated world: no Python object should be shared, no state should be shared. The intent is to avoid any need for locking, to maximize performance when running interpreters in parallel. Obviously, each interpreter will have its own private GIL ;-) Py_INCREF() and Py_DECREF() would require slow atomic operations if Python objects were shared. If objects are not shared between interpreters, Py_INCREF() and Py_DECREF() can remain as fast as they are today. Any additional locking may kill performance.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.

On 08Jan2020 0746, Victor Stinner wrote:
The question is now if we should "propagate" tstate to function calls in the latest VECTORCALL calling convention (which is currently private).

+1, for all the same reasons I'm +1 on passing it to C functions explicitly. (In short, embedding can be very painful when you have to manipulate ambient thread state to make Python code work, and this is a good step towards getting rid of that requirement.)
Cheers, Steve

On Thu, 9 Jan 2020 at 01:49, Victor Stinner <vstinner@python.org> wrote:
The question is now if we should "propagate" tstate to function calls in the latest VECTORCALL calling convention (which is currently private). Petr Viktorin plans to make VECTORCALL APIs public in Python 3.9, as planned in the PEP 590: https://bugs.python.org/issue39245
The vectorcall convention places a significant emphasis on speed, so being able to do a single PyThreadState_Get() call on the initial transition from Python to C, and then pass the thread state explicitly after that, would make a lot of sense.

So I think this is a good idea, but we might need to place a caveat on Python 3.9 that passing in a thread state that is NOT the currently active thread state may not work correctly yet (it's going to take a while to make sure that we never inadvertently call back into "implicit thread state" APIs from the implementation of APIs that accept an explicit thread state, and even after CPython itself is fixed, there's no telling what other extension modules might be doing).

The transition to the full public vectorcall API already keeps the underscore-prefixed versions around as aliases, so if the public API changes to accept an explicit thread state, then the underscore-prefixed versions can be kept as wrapper functions that implicitly use the current thread state.

Cheers, Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le mer. 8 janv. 2020 à 22:15, Nick Coghlan <ncoghlan@gmail.com> a écrit :
The vectorcall convention places a significant emphasis on speed, so being able to do a single PyThreadState_Get() call on the initial transition from Python to C, and then pass the thread state explicitly after that would make a lot of sense.
Honestly, I'm not 100% sure that performance is a good argument in favor of adding tstate to the calling convention, since we cannot yet measure the performance overhead of calling _PyThreadState_GET() frequently, especially with hypothetical changes to the GIL state API. I can only "feel" that it might be a performance issue tomorrow.

For me a stronger argument is that it should help to make Python more "correct" in the case of subinterpreters: it helps to get the right state, since today too many things are "implicit" (they rely on "globals").
So I think this is a good idea, but we might need to place a caveat on Python 3.9 that passing in a thread state that is NOT the currently active thread state may not work correctly yet (it's going to take a while to make sure that we never inadvertently call back into "implicit thread state" APIs from the implementation of APIs that accept an explicit thread state, and even after CPython itself is fixed, there's no telling what other extension modules might be doing).
In terms of API, we may keep the current *API* unchanged: pass the current Python thread state implicitly. For example, _PyObject_Vectorcall() doesn't take a tstate argument, but gets the current Python thread state:

    static inline PyObject *
    _PyObject_Vectorcall(PyObject *callable, PyObject *const *args,
                         size_t nargsf, PyObject *kwnames)
    {
        PyThreadState *tstate = PyThreadState_GET();
        return _PyObject_VectorcallTstate(tstate, callable, args,
                                          nargsf, kwnames);
    }

(I added _PyObject_VectorcallTstate() recently :-))

We can keep this prototype for the public API, but have a *private* API which allows passing tstate. My idea is that internally, tstate would be passed *explicitly*. For example:

    res = PyObject_Vectorcall(func, ...);            // public API, 3rd party code

vs

    res = _PyObject_Vectorcall(tstate, func, ...);   // private API, Python internals

My first question is about the calling convention, not the API ;-)

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.

On 08/01/2020 3:46 pm, Victor Stinner wrote:
Hi,
I started to modify Python internals to pass the Python thread state ("tstate") explicitly to internal C functions: https://vstinner.github.io/cpython-pass-tstate.html
Passing the thread state explicitly creates a new class of errors that was not there before. What happens if the tstate argument is NULL, or points to a different thread?
Until subinterpreters will be fully implemented, it's unclear if getting the current Python thread state using _PyThreadState_GET() macro (which uses an atomic read) will remain efficient. For example, right now, the "GIL state" API doesn't support subinterpreters: fixing this may require to add a lock somewhere which may make _PyThreadState_GET() less efficient. Sorry, I don't have numbers, since I didn't experiment to implement these changes yet: I was blocked by other issues. We can only guess at this point.
There is a one-to-one correspondence between Python threads and O/S threads. So the threadstate can, and should, be stored in a thread-local variable. Accessing thread-local storage is fast: x86-64 uses the fs register to point to it, whereas ARM dedicates R15 (I think). Example code for x64 using C99: https://godbolt.org/z/AZp9TS C11 provides the performance of the static case with the advantages of external linkage: https://godbolt.org/z/Xth7vb
To me, it sounds like a good practice to pass the current Python thread state to internal C functions. It seems like most core developers agreed with that in my previous python-dev thread "Pass the Python thread state to internal C functions":
I don't :)
https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV...
The question is now if we should "propagate" tstate to function calls in the latest VECTORCALL calling convention (which is currently private). Petr Viktorin plans to make VECTORCALL APIs public in Python 3.9, as planned in the PEP 590: https://bugs.python.org/issue39245
I say no. Passing the thread state was an option that I rejected when designing PEP 590.
I added explicitly Stefan Behnel in copy, since Cython should be directly impacted by such change. Cython is the kind of project which may benefit of having tstate accessible directly.
I started to move more and more things from "globals" to "per interpreter". For example, the garbage collector is now "per interpreter" (it lives in PyInterpreterState). Small integer singletons are now also "per interpreter": int("1") now gives a different object in each interpreter, whereas these objects were shared previously. Later, even the "None" singleton (and all other singletons) should be made "per interpreter". Getting a "per interpreter" object requires starting from the Python thread state: call _PyThreadState_GET(). Avoiding _PyThreadState_GET() calls reduces the risk of making Python slower with the incoming subinterpreter changes.
Thread locals are not "global". Each sub-interpreter will have its own pool of threads. Each threadstate object should contain a pointer to its sub-interpreter.
For the long term, the goal is that each subinterpreter has its own isolated world: no Python object should be shared, no state should be shared. The intent is to avoid any need of locking, to maximize performance when running interpreters in parallel. Obviously, each interpreter will have its own private GIL ;-) Py_INCREF() and Py_DECREF() would require slow atomic operations if Python objects are shared. If objects are not shared between interpreters, Py_INCREF() and Py_DECREF() can remain as fast as they are today. Any additional locking may kill performance.
I agree entirely, but this is IMO unrelated to passing the thread state as an explicit extra argument.
Victor

Le jeu. 9 janv. 2020 à 13:35, Mark Shannon <mark@hotpy.org> a écrit :
Passing the thread state explicitly creates a new class of errors that was not there before. What happens if the tstate argument is NULL,
You will likely get a crash.
or points to a different thread?
No idea what will happen. If it's a different subinterpreter, bad things can happen, I guess :-) Well, I don't see this class of bug as a blocking issue. If you pass invalid data, you get a crash; I'm not surprised by that. In my previous email, I proposed to continue passing tstate implicitly when you call a function using the public C API. Only Python internals would pass tstate explicitly, and so only the internals would need careful review.
There is a one-to-one correspondence between Python threads and O/S threads.
I'm not convinced of that :-) So far, it's unclear to me how the transition from one interpreter to another happens with Python thread states.

Let's say that the main thread spawns a thread A. The main thread and thread A are running in the main interpreter. Then thread A calls _testcapi.run_in_subinterp() or _xxsubinterpreters.run_string() (or something else to run code in a subinterpreter). A subinterpreter is created and a new, different Python thread state is "attached" to thread A. Thread A now has 2 Python thread states, one per interpreter. It gets one or the other depending on which interpreter the thread is currently running in...

We have to be careful to pass the proper thread state to internal C functions while doing this dance between two interpreters in the same thread. PyThreadState_Swap(tstate) must be called to set the current Python thread state.

Full implementation of _testcapi.run_in_subinterp(); notice the PyThreadState_Swap() dance:

    /* To run some code in a sub-interpreter. */
    static PyObject *
    run_in_subinterp(PyObject *self, PyObject *args)
    {
        const char *code;
        int r;
        PyThreadState *substate, *mainstate;

        if (!PyArg_ParseTuple(args, "s:run_in_subinterp", &code))
            return NULL;

        mainstate = PyThreadState_Get();

        PyThreadState_Swap(NULL);

        substate = Py_NewInterpreter();
        if (substate == NULL) {
            /* Since no new thread state was created, there is no
               exception to propagate; raise a fresh one after swapping
               in the old thread state. */
            PyThreadState_Swap(mainstate);
            PyErr_SetString(PyExc_RuntimeError,
                            "sub-interpreter creation failed");
            return NULL;
        }
        r = PyRun_SimpleString(code);
        Py_EndInterpreter(substate);

        PyThreadState_Swap(mainstate);
        return PyLong_FromLong(r);
    }
So the threadstate can, and should, be stored in a thread local variable. Accessing thread local storage is fast. x86/64 uses the fs register to point to it, whereas ARM dedicates R15 (I think).
*If* we manage to design subinterpreters in such a way that a native thread is always associated with the same Python interpreter, moving to a thread-local variable could make Python more efficient :-) It is likely to be more efficient than the current atomic variable design.
I started to move more and more things from "globals" to "per interpreter". For example, the garbage collector is now "per interpreter" (it lives in PyInterpreterState). Small integer singletons are now also "per interpreter": int("1") now gives a different object in each interpreter, whereas these objects were shared previously. Later, even the "None" singleton (and all other singletons) should be made "per interpreter". Getting a "per interpreter" object requires starting from the Python thread state: call _PyThreadState_GET(). Avoiding _PyThreadState_GET() calls reduces the risk of making Python slower with the incoming subinterpreter changes.
Thread locals are not "global". Each sub-interpreter will have its own pool of threads. Each threadstate object should contain a pointer to its sub-interpreter.
Getting the interpreter from a Python thread state is trivial: interp = tstate->interp. The problem is how to get the current Python thread state. *Currently*, it's an atomic variable. But tomorrow, with multiple interpreters running in parallel, I expect that it's going to be more expensive: something like first getting the interpreter running in the current native thread, and then getting the Python thread state of this interpreter. Or something like that. We may get more and more indirections...

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.

On Thu, 9 Jan 2020 15:12:41 +0100 Victor Stinner <vstinner@python.org> wrote:
Getting the interpreter from a Python thread state is trivial: interp = tstate->interp.
The problem is how to get the current Python thread state. *Currently*, it's an atomic variable. But tomorrow, with multiple interpreters running in parallel, I expect that it's going to be more expensive: something like first getting the interpreter running in the current native thread, and then getting the Python thread state of this interpreter. Or something like that. We may get more and more indirections...
It can still be simple. Store the current interpreter in a thread-local variable (there can be only one interpreter running at a given time in a given OS thread, even if it's not always the same interpreter). Then have the interpreter struct contain a pointer to the current running thread *in that interpreter*. So you have one thread-local access to the current interpreter and then a normal struct member access to the current thread state. For example:

    struct _ts;

    struct _is {
        struct _ts *current_thread;
    };

    _Thread_local struct _is *current_interp;

    inline struct _is *
    _PyInterpreterState_GET()
    {
        return current_interp;
    }

    inline struct _ts *
    _PyThreadState_GET()
    {
        return current_interp->current_thread;
    }

Regards

Antoine.

On 09Jan2020 0429, Mark Shannon wrote:
There is a one-to-one correspondence between Python threads and O/S threads. So the threadstate can, and should, be stored in a thread local variable.
This is a major issue for embedders (and potentially asyncio), as it prevents integration with existing thread models or event loops. Requiring an _active_ Python thread (~GIL held) to only run on a single OS thread is fine, but tying it permanently to a single OS thread makes things very painful. (Of course, this isn't the only thing that makes it painful, but if we're going to make this a deliberate design of CPython then we are essentially excluding entire interesting classes of embedding.)
Accessing thread local storage is fast. x86/64 uses the fs register to point to it, whereas ARM dedicates R15 (I think).
The register used is OS-specific. We do (and should) use the provided TLS APIs, but these add overhead.
Thread locals are not "global". Each sub-interpreter will have its own pool of threads. Each threadstate object should contain a pointer to its sub-interpreter.
I agree with this, but it's an argument for passing PyThreadState explicitly, which seems to go against your previous point (unless you're insisting that subinterpreters *must* be tied to specific and distinct "physical" threads, in which case let's argue about that because I think you're wrong :) ). Cheers, Steve

Le jeu. 9 janv. 2020 à 19:33, Steve Dower <steve.dower@python.org> a écrit :
Requiring an _active_ Python thread (~GIL held) to only run on a single OS thread is fine, but tying it permanently to a single OS thread makes things very painful. (Of course, this isn't the only thing that makes it painful, but if we're going to make this a deliberate design of CPython then we are essentially excluding entire interesting classes of embedding.)
Do you have a use case where you reuse the same Python thread state in different native threads? Currently, the Python runtime expects to have a different Python thread state per native thread. For example, PyGILState_Ensure() creates one if there is no Python thread state associated with the current thread.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.

On 09Jan2020 1659, Victor Stinner wrote:
Le jeu. 9 janv. 2020 à 19:33, Steve Dower <steve.dower@python.org> a écrit :
Requiring an _active_ Python thread (~GIL held) to only run on a single OS thread is fine, but tying it permanently to a single OS thread makes things very painful. (Of course, this isn't the only thing that makes it painful, but if we're going to make this a deliberate design of CPython then we are essentially excluding entire interesting classes of embedding.)
Do you have a use case where you reuse the same Python thread state in different native threads?
I do, but I can't share exactly what it is (yet) :)

But essentially, the main application has its own threading setup where callbacks happen on its own thread pool. Some of those callbacks sometimes have to raise Python events, but all of those events should be serialized (i.e. *no* Python code should run in parallel, even though the app's callbacks can). Note that this is not arbitrary CPython code. It is a very restricted context (only first-party modules, no threading/ctypes/etc.).

Ideally, we'd just lock the Python thread and run the Python code and get the result back directly. Instead, we've had to set up a parallel Python thread that is constantly running, and build a complicated cross-thread marshalling setup in order to simulate synchronous calls. (And also argue with the application architects, who are very much against mixing threading models in their app, unsurprisingly.)
Currently, the Python runtime expects to have a different Python thread state per native thread. For example, PyGILState_Ensure() creates one if there is no Python thread state associated to the current thread.
And ultimately, everything above is just because of this assumption. Turns out that simply transferring the thread state and setting it in TLS is *nearly* okay most of the time, but not quite. Breaking down the assumption that each CPython thread is the only code running on its particular OS thread would help a lot here - making the CPython thread state a parameter rather than ambient state is the right direction. Cheers, Steve

Victor Stinner wrote:
I started to modify Python internals to pass the Python thread state ("tstate") explicitly to internal C functions:
Is this work summarized somewhere, like in a PEP?
The blog post describes the motivation, but not the details of why and what is changed. In particular, I am wondering what proportion of functions actually need the tstate, vs. how often it would just be adding some extra pack-pass-unpack work for no purpose.
To me, it sounds like a good practice to pass the current Python thread state to internal C functions.
Only if they need it (see above) and the passed data is at least as reliable as what they would get by using the current process. If the current process always gets the right answer, and the new process can sometimes get the wrong one (even just through human error), that seems like a high price to pay.
... Later, even "None" singleton (and all other singletons) should be made "per interpreter". [ later no objects should be shared ]
At that point, it makes sense to feed them from different memory pools, and I am not sure how much advantage there is to a separate interpreter as opposed to a separate process.

Le ven. 10 janv. 2020 à 20:44, Jim J. Jewett <jimjjewett@gmail.com> a écrit :
Is work this summarized somewhere like a PEP?
This work is related to PEP 554 -- Multiple Interpreters in the Stdlib: https://www.python.org/dev/peps/pep-0554/

Victor
participants (6)
- Antoine Pitrou
- Jim J. Jewett
- Mark Shannon
- Nick Coghlan
- Steve Dower
- Victor Stinner