data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Hi, There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions. These changes are listed under https://bugs.python.org/issue38644. I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways. The changes add bulk to the C-API and may hurt performance. These changes are also causing a lot of churn and merge conflicts (for me at least). So can we please stop making these changes, at least now? Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review. Cheers, Mark.
data:image/s3,"s3://crabby-images/f2cb6/f2cb6403da92e69ee6cc8c3fb58b22cdceb03681" alt=""
Hi,
Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review.
Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV... In short, the answer is yes. There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation. -- In the meanwhile, I modified "small integer singletons" to make them "per-interpreter". So tstate->interp is now used to get small integers in longobject.c. I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511 The long-term goal is to be able to run multiple isolated interpreters in parallel. Le lun. 16 mars 2020 à 15:16, Mark Shannon <mark@hotpy.org> a écrit :
There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions.
So far, tstate should only be passed to *internal* C functions. I don't think that the public C API has been modified to pass tstate.
These changes are listed under https://bugs.python.org/issue38644.
There was also https://bugs.python.org/issue36710
I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways.
Would you mind to elaborate?
The changes add bulk to the C-API and may hurt performance.
Did you notice that in benchmarks? I would be curious to see the overhead.
These changes are also causing a lot of churn and merge conflicts (for me at least).
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
data:image/s3,"s3://crabby-images/f2cb6/f2cb6403da92e69ee6cc8c3fb58b22cdceb03681" alt=""
There were quick discussions about using thread local storage (TLS) to get and set the current Python thread state ("tstate"), instead of reading/setting an atomic variable (_PyRuntime.gilstate.tstate_current). In fact, TLS already exists as "PyGILState" and PyGILState_GetThisThreadState() can already return the current thread state. But this API doesn't work currently with subinterpreters: * https://bugs.python.org/issue10915 * https://bugs.python.org/issue15751 It's unclear to me if fixing this issue would require to add a lock, nor if it would make PyGILState_GetThisThreadState() or _PyThreadState_GET() slower. Moreover, currently, it's possible to have two Python thread state for the same native thread. That's needed for the transition from the main interpreter to new subinterpreter. Example: --- mainstate = PyThreadState_Get(); PyThreadState_Swap(NULL); substate = Py_NewInterpreter(); r = PyRun_SimpleString(code); Py_EndInterpreter(substate); PyThreadState_Swap(mainstate); --- where Py_NewInterpreter() creates a new Python thread state using PyThreadState_New() and sets it as the current Python thread state. Maybe the subinterpreter API can evolve to never "attach" two different interpreters to the same thread: one Python thread state would only belong to a single native thread. I would be interested to explore the option of using TLS everywhere, but first we need to solve all these tricky issues. Again, so far, passing tstate explicitly is only done internally, so if we switch to another solution which doesn't require to pass tstate explicitly, we can do that without breaking the public C API. -- I hope that you now have a better overview of the current state of the Python implementation and how it evolves in the last two years. Victor Le lun. 16 mars 2020 à 16:04, Victor Stinner <vstinner@python.org> a écrit :
Hi,
Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review.
Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV...
In short, the answer is yes.
There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
--
In the meanwhile, I modified "small integer singletons" to make them "per-interpreter". So tstate->interp is now used to get small integers in longobject.c.
I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511
The long-term goal is to be able to run multiple isolated interpreters in parallel.
Le lun. 16 mars 2020 à 15:16, Mark Shannon <mark@hotpy.org> a écrit :
There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions.
So far, tstate should only be passed to *internal* C functions. I don't think that the public C API has been modified to pass tstate.
These changes are listed under https://bugs.python.org/issue38644.
There was also https://bugs.python.org/issue36710
I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways.
Would you mind to elaborate?
The changes add bulk to the C-API and may hurt performance.
Did you notice that in benchmarks? I would be curious to see the overhead.
These changes are also causing a lot of churn and merge conflicts (for me at least).
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters.
Victor -- Night gathers, and now my watch begins. It shall not end until my death.
-- Night gathers, and now my watch begins. It shall not end until my death.
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
On 16/03/2020 6:21 pm, Victor Stinner wrote:
There were quick discussions about using thread local storage (TLS) to get and set the current Python thread state ("tstate"), instead of reading/setting an atomic variable (_PyRuntime.gilstate.tstate_current).
In fact, TLS already exists as "PyGILState" and PyGILState_GetThisThreadState() can already return the current thread state. But this API doesn't work currently with subinterpreters:
Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q
* https://bugs.python.org/issue10915 * https://bugs.python.org/issue15751
It's unclear to me if fixing this issue would require to add a lock, nor if it would make PyGILState_GetThisThreadState() or _PyThreadState_GET() slower.
It doesn't require a lock, and it is only two instructions (it's 5 instructions on Windows, but that's still cheap).
Moreover, currently, it's possible to have two Python thread state for the same native thread. That's needed for the transition from the main interpreter to new subinterpreter. Example: --- mainstate = PyThreadState_Get();
PyThreadState_Swap(NULL);
substate = Py_NewInterpreter(); r = PyRun_SimpleString(code); Py_EndInterpreter(substate);
PyThreadState_Swap(mainstate); ---
The visibility of a thread local variable is a strict superset of that of a function local variable. Anything you can do with a function local variable, you can do with a thread local variable.
where Py_NewInterpreter() creates a new Python thread state using PyThreadState_New() and sets it as the current Python thread state.
Maybe the subinterpreter API can evolve to never "attach" two different interpreters to the same thread: one Python thread state would only belong to a single native thread.
I would be interested to explore the option of using TLS everywhere, but first we need to solve all these tricky issues.
Again, so far, passing tstate explicitly is only done internally, so if we switch to another solution which doesn't require to pass tstate explicitly, we can do that without breaking the public C API.
--
I hope that you now have a better overview of the current state of the Python implementation and how it evolves in the last two years.
What worries me is the idea that passing the thread state around is the only way to isolate sub-interpreters. It isn't. Using thread-local variables is a better way. There may be even better approaches, but I'm not aware of them.
Victor
Le lun. 16 mars 2020 à 16:04, Victor Stinner <vstinner@python.org> a écrit :
Hi,
Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review.
Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV...
In short, the answer is yes.
There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
--
In the meanwhile, I modified "small integer singletons" to make them "per-interpreter". So tstate->interp is now used to get small integers in longobject.c.
I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511
The long-term goal is to be able to run multiple isolated interpreters in parallel.
Le lun. 16 mars 2020 à 15:16, Mark Shannon <mark@hotpy.org> a écrit :
There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions.
So far, tstate should only be passed to *internal* C functions. I don't think that the public C API has been modified to pass tstate.
These changes are listed under https://bugs.python.org/issue38644.
There was also https://bugs.python.org/issue36710
I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways.
Would you mind to elaborate?
The changes add bulk to the C-API and may hurt performance.
Did you notice that in benchmarks? I would be curious to see the overhead.
These changes are also causing a lot of churn and merge conflicts (for me at least).
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters.
Victor -- Night gathers, and now my watch begins. It shall not end until my death.
data:image/s3,"s3://crabby-images/f2cb6/f2cb6403da92e69ee6cc8c3fb58b22cdceb03681" alt=""
Le mar. 17 mars 2020 à 15:49, Mark Shannon <mark@hotpy.org> a écrit :
* https://bugs.python.org/issue10915 * https://bugs.python.org/issue15751
It's unclear to me if fixing this issue would require to add a lock, nor if it would make PyGILState_GetThisThreadState() or _PyThreadState_GET() slower.
It doesn't require a lock, and it is only two instructions (it's 5 instructions on Windows, but that's still cheap).
I'm not sure that I understand what you mean. Do you have a fix for these issues, to be able to use this API in subinterpreters? I asked if fixing these issues may need to add another lock or atomic variable. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
On 16/03/2020 3:04 pm, Victor Stinner wrote:
Hi,
Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review.
Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV...
In short, the answer is yes.
I said "no" then and gave reasons. AFAICT no one has faulted my reasoning. Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack. 1. Using a thread-local variable for the thread state requires much smaller changes to the code base. 2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks? 3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall.
There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts.
--
In the meanwhile, I modified "small integer singletons" to make them "per-interpreter". So tstate->interp is now used to get small integers in longobject.c.
I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511
That's great, but I don't see the relevance.
The long-term goal is to be able to run multiple isolated interpreters in parallel.
An admirable goal, IMO. But how is it to be done? That is the question.
Le lun. 16 mars 2020 à 15:16, Mark Shannon <mark@hotpy.org> a écrit :
There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions.
So far, tstate should only be passed to *internal* C functions. I don't think that the public C API has been modified to pass tstate.
These changes are listed under https://bugs.python.org/issue38644.
There was also https://bugs.python.org/issue36710
I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways.
Would you mind to elaborate?
Using thread-local variables changes the API less, is more reliable and maybe faster.
The changes add bulk to the C-API and may hurt performance.
Did you notice that in benchmarks? I would be curious to see the overhead.
These changes are also causing a lot of churn and merge conflicts (for me at least).
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters.
I don't think they *should*. When they *must*, then do that. Changes should only be made if necessary for correctness or for a significant improvement of performance. Cheers, Mark.
Victor
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Tue, 17 Mar 2020 14:47:19 +0000 Mark Shannon <mark@hotpy.org> wrote:
On 16/03/2020 3:04 pm, Victor Stinner wrote:
Hi,
Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review.
Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGV...
In short, the answer is yes.
I said "no" then and gave reasons. AFAICT no one has faulted my reasoning.
Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack.
1. Using a thread-local variable for the thread state requires much smaller changes to the code base.
2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks?
3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall.
The problem here is if different subinterpreters create a thread state for the same OS thread. Sounds unlikely, perhaps, but not entirely, especially if that thread state is created from a foreign C library's callback. Regards Antoine.
data:image/s3,"s3://crabby-images/832a7/832a7d28e16a261c5f64f5c6fc6585753582feae" alt=""
On 17Mar2020 1447, Mark Shannon wrote:
On 16/03/2020 3:04 pm, Victor Stinner wrote:
In short, the answer is yes.
I said "no" then and gave reasons. AFAICT no one has faulted my reasoning.
I said "yes" then and was also not faulted.
Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack.
1. Using a thread-local variable for the thread state requires much smaller changes to the code base.
Using thread-local variables enforces a threading model on the host application, rather than working with the existing threading model. So anyone embedding is forced into *significantly* more code as a result. We can (and should) maintain a public-facing API that uses TLS to pass the current thread state around - we have compatibility constraints. But we can also add private versions that take the thread state (once you've started trying/struggling to really embed CPython, you'll happily take a dependency on "private" APIs). If the only available API requires TLS, then you're likely to see the caller wrap it all up in a function that updates TLS before calling. Or alternatively, introduce dedicated threads for running Python snippets on, and all the (dead)locking that results (yes, I've done both). Our goal as core CPython developers should be to sacrifice our own effort to reduce the effort needed by our users, not to do things that make our own lives easier but harm them.
2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks?
Undefined behaviour is totally acceptable here. We can assert in debug builds - developers who make use of this can test with debug builds.
3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall.
Compilers can optimise parameters/locals in ways that are far more efficient than they can do for anything stored outside the call stack. Especially for internal calls. Going through public/exported functions is a little more restricted in terms of optimisations, but if we identify an issue here then we can work on that then. [OTHER POST]
Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q
Showing what one particular compiler generates for one particular situation is terrible information (I won't bother calling it "evidence").
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts.
That's a very selfish point-of-view :) It eases it because many more operations need to know the current Python "thread" in order to access things that used to be globals, such as PyTypeObject instances. Having the thread state easily and efficiently accessible does make a difference here.
The long-term goal is to be able to run multiple isolated interpreters in parallel.
An admirable goal, IMO. But how is it to be done? That is the question.
By isolating thread states properly from global state.
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters.
I don't think they *should*. When they *must*, then do that. Changes should only be made if necessary for correctness or for a significant improvement of performance.
They must - I think Victor just chose the wrong English word there. Correctness is the first thing to fall when you access globals in multithreaded code, and the CPython code base accesses a lot of globals. Cheers, Steve
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
On 17/03/2020 3:38 pm, Steve Dower wrote:
On 17Mar2020 1447, Mark Shannon wrote:
On 16/03/2020 3:04 pm, Victor Stinner wrote:
In short, the answer is yes.
I said "no" then and gave reasons. AFAICT no one has faulted my reasoning.
I said "yes" then and was also not faulted.
I'll do that now then ;) The accessibility of a thread-local variable is a strict superset of that of a function-local variable. Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack.
1. Using a thread-local variable for the thread state requires much smaller changes to the code base.
Using thread-local variables enforces a threading model on the host application, rather than working with the existing threading model. So anyone embedding is forced into *significantly* more code as a result.
Putting a value in a function-local variable enforces stronger restrictions than putting it in a thread-local variable. I am proposing that we *don't* change the API. How does that make more work for anyone using the API?
We can (and should) maintain a public-facing API that uses TLS to pass the current thread state around - we have compatibility constraints. But we can also add private versions that take the thread state (once you've started trying/struggling to really embed CPython, you'll happily take a dependency on "private" APIs).
Again, I am requesting that we *don't* change the API. Not changing the API maintains backwards compatibility better than changing it, surely.
If the only available API requires TLS, then you're likely to see the caller wrap it all up in a function that updates TLS before calling. Or alternatively, introduce dedicated threads for running Python snippets on, and all the (dead)locking that results (yes, I've done both).
All the platforms that we support have thread-local storage. If a platform doesn't have threads at all, then thread-local just degenerates to a global.
Our goal as core CPython developers should be to sacrifice our own effort to reduce the effort needed by our users, not to do things that make our own lives easier but harm them.
Indeed. We might want to speed Python up a bit as well :)
2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks?
Undefined behaviour is totally acceptable here. We can assert in debug builds - developers who make use of this can test with debug builds.
I'm not sure what your point is about undefined behaviour.
3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall.
Compilers can optimise parameters/locals in ways that are far more efficient than they can do for anything stored outside the call stack. Especially for internal calls. Going through public/exported functions is a little more restricted in terms of optimisations, but if we identify an issue here then we can work on that then.
Please skip the patronizing "how compilers work" stuff. I know how register allocators work.
[OTHER POST]
Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q
Showing what one particular compiler generates for one particular situation is terrible information (I won't bother calling it "evidence").
The particular situation is the use of a thread-local variable, which is the point under discussion. Here's the links for clang and MSVC: https://godbolt.org/z/YnbbqD https://www.godbolt.ms/z/9nQEqf
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts.
That's a very selfish point-of-view :)
Why? Merge conflicts are a problem for everyone.
It eases it because many more operations need to know the current Python "thread" in order to access things that used to be globals, such as PyTypeObject instances. Having the thread state easily and efficiently accessible does make a difference here.
Indeed. I want it to be easily and efficiently accessible. Putting it a thread-local variable does both. For additional efficiency `_PyThreadState_GET()` can be defined as `inline`.
The long-term goal is to be able to run multiple isolated interpreters in parallel.
An admirable goal, IMO. But how is it to be done? That is the question.
By isolating thread states properly from global state.
Yes. But why do you think passing a value down the stack does that better than a thread-local variable?
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters.
I don't think they *should*. When they *must*, then do that. Changes should only be made if necessary for correctness or for a significant improvement of performance.
They must - I think Victor just chose the wrong English word there. Correctness is the first thing to fall when you access globals in multithreaded code, and the CPython code base accesses a lot of globals.
Indeed, but this discussion has nothing to do with global variables, it is about how we access thread-state. Cheers, Mark.
Cheers, Steve
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon <mark@hotpy.org> wrote:
The accessibility of a thread-local variable is a strict superset of that of a function-local variable.
Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here. ChrisA
data:image/s3,"s3://crabby-images/832a7/832a7d28e16a261c5f64f5c6fc6585753582feae" alt=""
On 17Mar2020 1803, Chris Angelico wrote:
On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon <mark@hotpy.org> wrote:
The accessibility of a thread-local variable is a strict superset of that of a function-local variable.
Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here.
You haven't. Separating the Python thread from the "physical" thread is indeed the point. Cheers, Steve
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
On 17/03/2020 7:00 pm, Steve Dower wrote:
On 17Mar2020 1803, Chris Angelico wrote:
On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon <mark@hotpy.org> wrote:
The accessibility of a thread-local variable is a strict superset of that of a function-local variable.
Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here.
You haven't. Separating the Python thread from the "physical" thread is indeed the point.
That seems like a strawman argument; maybe I'm misunderstanding Steve. It seems to me that Steve is implying that using thread-local variables somehow prevents that separation. It does not. The point I'm making is that adding `tstate` parameters everywhere is unnecessary. Using a thread local variable is much less intrusive and is at least as capable. Cheers, Mark.
data:image/s3,"s3://crabby-images/db629/db629be3404f4763b49bef32351c2f48b5904d7c" alt=""
Mark Shannon wrote:
The point I'm making is that adding `tstate` parameters everywhere is unnecessary. Using a thread local variable is much less intrusive and is at least as capable.
Objectively speaking, what would be the specific difference(s) in behavior and performance between using a thread-local variable vs passing a tstate parameter? A more in-depth "pros vs cons" analysis of each approach might help to make the argument points more clear, as it seems some of the parties involved in the discussion are talking past each other to some degree, or at least aren't completely on the same page. It might also help to find a specific C-API function to benchmark, to show how substantial the supposed performance differences could be. Intuitively, it seems like passing around an extra parameter would add some penalty, but it's not clear to me as to how much that would *realistically* add in performance cost compared to accessing the threadstate through a thread-local variable. It seems like it depends on exactly how often `_ThreadState_GET()` would have to be called, but that's not at all obvious considering the scope of the changes.
Using a thread local variable is much less intrusive and is at least as capable.
I'm not sure that I'm clear on how the tstate parameter would be additionally intrusive, considering that the changes only affect the internal parts of the C-API. I would agree if it were to be arbitrarily added to public API since it would break backwards compatibility, but that doesn't seem to be the case here. From the perspective of being intrusive, it seems like it's adding some code churn and a potential merge conflict to resolve locally. That seems like an inconvenience to us rather than something that would be intrusive to users. Also, regarding the thread-local variable example demonstrated w/ godbolt (thanks for providing those :-) ), it seems like it would be very clear to access the *current* threadstate that was created by the *main* interpreter, but how would you go about resolving the following?: 1) A different threadstate from another interpreter and OS thread 2) A different threadstate created from the same OS thread, but from a different interpreter (as Antoine mentioned earlier) From my understanding, those scenarios seem to be a significant concern in the context of subinterpreters. If we could also see example(s) which address those scenarios with a thread-local variable instead of a tstate parameter, I think it would allow for more objective comparison between them. Regards, Kyle Stanley On Wed, Mar 18, 2020 at 6:36 AM Mark Shannon <mark@hotpy.org> wrote:
On 17/03/2020 7:00 pm, Steve Dower wrote:
On 17Mar2020 1803, Chris Angelico wrote:
On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon <mark@hotpy.org> wrote:
The accessibility of a thread-local variable is a strict superset of that of a function-local variable.
Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here.
You haven't. Separating the Python thread from the "physical" thread is indeed the point.
That seems like a strawman argument; maybe I'm misunderstanding Steve. It seems to me that Steve is implying that using thread-local variables somehow prevents that separation. It does not.
The point I'm making is that adding `tstate` parameters everywhere is unnecessary. Using a thread local variable is much less intrusive and is at least as capable.
Cheers, Mark. _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CLRCMGUC... Code of Conduct: http://python.org/psf/codeofconduct/
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Wed., 18 Mar. 2020, 8:36 pm Mark Shannon, <mark@hotpy.org> wrote:
On 17/03/2020 7:00 pm, Steve Dower wrote:
On 17Mar2020 1803, Chris Angelico wrote:
On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon <mark@hotpy.org> wrote:
The accessibility of a thread-local variable is a strict superset of that of a function-local variable.
Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here.
You haven't. Separating the Python thread from the "physical" thread is indeed the point.
That seems like a strawman argument; maybe I'm misunderstanding Steve. It seems to me that Steve is implying that using thread-local variables somehow prevents that separation. It does not.
The point I'm making is that adding `tstate` parameters everywhere is unnecessary. Using a thread local variable is much less intrusive and is at least as capable.
And that's why the primary public API will still work that way. What the tstate work is doing is taking what are currently global functions that work on implicitly passed state and effectively turning them into C-style methods of the tstate struct. The idea is to make it so that most internal APIs won't care about the currently active thread state, they will instead operate on the tstate that they're given. In the cases where it does matter, they'll either enforce the equivalence through debug assertions or explicit checks, or else they'll take care of changing the state themselves (rather than relying on the caller to do it). The way things are now, it's like instead of defining a normal Python class instance and calling methods on it, we were instead defining a whole host of module level functions that directly manipulated passive data-only class instances stored in a thread local. The state we're trying to get to is more akin to the way Decimal context objects work: there's an active one stored in the thread local state, and functions for retrieving and changing the active context, but once you *have* a reference to a context object, you can use it directly via its methods without needing to activate it first. Cheers, Nick.
data:image/s3,"s3://crabby-images/8c8cc/8c8ccb69b07acfd42f699246c4a44e6942e9d33a" alt=""
On 17 Mar 2020, at 16:43, Mark Shannon <mark@hotpy.org> wrote:
On 17/03/2020 3:38 pm, Steve Dower wrote:
On 17Mar2020 1447, Mark Shannon wrote:
On 16/03/2020 3:04 pm, Victor Stinner wrote:
In short, the answer is yes.
I said "no" then and gave reasons. AFAICT no one has faulted my reasoning. I said "yes" then and was also not faulted.
I'll do that now then ;)
The accessibility of a thread-local variable is a strict superset of that of a function-local variable.
Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter.
Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack.
1. Using a thread-local variable for the thread state requires much smaller changes to the code base. Using thread-local variables enforces a threading model on the host application, rather than working with the existing threading model. So anyone embedding is forced into *significantly* more code as a result.
Putting a value in a function-local variable enforces stronger restrictions than putting it in a thread-local variable.
I am proposing that we *don't* change the API. How does that make more work for anyone using the API?
Are you saying that all interpreters can use the same thread-local-variable for tstate? Or do you need N thread-local-variables for N interpreters? Barry
We can (and should) maintain a public-facing API that uses TLS to pass the current thread state around - we have compatibility constraints. But we can also add private versions that take the thread state (once you've started trying/struggling to really embed CPython, you'll happily take a dependency on "private" APIs).
Again, I am requesting that we *don't* change the API. Not changing the API maintains backwards compatibility better than changing it, surely.
If the only available API requires TLS, then you're likely to see the caller wrap it all up in a function that updates TLS before calling. Or alternatively, introduce dedicated threads for running Python snippets on, and all the (dead)locking that results (yes, I've done both).
All the platforms that we support have thread-local storage. If a platform doesn't have threads at all, then thread-local just degenerates to a global.
Our goal as core CPython developers should be to sacrifice our own effort to reduce the effort needed by our users, not to do things that make our own lives easier but harm them.
Indeed. We might want to speed Python up a bit as well :)
2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks? Undefined behaviour is totally acceptable here. We can assert in debug builds - developers who make use of this can test with debug builds.
I'm not sure what your point is about undefined behaviour.
3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall. Compilers can optimise parameters/locals in ways that are far more efficient than they can do for anything stored outside the call stack. Especially for internal calls. Going through public/exported functions is a little more restricted in terms of optimisations, but if we identify an issue here then we can work on that then.
Please skip the patronizing "how compilers work" stuff. I know how register allocators work.
[OTHER POST]
Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q Showing what one particular compiler generates for one particular situation is terrible information (I won't bother calling it "evidence").
The particular situation is the use of a thread-local variable, which is the point under discussion.
Here's the links for clang and MSVC:
https://godbolt.org/z/YnbbqD https://www.godbolt.ms/z/9nQEqf
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts. That's a very selfish point-of-view :)
Why? Merge conflicts are a problem for everyone.
It eases it because many more operations need to know the current Python "thread" in order to access things that used to be globals, such as PyTypeObject instances. Having the thread state easily and efficiently accessible does make a difference here.
Indeed. I want it to be easily and efficiently accessible. Putting it a thread-local variable does both. For additional efficiency `_PyThreadState_GET()` can be defined as `inline`.
The long-term goal is to be able to run multiple isolated interpreters in parallel.
An admirable goal, IMO. But how is it to be done? That is the question. By isolating thread states properly from global state.
Yes. But why do you think passing a value down the stack does that better than a thread-local variable?
Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters.
I don't think they *should*. When they *must*, then do that. Changes should only be made if necessary for correctness or for a significant improvement of performance. They must - I think Victor just chose the wrong English word there. Correctness is the first thing to fall when you access globals in multithreaded code, and the CPython code base accesses a lot of globals.
Indeed, but this discussion has nothing to do with global variables, it is about how we access thread-state.
Cheers, Mark.
Cheers, Steve
Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ERBHVDSI... Code of Conduct: http://python.org/psf/codeofconduct/
data:image/s3,"s3://crabby-images/f2cb6/f2cb6403da92e69ee6cc8c3fb58b22cdceb03681" alt=""
Le mar. 17 mars 2020 à 15:47, Mark Shannon <mark@hotpy.org> a écrit :
There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html
One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation.
I don't see how this eases the implementation of subinterpreters.
The overall work helps to identify what is currently shared by subinterpreters and prepares the code base to "unshare" data. For me, it's more explicit that _PyErr_Occurred(tstate) gets the current exception from the tstate argument, than PyErr_Occurred() which implicitly gets the current Python thread state. Before the work on subinterpreters started, it was really hard to guess where all the states come from and if anything was shared. I mean, tons of states were shared. Just because it's easier to implement like that! Eric Snow started by moving many core states into _PyRuntimeState or PyInterpreterState between Python 3.7 and Python 3.8. Previously, many functions had an implicit state which was shared by all interpreters. Now, it's still shared, but it becomes easier to reorganize the code to have one state per interpreter. For example, that's how I managed to move the state of the gc module per-interpreter: https://bugs.python.org/issue36854 To be able to run two interpreters in parallel, it's very important to not share anything for best performances but also for correctness. Concurrent accesses of shared states with no locking mecasim can introduce inconsistencies. One solution for that is to not share anything ;-)
I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511
That's great, but I don't see the relevance.
I proposed a fix which moves None and True singletons into PyInterpreterState. If we choose this solution, it would mean that more and more functions will have to pass tstate (ex: none = tstate->interp->none). But let's discuss the details in https://bugs.python.org/issue39511 Other solutions have been proposed ;-)
The long-term goal is to be able to run multiple isolated interpreters in parallel.
An admirable goal, IMO. But how is it to be done? That is the question.
Passing states explicitly in one task. Another task is to convert modules to the multiphase initialization PEP 489. It allows to have multiple isolated instances of the same C extension module, one per subinterpreter. No more shared module state! Eric Snow tracks the different tasks under the https://github.com/ericsnowcurrently/multi-core-python/ project.
Changes should only be made if necessary for correctness or for a significant improvement of performance.
It's for correctness ;-) Victor -- Night gathers, and now my watch begins. It shall not end until my death.
data:image/s3,"s3://crabby-images/f2cb6/f2cb6403da92e69ee6cc8c3fb58b22cdceb03681" alt=""
Mark opened https://bugs.python.org/issue39978 "Vectorcall implementation should conform to PEP 590" where he wrote that passing tstate explicitly slows down PyObject_Vectorcall(). Victor Le lun. 16 mars 2020 à 15:16, Mark Shannon <mark@hotpy.org> a écrit :
Hi,
There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions.
These changes are listed under https://bugs.python.org/issue38644.
I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways.
The changes add bulk to the C-API and may hurt performance. These changes are also causing a lot of churn and merge conflicts (for me at least).
So can we please stop making these changes, at least now?
Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review.
Cheers, Mark. _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KGBXVVJQ... Code of Conduct: http://python.org/psf/codeofconduct/
-- Night gathers, and now my watch begins. It shall not end until my death.
participants (8)
-
Antoine Pitrou
-
Barry Scott
-
Chris Angelico
-
Kyle Stanley
-
Mark Shannon
-
Nick Coghlan
-
Steve Dower
-
Victor Stinner