On 2019-11-12 23:03, Victor Stinner wrote:
Hi,
Are you OK with modifying internal C functions to pass tstate explicitly?
In short, yes, but:

- don't make things slower :)
- don't break the public API or the stable ABI

I'm a fan of explicitly passing state everywhere, rather than keeping it in "global" variables. Currently, surprisingly many internal functions do a PyThreadState_GET for themselves, then call another function that does the same. That's wasteful, but impossible to change in the public API.

Your changes (of which I only saw a very limited subset) seem to follow a simple rule: public API functions call PyThreadState_GET, and then call internal functions that pass it around. That sounds beautifully easy to explain! Later, we'll just need to find a way to make the tstate API public (and opt-in).

The "per-interpreter None", however, is a different issue. I don't see how that can be done without breaking the stable ABI. I still think immortal immutable objects could be shared across interpreters.
--
I started to modify internal C functions to explicitly pass "tstate", the Python thread state (PyThreadState), when calling C functions. Example of C code (after my changes):
    if (_Py_EnterRecursiveCall(tstate, " while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    _Py_LeaveRecursiveCall(tstate);
    return _Py_CheckFunctionResult(tstate, callable, result, NULL);
In Python 3.8, the tstate is implicit:
    if (Py_EnterRecursiveCall(" while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    Py_LeaveRecursiveCall();
    return _Py_CheckFunctionResult(callable, result, NULL);
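The difference between the two snippets can be sketched outside of CPython as follows: a public function keeps its signature, looks up the thread state once, and delegates to an internal function that receives it as a parameter. This is only a minimal illustration. Every demo_* name, the struct layout, and the recursion limit are hypothetical stand-ins, not the real CPython internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState: only a recursion counter. */
typedef struct {
    int recursion_depth;
} demo_tstate;

#define DEMO_RECURSION_LIMIT 3  /* arbitrary limit chosen for this sketch */

/* Hypothetical stand-in for the "current thread state" global. */
demo_tstate *demo_current_tstate = NULL;

/* Implicit lookup, comparable in spirit to PyThreadState_GET(). */
static demo_tstate *demo_tstate_get(void)
{
    assert(demo_current_tstate != NULL);
    return demo_current_tstate;
}

/* Internal variants: the state is passed in explicitly.
   Enter returns 0 on success and -1 when the limit is reached. */
static int demo_enter_recursive_call_tstate(demo_tstate *tstate)
{
    if (tstate->recursion_depth >= DEMO_RECURSION_LIMIT) {
        return -1;
    }
    tstate->recursion_depth++;
    return 0;
}

static void demo_leave_recursive_call_tstate(demo_tstate *tstate)
{
    tstate->recursion_depth--;
}

/* Public variants: unchanged signature, one lookup, then delegate. */
int demo_enter_recursive_call(void)
{
    return demo_enter_recursive_call_tstate(demo_tstate_get());
}

void demo_leave_recursive_call(void)
{
    demo_leave_recursive_call_tstate(demo_tstate_get());
}

/* Example callee used to exercise demo_call(). */
int demo_double(int x)
{
    return 2 * x;
}

/* Internal caller: the tstate it was given is reused for every step,
   so no additional lookup happens inside this function. */
int demo_call(demo_tstate *tstate, int (*func)(int), int arg, int *result)
{
    if (demo_enter_recursive_call_tstate(tstate) < 0) {
        return -1;
    }
    *result = func(arg);
    demo_leave_recursive_call_tstate(tstate);
    return 0;
}
```

In this sketch only the two public wrappers touch the global; demo_call() and the internal helpers work purely on the pointer they were handed.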
There are different reasons to pass tstate explicitly, but my main motivation is to rework the Python code base to move away from implicit global states to states passed explicitly, in order to implement PEP 554 "Multiple Interpreters in the Stdlib". In short, the final goal is to run multiple isolated Python interpreters in the same process: run pure Python code on multiple CPUs in parallel with a single process (whereas multiprocessing runs multiple processes).
Currently, subinterpreters are a hack: they still share a lot of things, and the code base is not ready for isolated interpreters with one "GIL" (interpreter lock) per interpreter, or for running multiple interpreters in parallel. Many _PyRuntimeState fields (the global _PyRuntime variable) should be moved to PyInterpreterState (or maybe PyThreadState), so that they become per-interpreter.
Another simpler but more annoying example is the Py_None and Py_True singletons, which are globals. We cannot share these singletons between interpreters because updating their reference counter would be a performance bottleneck. If we put a "superglobal-GIL" to ensure that the Py_None reference counter remains consistent, it would basically "serialize" all threads, rather than running them in parallel.
The idea of passing tstate to internal C functions is to prepare code to get the per-interpreter None from tstate.
tstate is basically the "root" to access all states which are per interpreter. For example, PyInterpreterState can be read from tstate->interp.
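To make the "root" idea concrete, here is a minimal sketch in which each interpreter owns its own None object and the object is reached through tstate->interp. It only illustrates the idea described above, not how CPython does or would implement it. All demo_* types and functions are hypothetical, and the object header is reduced to a bare reference count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object header: only a reference count. */
typedef struct {
    long refcnt;
} demo_object;

/* Hypothetical per-interpreter state that owns its own "None". */
typedef struct {
    demo_object none;
} demo_interp;

/* Hypothetical thread state: the root from which the
   per-interpreter state is reached. */
typedef struct {
    demo_interp *interp;
} demo_tstate;

/* Give the interpreter's None its initial reference. */
void demo_interp_init(demo_interp *interp)
{
    interp->none.refcnt = 1;
}

/* Return a new reference to the None of the interpreter that
   tstate belongs to. Only that interpreter's counter is modified. */
demo_object *demo_get_none(demo_tstate *tstate)
{
    assert(tstate != NULL && tstate->interp != NULL);
    demo_object *none = &tstate->interp->none;
    none->refcnt++;
    return none;
}
```

Because each counter is written only by threads of its own interpreter, two interpreters in this sketch never contend on the same memory location when taking a reference to None.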
Right now, tstate is only passed to a few functions, but you should expect to see it passed to way more functions later, once more structures are moved to PyInterpreterState.
--
On my latest merged PR 17052 ("Add _PyObject_VectorcallTstate()"), Mark Shannon wrote: "I don't see how this could ever be faster, nor do I see how it is more correct." https://github.com/python/cpython/pull/17052#issuecomment-552538438
Currently, tstate is retrieved using these internal APIs:
    #define _PyRuntimeState_GetThreadState(runtime) \
        ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))

    #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
or using public APIs:
    PyAPI_FUNC(PyThreadState *) PyThreadState_Get(void);

    #define PyThreadState_GET() PyThreadState_Get()
I dislike _PyThreadState_GET() for 2 reasons:
* it relies on the _PyRuntime global variable: I would prefer to avoid global variables
* it uses an atomic operation, which can become a performance issue as more and more code requires tstate
--
An alternative would be to use PyGILState_GetThisThreadState(), which uses a thread local storage (TLS) variable to get the Python thread state ("tstate"), rather than the _PyRuntime atomic variable. Except that the PyGILState API doesn't support subinterpreters yet :-(
https://bugs.python.org/issue15751 "Support subinterpreters in the GIL state API" has been open since 2012.
Note: While the GIL is released, _PyThreadState_GET() is NULL, whereas PyGILState_GetThisThreadState() is non-NULL.
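The two lookup strategies, and the behaviour described in the note, can be modelled with a short C11 sketch. It is not the CPython implementation: the demo_* names are hypothetical, the "GIL" is reduced to storing or clearing one pointer, and the subinterpreter limitation of the PyGILState API is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState. */
typedef struct {
    int id;
} demo_tstate;

/* Process-wide "current thread state", read with a relaxed atomic
   load, comparable in spirit to gilstate.tstate_current. */
static _Atomic(demo_tstate *) demo_runtime_current;

/* Per-thread slot, comparable in spirit to the TLS variable that
   backs PyGILState_GetThisThreadState(). */
static _Thread_local demo_tstate *demo_tls_tstate;

/* Lookup through the shared global: an atomic load on every call. */
demo_tstate *demo_get_via_runtime(void)
{
    return atomic_load_explicit(&demo_runtime_current, memory_order_relaxed);
}

/* Lookup through thread local storage: no shared variable involved. */
demo_tstate *demo_get_via_tls(void)
{
    return demo_tls_tstate;
}

/* Associate a thread state with the calling thread, once. */
void demo_bind_thread(demo_tstate *tstate)
{
    demo_tls_tstate = tstate;
}

/* "Acquire the GIL": publish tstate as the current one. */
void demo_acquire_gil(demo_tstate *tstate)
{
    atomic_store_explicit(&demo_runtime_current, tstate, memory_order_relaxed);
}

/* "Release the GIL": the global is cleared, the TLS slot is not. */
void demo_release_gil(void)
{
    atomic_store_explicit(&demo_runtime_current, (demo_tstate *)NULL,
                          memory_order_relaxed);
}
```

After demo_release_gil(), demo_get_via_runtime() returns NULL while demo_get_via_tls() still returns the thread's own state, which mirrors the difference pointed out in the note.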
--
Links:
* https://pythoncapi.readthedocs.io/runtime.html : my notes on moving globals to per interpreter states
* https://bugs.python.org/issue36710
* https://bugs.python.org/issue38644
Victor