Moving away from _Py_IDENTIFIER().
I'm planning on moving us to a simpler, more efficient alternative to _Py_IDENTIFIER(), but want to see if there are any objections first before moving ahead. Also see https://bugs.python.org/issue46541. _Py_IDENTIFIER() was added in 2011 to replace several internal string object caches and to support cleaning up the cached objects during finalization. A number of "private" functions (each with a _Py_Identifier param) were added at that time, mostly corresponding to existing functions that take PyObject* or char*. Note that at present there are several hundred uses of _Py_IDENTIFIER(), including a number of duplicates. My plan is to replace our use of _Py_IDENTIFIER() with statically initialized string objects (as fields under _PyRuntimeState). That involves the following: * add a PyUnicodeObject field (not a pointer) to _PyRuntimeState for each string that currently uses _Py_IDENTIFIER() (or _Py_static_string()) * statically initialize each object as part of the initializer for _PyRuntimeState * add a macro to look up a given global string * update each location that currently uses _Py_IDENTIFIER() to use the new macro instead Pros: * reduces indirection (and extra calls) for C-API functions that need the strings (making the code a little easier to understand and speeding it up) * the objects are referenced from a fixed address in the static data section instead of the heap (speeding things up and allowing the C compiler to optimize better) * there is no lazy allocation (or lookup, etc.) so there are fewer possible failures when the objects get used (thus less error return checking) * saves memory (at little, at least) * if needed, the approach for per-interpreter is simpler * helps us get rid of several hundred static variables throughout the code base * allows us to get rid of _Py_IDENTIFIER() and a bunch of related C-API functions * "deep frozen" modules can use the global strings * commonly-used strings could be pre-allocated by adding _PyRuntimeState fields for them Cons: * a little less convenient: adding a global string requires modifying a separate file from the one where you actually want to use the string * strings can get "orphaned" (I'm planning on checking in CI) * some strings may never get used for any given ./python invocation (not that big a difference though) I have a PR up (https://github.com/python/cpython/pull/30928) that adds the global strings and replaces use of _Py_IDENTIFIER() in our code base, except for in non-builtin stdlib extension modules. (Those will be handled separately if we proceed.) The PR also adds a CI check for "orphaned" strings. It leaves _Py_IDENTIFIER() for now, but disallows any Py_BUILD_CORE code from using it. With that change I'm seeing a 1% improvement in performance (see https://github.com/faster-cpython/ideas/issues/230). I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs. However, at least one PyPI project (blender) is using _Py_IDENTIFIER(). So, before we could get rid of it, we'd first have to deal with that project (and any others). To sum up, I wanted to see if there are any objections before I start merging anything. Thanks! -eric
On Wed, Feb 2, 2022 at 3:41 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions.
FTR, here is the (private/internal) C-API affected by getting rid of _Py_IDENTIFIER(): * 21 C-API functions with `_Py_Identifer` parameters - would be dropped + _PyUnicode_FromId() + _PyUnicode_EqualToASCIIId() + _PyObject_CallMethodId() + _PyObject_CallMethodId_SizeT() + _PyObject_CallMethodIdObjArgs() + _PyObject_VectorcallMethodId() + _PyObject_CallMethodIdNoArgs() + _PyObject_CallMethodIdOneArg() + _PyEval_GetBuiltinId() + _PyDict_GetItemId() + _PyDict_SetItemId() + _PyDict_DelItemId() + _PyDict_ContainsId() + _PyImport_GetModuleId() + _PyType_LookupId() + _PyObject_LookupSpecial() + _PyObject_GetAttrId() + _PyObject_SetAttrId() + _PyObject_LookupAttrId() + _PySys_GetObjectId() + _PySys_SetObjectId() * 7 new internal functions to replace the _Py*Id() functions that didn't already have a normal counterpart + _PyObject_CallMethodObj() + _PyObject_IsSingleton() + _PyEval_GetBuiltin() + _PySys_SetAttr() + _PyObject_LookupSpecial() (with PyObject* param) + _PyDict_GetItemWithError() + _PyObject_CallMethod() * the runtime state related to identifiers - would be dropped * _Py_Identifier, _Py_IDENTIFIER(), _Py_static_string() - would be dropped -eric
On Wed, Feb 2, 2022 at 11:49 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
FTR, here is the (private/internal) C-API affected by getting rid of _Py_IDENTIFIER():
* 21 C-API functions with `_Py_Identifer` parameters - would be dropped + _PyUnicode_FromId() + _PyUnicode_EqualToASCIIId() + _PyObject_CallMethodId() + _PyObject_CallMethodId_SizeT() + _PyObject_CallMethodIdObjArgs() + _PyObject_VectorcallMethodId() + _PyObject_CallMethodIdNoArgs() + _PyObject_CallMethodIdOneArg() + _PyEval_GetBuiltinId() + _PyDict_GetItemId() + _PyDict_SetItemId() + _PyDict_DelItemId() + _PyDict_ContainsId() + _PyImport_GetModuleId() + _PyType_LookupId() + _PyObject_LookupSpecial() + _PyObject_GetAttrId() + _PyObject_SetAttrId() + _PyObject_LookupAttrId() + _PySys_GetObjectId() + _PySys_SetObjectId()
I am well aware that we don't provide any backward compatibility warranty on the private and internal C API. *But*. People love to use they anyway. In the top 5000 PyPI projects, I found 11 projects using them: * Cython-0.29.26 (and so indirect most projects using Cython) * datatable-1.0.0 * frozendict-2.2.0 * multidict-6.0.2 * mypy-0.931 * pickle5-0.0.12 * pysqlite3-0.4.6 * ruamel.ordereddict-0.4.15 * scandir-1.10.0 * typed_ast-1.5.2 * zodbpickle-2.2.0 They use the these 17 functions: * _PyDict_ContainsId() * _PyDict_DelItemId() * _PyDict_GetItemId() * _PyDict_GetItemIdWithError() * _PyDict_SetItemId() * _PyEval_GetBuiltinId() * _PyObject_CallMethodId() * _PyObject_CallMethodIdNoArgs() * _PyObject_CallMethodIdObjArgs() * _PyObject_CallMethodIdOneArg() * _PyObject_GetAttrId() * _PyObject_LookupAttrId() * _PyObject_LookupSpecial() * _PyObject_SetAttrId() * _PySys_GetObjectId() * _PyUnicode_EqualToASCIIId() * _PyUnicode_FromId() If the _Py_IDENTIFIER() API is removed, it would be *nice* to provide a migrate path (tool?) to help these projects moving away the _Py_IDENTIFIER() API. Or at least do the work to update these 11 projects. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
Victor Stinner wrote:
On Wed, Feb 2, 2022 at 11:49 PM Eric Snow ericsnowcurrently@gmail.com wrote:
In the top 5000 PyPI projects, I found 11 projects using them: * Cython-0.29.26 (and so indirect most projects using Cython)
I believe Cython is (for once) a false alarm here. I don't think it uses any of those functions. It has a comment that contains "_PyObject_LookupAttrId" - https://github.com/cython/cython/blob/8d2df028bf9536942b60670bf0aa80d6acc746... - This is a bit of code that we'd adapted from CPython and the comment just explains what the original line is. It's possible I've missed something of course, in which case let me know. David
Oh right, Cython seems to be a false positive. A code search found 3 references to __Pyx_PyObject_LookupSpecial(): PYPI-2022-01-26-TOP-5000/Cython-0.29.26.tar.gz: Cython-0.29.26/Cython/Compiler/ExprNodes.py: lookup_func_name = '__Pyx_PyObject_LookupSpecial' PYPI-2022-01-26-TOP-5000/Cython-0.29.26.tar.gz: Cython-0.29.26/Cython/Compiler/Nodes.py: code.putln("%s = __Pyx_PyObject_LookupSpecial(%s, %s); %s" % ( PYPI-2022-01-26-TOP-5000/Cython-0.29.26.tar.gz: Cython-0.29.26/Cython/Utility/ObjectHandling.c: static CYTHON_INLINE PyObject* __Pyx_PyObject_LookupSpecial(PyObject* obj, PyObject* attr_name) { Oh, that's not "_PyObject_LookupSpecial()", it doesn't use the _Py_Identifier type: static CYTHON_INLINE PyObject* __Pyx_PyObject_LookupSpecial(PyObject* obj, PyObject* attr_name) { ... } Victor On Thu, Feb 3, 2022 at 7:27 PM <dw-git@d-woods.co.uk> wrote:
Victor Stinner wrote:
On Wed, Feb 2, 2022 at 11:49 PM Eric Snow ericsnowcurrently@gmail.com wrote:
In the top 5000 PyPI projects, I found 11 projects using them: * Cython-0.29.26 (and so indirect most projects using Cython)
I believe Cython is (for once) a false alarm here. I don't think it uses any of those functions.
It has a comment that contains "_PyObject_LookupAttrId" - https://github.com/cython/cython/blob/8d2df028bf9536942b60670bf0aa80d6acc746... - This is a bit of code that we'd adapted from CPython and the comment just explains what the original line is.
It's possible I've missed something of course, in which case let me know.
David _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5WICJJ5G... Code of Conduct: http://python.org/psf/codeofconduct/
-- Night gathers, and now my watch begins. It shall not end until my death.
Victor Stinner schrieb am 03.02.22 um 22:46:
Oh right, Cython seems to be a false positive.
A code search found 3 references to __Pyx_PyObject_LookupSpecial():
PYPI-2022-01-26-TOP-5000/Cython-0.29.26.tar.gz: Cython-0.29.26/Cython/Compiler/ExprNodes.py: lookup_func_name = '__Pyx_PyObject_LookupSpecial' PYPI-2022-01-26-TOP-5000/Cython-0.29.26.tar.gz: Cython-0.29.26/Cython/Compiler/Nodes.py: code.putln("%s = __Pyx_PyObject_LookupSpecial(%s, %s); %s" % ( PYPI-2022-01-26-TOP-5000/Cython-0.29.26.tar.gz: Cython-0.29.26/Cython/Utility/ObjectHandling.c: static CYTHON_INLINE PyObject* __Pyx_PyObject_LookupSpecial(PyObject* obj, PyObject* attr_name) {
Oh, that's not "_PyObject_LookupSpecial()", it doesn't use the _Py_Identifier type:
static CYTHON_INLINE PyObject* __Pyx_PyObject_LookupSpecial(PyObject* obj, PyObject* attr_name) { ... }
Correct. We (intentionally) have our own way to intern strings and do not depend on CPython's identifier framework. Stefan
Eric Snow schrieb am 04.02.22 um 17:35:
On Fri, Feb 4, 2022 at 8:21 AM Stefan Behnel wrote:
Correct. We (intentionally) have our own way to intern strings and do not depend on CPython's identifier framework.
You're talking about __Pyx_StringTabEntry (and __Pyx_InitString())?
Yes, that's what we generate. The C code parsing is done here: https://github.com/cython/cython/blob/79637b23da77732e753b1e1ab5669b3e29978b... The deduplication is a bit complex on our side because it needs to handle Python source encodings, and also distinguishes between identifiers (that become 'str' in Py2), plain Unicode strings and byte strings. You don't need most of that for plain C code. But it's done here: https://github.com/cython/cython/blob/79637b23da77732e753b1e1ab5669b3e29978b... And then there's a whole bunch of code that helps in getting Unicode character code points and arbitrary byte values in very long strings pushed through C compilers, while keeping it mostly readable for interested users. :) https://github.com/cython/cython/blob/master/Cython/Compiler/StringEncoding.... You probably don't need that either, as long as you only deal with ASCII strings. Any way, have fun. Feel free to ask if I can help. Stefan
On Thu, Feb 3, 2022 at 7:17 AM Victor Stinner <vstinner@python.org> wrote:
In the top 5000 PyPI projects, I found 11 projects using them: [snip[ They use the these 17 functions:
Thanks! That is super helpful.
If the _Py_IDENTIFIER() API is removed, it would be *nice* to provide a migrate path (tool?) to help these projects moving away the _Py_IDENTIFIER() API. Or at least do the work to update these 11 projects.
If something like _Py_IDENTIFIER() provides genuine value then we should consider a proper public API. Otherwise I agree that we should work with those projects to stop using it. I guess either way they should stop using the "private" API. :) -eric
+1 for overall. On Thu, Feb 3, 2022 at 7:45 AM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs. However, at least one PyPI project (blender) is using _Py_IDENTIFIER(). So, before we could get rid of it, we'd first have to deal with that project (and any others).
It would be nice to provide something similar to _PY_IDENTIFIER, but designed (and documented) for 3rd party modules like this. ``` typedef struct { Py_IDENTIFIER(foo); ... } Modstate; ... // in some func Modstate *state = (Modstate*)PyModule_GetState(module); PyObject_GetAttr(o, PY_IDENTIFIER_STR(state->foo)); ... // m_free() static void mod_free(PyObject *module) { Modstate *state = (Modstate*)PyModule_GetState(module); Py_IDENTIFIER_CLEAR(state->foo); } ``` -- Inada Naoki <songofacandy@gmail.com>
By the way, Argument Clinic now produces way faster calling conventions than hand-written METH_VARARGS with PyArg_ParseTuple(). It would be make this tool available to 3rd party projects. Either extract it and put it on PyPI, but it means that Python will need to Python and pip install something to build itself... not good? Or we can add it to stdlib. IMO we need to write more tests and more documentation. Right now, test_clinic is "light". Another issue is that it requires on the currently privte _PyArg_Parser structure. By the way, this structure is causing crashes if sub-interpreters are run in parallel (one GIL per interpreter) because of the _PyArg_Parser.kwtuple tuple of keyword parameter names (tuple of str). If Eric's tool becomes successful inside CPython, we may consider to make it available to 3rd party projects as well. Such tool and Argument Clinic and not so different than Cython which generates C code to ease the development of C extensions modules. Victor On Thu, Feb 3, 2022 at 7:50 AM Inada Naoki <songofacandy@gmail.com> wrote:
+1 for overall.
On Thu, Feb 3, 2022 at 7:45 AM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs. However, at least one PyPI project (blender) is using _Py_IDENTIFIER(). So, before we could get rid of it, we'd first have to deal with that project (and any others).
It would be nice to provide something similar to _PY_IDENTIFIER, but designed (and documented) for 3rd party modules like this.
``` typedef struct { Py_IDENTIFIER(foo); ... } Modstate; ... // in some func Modstate *state = (Modstate*)PyModule_GetState(module); PyObject_GetAttr(o, PY_IDENTIFIER_STR(state->foo)); ... // m_free() static void mod_free(PyObject *module) { Modstate *state = (Modstate*)PyModule_GetState(module); Py_IDENTIFIER_CLEAR(state->foo); } ```
-- Inada Naoki <songofacandy@gmail.com> _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZZ5QOZDO... Code of Conduct: http://python.org/psf/codeofconduct/
-- Night gathers, and now my watch begins. It shall not end until my death.
On Thu, Feb 3, 2022 at 6:37 AM Victor Stinner <vstinner@python.org> wrote:
By the way, Argument Clinic now produces way faster calling conventions than hand-written METH_VARARGS with PyArg_ParseTuple(). It would be make this tool available to 3rd party projects.
Either extract it and put it on PyPI, but it means that Python will need to Python and pip install something to build itself... not good?
Since we check in the generated source you only need Python installed while developing CPython, not to build it. And since that's already a requirement to build the docs, freeze code, etc. I don't think that's an unreasonable requirement. We also don't have to break Argument Clinic out of our repo just to make it available externally. There's nothing saying we can't have a subdirectory of code that we directly rely on and publish separately for our development needs. Could even add blurb into the repo that way if we wanted to. -Brett
Or we can add it to stdlib. IMO we need to write more tests and more documentation. Right now, test_clinic is "light". Another issue is that it requires on the currently privte _PyArg_Parser structure. By the way, this structure is causing crashes if sub-interpreters are run in parallel (one GIL per interpreter) because of the _PyArg_Parser.kwtuple tuple of keyword parameter names (tuple of str).
If Eric's tool becomes successful inside CPython, we may consider to make it available to 3rd party projects as well. Such tool and Argument Clinic and not so different than Cython which generates C code to ease the development of C extensions modules.
Victor
On Thu, Feb 3, 2022 at 7:50 AM Inada Naoki <songofacandy@gmail.com> wrote:
+1 for overall.
On Thu, Feb 3, 2022 at 7:45 AM Eric Snow <ericsnowcurrently@gmail.com>
wrote:
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs. However, at least one PyPI project (blender) is using _Py_IDENTIFIER(). So, before we could get rid of it, we'd first have to deal with that project (and any others).
It would be nice to provide something similar to _PY_IDENTIFIER, but designed (and documented) for 3rd party modules like this.
``` typedef struct { Py_IDENTIFIER(foo); ... } Modstate; ... // in some func Modstate *state = (Modstate*)PyModule_GetState(module); PyObject_GetAttr(o, PY_IDENTIFIER_STR(state->foo)); ... // m_free() static void mod_free(PyObject *module) { Modstate *state = (Modstate*)PyModule_GetState(module); Py_IDENTIFIER_CLEAR(state->foo); } ```
-- Inada Naoki <songofacandy@gmail.com> _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZZ5QOZDO... Code of Conduct: http://python.org/psf/codeofconduct/
-- Night gathers, and now my watch begins. It shall not end until my death. _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QNP7RDER... Code of Conduct: http://python.org/psf/codeofconduct/
On Thu, 2022-02-03 at 15:32 +0100, Victor Stinner wrote:
By the way, Argument Clinic now produces way faster calling conventions than hand-written METH_VARARGS with PyArg_ParseTuple(). It would be make this tool available to 3rd party projects.
Either extract it and put it on PyPI, but it means that Python will need to Python and pip install something to build itself... not good?
Or we can add it to stdlib. IMO we need to write more tests and more documentation. Right now, test_clinic is "light". Another issue is that it requires on the currently privte _PyArg_Parser structure. By the way, this structure is causing crashes if sub-interpreters are run in parallel (one GIL per interpreter) because of the _PyArg_Parser.kwtuple tuple of keyword parameter names (tuple of str).
If Eric's tool becomes successful inside CPython, we may consider to make it available to 3rd party projects as well. Such tool and Argument Clinic and not so different than Cython which generates C code to ease the development of C extensions modules.
It would be great to have a clear API available to store constants like this, for NumPy these currently are: * mainly interned strings * Some custom singletons which are currently defined in Python (including exceptions, but not so important) * Python functions that are called from C. I am pretty sure that some of the places that do not have module state available, but likely many (most?) of them could be refactored (e.g. a converter function that needs a singleton: it isn't terrible to manually converter later on). Many will be in methods. It would be great to have a "blessed" solution for kwarg parsing. I had shunned argument clinic because it felt too internal, probably CPython specific, and it seemed like you basically need to check in the generated code for each supported Python version... The current solution in NumPy [1] is a bit limited and uses those statics to store the interned strings. And it duplicates Python logic that would be nice to not duplicate. Cheers, Sebastian [1] For the curious, the API for NumPy looks like this: /* * Declare a function static struct for string creation/interning * and caching other prep (generated on the first call. * `npy_parse_arugments` is a macro, but only to insert the cache name) */ NPY_PREPARE_ARGPARSER; if (npy_parse_arguments(funcname, args, len_args, kwnames "", NULL, &pos_only_obj, "kw1", &Converter, &kw1_val, "|kw2", NULL, &kw2_obj, "$kw_only1", NULL, &kw_only2_obj, NULL, NULL, NULL) < 0) { return NULL; } So it is very limited to converters (e.g. no "i" or "!O" convenience), since we don't use that much anyway. One annoyance is that "i" and "s" don't have a clear "converter" and that converters can't raise an error message that includes the function and parameter name.
Victor
On Thu, Feb 3, 2022 at 7:50 AM Inada Naoki <songofacandy@gmail.com> wrote:
+1 for overall.
On Thu, Feb 3, 2022 at 7:45 AM Eric Snow < ericsnowcurrently@gmail.com> wrote:
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs. However, at least one PyPI project (blender) is using _Py_IDENTIFIER(). So, before we could get rid of it, we'd first have to deal with that project (and any others).
It would be nice to provide something similar to _PY_IDENTIFIER, but designed (and documented) for 3rd party modules like this.
``` typedef struct { Py_IDENTIFIER(foo); ... } Modstate; ... // in some func Modstate *state = (Modstate*)PyModule_GetState(module); PyObject_GetAttr(o, PY_IDENTIFIER_STR(state->foo)); ... // m_free() static void mod_free(PyObject *module) { Modstate *state = (Modstate*)PyModule_GetState(module); Py_IDENTIFIER_CLEAR(state->foo); } ```
-- Inada Naoki <songofacandy@gmail.com> _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZZ5QOZDO... Code of Conduct: http://python.org/psf/codeofconduct/
On Wed, Feb 2, 2022 at 11:50 PM Inada Naoki <songofacandy@gmail.com> wrote:
It would be nice to provide something similar to _PY_IDENTIFIER, but designed (and documented) for 3rd party modules like this.
I suppose I'd like to know what the value of _Py_IDENTIFIER() is for 3rd party modules. They can already use PyUnicode_InternFromString() to get a "global" object and then store it in their module state. I would not expect _Py_IDENTIFIER() to provide much of an advantage over that. Perhaps I'm missing something? If there is a real benefit then we should definitely figure out a good public API for it (if the current private one isn't sufficient). I won't be authoring that PEP though. :) -eric
On Thu, Feb 3, 2022 at 2:50 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Wed, Feb 2, 2022 at 11:50 PM Inada Naoki <songofacandy@gmail.com> wrote:
It would be nice to provide something similar to _PY_IDENTIFIER, but designed (and documented) for 3rd party modules like this.
I suppose I'd like to know what the value of _Py_IDENTIFIER() is for 3rd party modules. They can already use PyUnicode_InternFromString() to get a "global" object and then store it in their module state. I would not expect _Py_IDENTIFIER() to provide much of an advantage over that. Perhaps I'm missing something?
If there is a real benefit then we should definitely figure out a good public API for it (if the current private one isn't sufficient). I won't be authoring that PEP though. :)
Why not read through some of that code and see what they are doing with it? I imagine one advantage is that _Py_IDENTIFIER() can be used entirely local to a function. E.g. (from _operator.c): _Py_IDENTIFIER(partial); functools = PyImport_ImportModule("functools"); if (!functools) return NULL; partial = _PyObject_GetAttrId(functools, &PyId_partial); That's convenient since it means they don't have to pass module state around. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
On Thu, Feb 3, 2022 at 4:01 PM Guido van Rossum <guido@python.org> wrote:
Why not read through some of that code and see what they are doing with it?
Yep, I'm planning on it.
I imagine one advantage is that _Py_IDENTIFIER() can be used entirely local to a function.
Yeah, they'd have to put something like this in their module init: state->partial_str = PyUnicode_InternFromString("partial"); if (state->partial_str == NULL) { return NULL; }
E.g. (from _operator.c):
_Py_IDENTIFIER(partial); functools = PyImport_ImportModule("functools"); if (!functools) return NULL; partial = _PyObject_GetAttrId(functools, &PyId_partial);
That's convenient since it means they don't have to pass module state around.
I might call that cheating. :) For an extension module this means they are storing a little bit of their state in the runtime/interpreter state instead of in their module state. Is there precedent for that with any of our other API? Regardless, the status quo certainly is simpler (if they aren't already using module state in the function). Without _Py_IDENTIFER() it would look like: functools = PyImport_ImportModule("functools"); if (!functools) return NULL; my_struct *state = (my_struct*)PyModule_GetState(module); if (state == NULL) { Py_DECREF(functools); return NULL; } partial = PyObject_GetAttr(functools, state->partial_str); If they are already using the module state in their function then the code would be simpler: functools = PyImport_ImportModule("functools"); if (!functools) return NULL; partial = PyObject_GetAttr(functools, state->partial_str); -eric
On Thu, Feb 3, 2022 at 11:49 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I suppose I'd like to know what the value of _Py_IDENTIFIER() is for 3rd party modules.
_Py_IDENTIFIER() API is simple and short, Python creates the Python string object automatically, *and* clears this object at exit. It's convenient. On the other side, functions like PyDict_GetItemString() call PyUnicode_FromString(): it allocates memory on the heap, it decodes the bytes string from UTF-8, and then deallocates the memory. It's inefficient just for a dict lookup. In the _Py_IDENTIFIER() API, _PyUnicode_FromId() calls PyUnicode_InternInPlace() which makes dict lookup even more efficient. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Thu, Feb 3, 2022 at 3:49 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I suppose I'd like to know what the value of _Py_IDENTIFIER() is for 3rd party modules.
Between Guido, Victor, Stefan, and Sebastian, I'm getting the sense that a public replacement for _Py_IDENTIFER() would be worth pursuing. Considering that it would probably help numpy move toward subinterpreter support, I may work on this after all. :) (For core CPython we'll still benefit from the statically initialized strings, AKA gh-30928.) -eric
On 2/4/2022 5:37 PM, Eric Snow wrote:
On Thu, Feb 3, 2022 at 3:49 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I suppose I'd like to know what the value of _Py_IDENTIFIER() is for 3rd party modules.
Between Guido, Victor, Stefan, and Sebastian, I'm getting the sense that a public replacement for _Py_IDENTIFER() would be worth pursuing. Considering that it would probably help numpy move toward subinterpreter support, I may work on this after all. :)
(For core CPython we'll still benefit from the statically initialized strings, AKA gh-30928.)
For me, I'd love to see a statically allocated string type (for a real example that's used in Windows, check out [1], specifically when he gets to the fast-pass strings). Essentially, a bare minimum struct around a char* (and/or wchar_t*) that acts as a PyUnicodeObject but doesn't ever allocate anything on the heap. This would also be helpful for config strings, which are often static but need to be copied around a lot, and good for passthrough strings that a native extension or host app might insert and receive back unmodified. Because there's nothing to deallocate, it can be "created" and stored however the caller likes. As soon as Python code does anything with it other than passing it around, a regular PyUnicodeObject is allocated (just like the HSTRING example). I'd expect usage to look very much like the intern examples earlier in the thread, but if we actually return a whole struct then we aren't on the hook for the allocations. Cheers, Steve [1]: https://devblogs.microsoft.com/oldnewthing/20160615-00/?p=93675
You *can* allocate unicode objects statically. We do it in deepfreeze, and Eric's PR under discussion here ( https://github.com/python/cpython/pull/30928) does it. I wonder if a better solution than that PR wouldn't be to somehow change the implementation of _Py_IDENTIFIER() to do that, and make the special 'Id' APIs just aliases for the corresponding unicode-object-based APIs? It wouldn't be ABI compatible, but none of those APIs are in the stable ABI. (Is there a requirement that an Id only contain ASCII characters (i.e., 7-bit)?) On Fri, Feb 4, 2022 at 12:52 PM Steve Dower <steve.dower@python.org> wrote:
On 2/4/2022 5:37 PM, Eric Snow wrote:
On Thu, Feb 3, 2022 at 3:49 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I suppose I'd like to know what the value of _Py_IDENTIFIER() is for 3rd party modules.
Between Guido, Victor, Stefan, and Sebastian, I'm getting the sense that a public replacement for _Py_IDENTIFER() would be worth pursuing. Considering that it would probably help numpy move toward subinterpreter support, I may work on this after all. :)
(For core CPython we'll still benefit from the statically initialized strings, AKA gh-30928.)
For me, I'd love to see a statically allocated string type (for a real example that's used in Windows, check out [1], specifically when he gets to the fast-pass strings).
Essentially, a bare minimum struct around a char* (and/or wchar_t*) that acts as a PyUnicodeObject but doesn't ever allocate anything on the heap. This would also be helpful for config strings, which are often static but need to be copied around a lot, and good for passthrough strings that a native extension or host app might insert and receive back unmodified.
Because there's nothing to deallocate, it can be "created" and stored however the caller likes. As soon as Python code does anything with it other than passing it around, a regular PyUnicodeObject is allocated (just like the HSTRING example).
I'd expect usage to look very much like the intern examples earlier in the thread, but if we actually return a whole struct then we aren't on the hook for the allocations.
Cheers, Steve
[1]: https://devblogs.microsoft.com/oldnewthing/20160615-00/?p=93675 _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/DGX4GBMD... Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
On Fri, Feb 4, 2022 at 8:25 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Fri, Feb 4, 2022, 16:03 Guido van Rossum <guido@python.org> wrote:
I wonder if a better solution than that PR wouldn't be to somehow change the implementation of _Py_IDENTIFIER() to do that,
Yeah, I had the same realization today. I'm going to try it out.
I updated _Py_IDENTIFIER() to use a statically initialized string object and it isn't too bad. The tricky thing is that PyASCIIObject expects to the data to be an array after the object. So the field must be a pre-sized array (like I did in gh-30928). That makes things messier. The alternative is to do what Steve is suggesting. I ran the benchmarks and making _Py_IDENTIFIER() a statically initialized object makes things 2% slower (instead of 1% faster). There are a few things I could do to speed that up a little, but at best we'd get back to performance-neutral. -eric
Thanks for trying it! I'm curious why it would be slower (perhaps less locality? perhaps the ...Id... APIs have some other trick up their sleeve?) but since it's also messier and less backwards compatible than just leaving _Py_IDENTIFIER alone and just not using it, I'd say let's not spend more time on that alternative and just focus on the two other horses still in the race: immortal objects or what you have now. On Mon, Feb 7, 2022 at 5:41 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Fri, Feb 4, 2022 at 8:25 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Fri, Feb 4, 2022, 16:03 Guido van Rossum <guido@python.org> wrote:
I wonder if a better solution than that PR wouldn't be to somehow change the implementation of _Py_IDENTIFIER() to do that,
Yeah, I had the same realization today. I'm going to try it out.
I updated _Py_IDENTIFIER() to use a statically initialized string object and it isn't too bad. The tricky thing is that PyASCIIObject expects to the data to be an array after the object. So the field must be a pre-sized array (like I did in gh-30928). That makes things messier. The alternative is to do what Steve is suggesting.
I ran the benchmarks and making _Py_IDENTIFIER() a statically initialized object makes things 2% slower (instead of 1% faster). There are a few things I could do to speed that up a little, but at best we'd get back to performance-neutral.
-eric
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
On Tue, Feb 8, 2022 at 1:47 PM Guido van Rossum <guido@python.org> wrote:
Thanks for trying it! I'm curious why it would be slower (perhaps less locality? perhaps the ...Id... APIs have some other trick up their sleeve?) but since it's also messier and less backwards compatible than just leaving _Py_IDENTIFIER alone and just not using it, I'd say let's not spend more time on that alternative and just focus on the two other horses still in the race: immortal objects or what you have now.
I think it's because statically allocated strings are not interned. I think deepfreeze should stop using statically allocated strings for interned strings too. -- Inada Naoki <songofacandy@gmail.com>
Inada Naoki schrieb am 08.02.22 um 06:15:
On Tue, Feb 8, 2022 at 1:47 PM Guido van Rossum wrote:
Thanks for trying it! I'm curious why it would be slower (perhaps less locality? perhaps the ...Id... APIs have some other trick up their sleeve?) but since it's also messier and less backwards compatible than just leaving _Py_IDENTIFIER alone and just not using it, I'd say let's not spend more time on that alternative and just focus on the two other horses still in the race: immortal objects or what you have now.
I think it's because statically allocated strings are not interned.
That would explain such a difference.
I think deepfreeze should stop using statically allocated strings for interned strings too.
… or consider the statically allocated strings the interned string value. Unless another one already exists, but that shouldn't be the case for CPython internal strings. Stefan
On 04Feb2022 2303, Guido van Rossum wrote:
You *can* allocate unicode objects statically. We do it in deepfreeze, and Eric's PR under discussion here (https://github.com/python/cpython/pull/30928 <https://github.com/python/cpython/pull/30928>) does it. I wonder if a better solution than that PR wouldn't be to somehow change the implementation of _Py_IDENTIFIER() to do that, and make the special 'Id' APIs just aliases for the corresponding unicode-object-based APIs? It wouldn't be ABI compatible, but none of those APIs are in the stable ABI.
Indeed. I'm sure you can appreciate how I skimmed over that bit :) (plus I've mostly kept up with this by chatting with Eric, rather than reviewing all the code) The guts of the code in question for those who don't want to find it: #define STRUCT_FOR_ASCII_STR(LITERAL) \ struct { \ PyASCIIObject _ascii; \ uint8_t _data[sizeof(LITERAL)]; \ } Which is then used inside another struct to statically allocate all the objects. The gap this has with what I'd like to see is that it will only work for compile-time strings. If instead it could work for an arbitrary uint8_t pointer, rather than an embedded array, that allows even runtime strings to be very cheaply passed into the interpreter (and yes, the caller has to manage the lifetime, but that's pretty easy). So this is perfect for the needs we have for our internal API, but I wouldn't want it to become _the_ public API. (I also wouldn't want to have two public APIs that are subtly different, because in a few years time everyone will forget why and be very confused about it all.) Not sure I'm motivated enough to build it myself, but I'm getting there ;) so perhaps I'll put something together that does what I'd like and give us concrete things to discuss.
(Is there a requirement that an Id only contain ASCII characters (i.e., 7-bit)?)
I don't think it's unreasonable to require that for these internal identifier strings, but it would be a shame to do so for arbitrary strings. Cheers, Steve
On Sat, Feb 5, 2022 at 5:18 AM Steve Dower <steve.dower@python.org> wrote:
The gap this has with what I'd like to see is that it will only work for compile-time strings. If instead it could work for an arbitrary uint8_t pointer, rather than an embedded array, that allows even runtime strings to be very cheaply passed into the interpreter (and yes, the caller has to manage the lifetime, but that's pretty easy).
What's the use case for that, that isn't covered by PyUnicode_FromString()? -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
On 2/5/2022 4:09 PM, Guido van Rossum wrote:
On Sat, Feb 5, 2022 at 5:18 AM Steve Dower <steve.dower@python.org <mailto:steve.dower@python.org>> wrote:
The gap this has with what I'd like to see is that it will only work for compile-time strings. If instead it could work for an arbitrary uint8_t pointer, rather than an embedded array, that allows even runtime strings to be very cheaply passed into the interpreter (and yes, the caller has to manage the lifetime, but that's pretty easy).
What's the use case for that, that isn't covered by PyUnicode_FromString()?
No dynamic memory allocation (and hence, no deallocation) for any case where the original string is already managed, which is really quite common. Access to the memory manager is what requires thread affinity/the GIL for primitive object construction, and string copies - especially with transcoding - are the expensive part. For cases where strings are just passed around but never manipulated (e.g. a lot of filesystem operations, or runtime/interpreter configuration), the string may never have to be decoded at all. It's almost as good as a tagged pointer, but without breaking our existing object model (assuming all the PyUnicode_* functions learn how to handle them, which is necessary). But it's purely a transparent optimisation for most users, with an added opportunity for those using native APIs and probably decent complexity for us as maintainers. There are a lot of edge cases to handle that I'm sure people will use to shoot down the idea, which is why rather than debate details here I'd rather build it and define its boundaries of usefulness, though for now there's plenty of stuff I'd rather do than both of these, so it remains an idea for now :) Cheers, Steve
Yes, please build it; I can't quite follow what you're getting at (probably since I don't know the Windows APIs). On Mon, Feb 7, 2022 at 4:39 AM Steve Dower <steve.dower@python.org> wrote:
On 2/5/2022 4:09 PM, Guido van Rossum wrote:
On Sat, Feb 5, 2022 at 5:18 AM Steve Dower <steve.dower@python.org <mailto:steve.dower@python.org>> wrote:
The gap this has with what I'd like to see is that it will only work for compile-time strings. If instead it could work for an arbitrary uint8_t pointer, rather than an embedded array, that allows even runtime strings to be very cheaply passed into the interpreter (and yes, the caller has to manage the lifetime, but that's pretty easy).
What's the use case for that, that isn't covered by PyUnicode_FromString()?
No dynamic memory allocation (and hence, no deallocation) for any case where the original string is already managed, which is really quite common. Access to the memory manager is what requires thread affinity/the GIL for primitive object construction, and string copies - especially with transcoding - are the expensive part.
For cases where strings are just passed around but never manipulated (e.g. a lot of filesystem operations, or runtime/interpreter configuration), the string may never have to be decoded at all. It's almost as good as a tagged pointer, but without breaking our existing object model (assuming all the PyUnicode_* functions learn how to handle them, which is necessary).
But it's purely a transparent optimisation for most users, with an added opportunity for those using native APIs and probably decent complexity for us as maintainers. There are a lot of edge cases to handle that I'm sure people will use to shoot down the idea, which is why rather than debate details here I'd rather build it and define its boundaries of usefulness, though for now there's plenty of stuff I'd rather do than both of these, so it remains an idea for now :)
Cheers, Steve
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
On 2 Feb 2022, at 23:41, Eric Snow <ericsnowcurrently@gmail.com> wrote:
I […]
Cons:
* a little less convenient: adding a global string requires modifying a separate file from the one where you actually want to use the string * strings can get "orphaned" (I'm planning on checking in CI) * some strings may never get used for any given ./python invocation (not that big a difference though)
The first two cons can probably be fixed by adding some indirection, with some markers at the place of use and a script that uses those to generate the C definitions. Although my gut feeling is that adding a the CI check you mention is good enough and adding the tooling for generating code isn’t worth the additional complexity. Ronald — Twitter / micro.blog: @ronaldoussoren Blog: https://blog.ronaldoussoren.net/
On Thu, Feb 3, 2022 at 6:46 AM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
Although my gut feeling is that adding a the CI check you mention is good enough and adding the tooling for generating code isn’t worth the additional complexity.
Yeah, I came to the same conclusion. :) -eric
Ronald Oussoren via Python-Dev schrieb am 03.02.22 um 14:46:
On 2 Feb 2022, at 23:41, Eric Snow wrote: * a little less convenient: adding a global string requires modifying a separate file from the one where you actually want to use the string * strings can get "orphaned" (I'm planning on checking in CI) * some strings may never get used for any given ./python invocation (not that big a difference though)
The first two cons can probably be fixed by adding some indirection, with some markers at the place of use and a script that uses those to generate the C definitions.
Although my gut feeling is that adding a the CI check you mention is good enough and adding the tooling for generating code isn’t worth the additional complexity.
It's what we do in Cython, and it works really well there. It's very straight forward, you just write something like PYUNICODE("some text here") PYIDENT("somename") in your C code and Cython creates a deduplicated global string table from them and replaces the string constants with the corresponding global variables. (We have two different names because an identifier in Py2 is 'str', not 'unicode'.) Now, the thing with CPython is that the C sources where the replacement would take place are VCS controlled. And a script that replaces the identifiers would have to somehow make sure that the new references do not get renamed, which would lead to non-local changes when strings are added. What you could try is to number the identifiers, i.e. use a macro like _Py_STR(123, "some text here") where you manually add a new identifier as _Py_STR("some text here") and the number is filled in automatically by a script that finds all of them, deduplicates, and adds new identifiers at the end, adding 1 to the maximum number that it finds. That makes sure that identifiers that already have an ID number will not be touched, deleted strings disappear automatically, and non-local changes are prevented. Defining the _Py_STR() macro as #define _Py_STR(id, text) (_Py_global_string_table[id]) or #define _Py_STR(id, text) (_Py_global_string_table##id) would also give you a compile error if someone forgets to run the script. Stefan
On Wed, Feb 2, 2022 at 11:43 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
My plan is to replace our use of _Py_IDENTIFIER() with statically initialized string objects (as fields under _PyRuntimeState). That involves the following:
* add a PyUnicodeObject field (not a pointer) to _PyRuntimeState for each string that currently uses _Py_IDENTIFIER() (or _Py_static_string())
In bpo-39465, I made the _PyUnicode_FromId() compatible with running sub-interpreters in parallel (one GIL per interpreter). A "static" PyUnicodeObject would have to share the reference count between sub-interpreters, whereas Py_INCREF/Py_DECREF are not thread-safe: there is lock to prevent data races. Is there a way to push the "immortal objects" strategy discussed in bpo-40255? The deepfreeze already pushed some functions related to that, like _PyObject_IMMORTAL_INIT() in the internal C API. Moreover... deepfreeze already produces "immortal" PyUnicodeObject strings using the "ob_refcnt = 999999999" hack. IMO we should decide on a strategy. Either we move towards immortal objects (modify Py_INCREF/Py_DECREF to not modify the ref count if an object is immortal), or we make sure that no Python is shared between two Python interpreters.
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs.
Is it required by your work on static strings, or is it more about removing the API which would no longer be consumed by Python itself? If it's not required, would it make sense to follow the PEP 387 deprecation (mark functions as deprecated, document the deprecation, and wait 2 releases to remove it)? Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Thu, Feb 3, 2022 at 6:31 AM Victor Stinner <vstinner@python.org> wrote:
On Wed, Feb 2, 2022 at 11:43 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
My plan is to replace our use of _Py_IDENTIFIER() with statically initialized string objects (as fields under _PyRuntimeState). That involves the following:
* add a PyUnicodeObject field (not a pointer) to _PyRuntimeState for each string that currently uses _Py_IDENTIFIER() (or _Py_static_string())
In bpo-39465, I made the _PyUnicode_FromId() compatible with running sub-interpreters in parallel (one GIL per interpreter).
A "static" PyUnicodeObject would have to share the reference count between sub-interpreters, whereas Py_INCREF/Py_DECREF are not thread-safe: there is lock to prevent data races.
Is there a way to push the "immortal objects" strategy discussed in bpo-40255? The deepfreeze already pushed some functions related to that, like _PyObject_IMMORTAL_INIT() in the internal C API. Moreover... deepfreeze already produces "immortal" PyUnicodeObject strings using the "ob_refcnt = 999999999" hack.
IMO we should decide on a strategy. Either we move towards immortal objects (modify Py_INCREF/Py_DECREF to not modify the ref count if an object is immortal), or we make sure that no Python is shared between two Python interpreters.
Either we come to an agreement as a group (for against the concept and then how to do it), or if there's a stalemate it goes to the SC (with potentially a PEP to properly explain the nuances of the proposed solution). -Brett
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs.
Is it required by your work on static strings, or is it more about removing the API which would no longer be consumed by Python itself?
If it's not required, would it make sense to follow the PEP 387 deprecation (mark functions as deprecated, document the deprecation, and wait 2 releases to remove it)?
Victor -- Night gathers, and now my watch begins. It shall not end until my death. _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FFQ6N7K2... Code of Conduct: http://python.org/psf/codeofconduct/
On Thu, Feb 3, 2022 at 7:26 AM Victor Stinner <vstinner@python.org> wrote:
In bpo-39465, I made the _PyUnicode_FromId() compatible with running sub-interpreters in parallel (one GIL per interpreter).
A "static" PyUnicodeObject would have to share the reference count between sub-interpreters, whereas Py_INCREF/Py_DECREF are not thread-safe: there is lock to prevent data races.
Yeah, if we end up not being able to safely share the global string objects between interpreters then we would move them under PyInterpreterState. Currently I'm putting them under _PyRuntimeState. Doing that might reduce the performance benefits a little, since Py_GET_GLOBAL_STRING() would have to look up the interpreter to use (or we'd have to pass it in). That doesn't seem like much of a penalty though and doesn't impact the other benefits of the change.
Is there a way to push the "immortal objects" strategy discussed in bpo-40255?
I'm planning on circling back to that next week.
The deepfreeze already pushed some functions related to that, like _PyObject_IMMORTAL_INIT() in the internal C API. Moreover... deepfreeze already produces "immortal" PyUnicodeObject strings using the "ob_refcnt = 999999999" hack.
Note we only set the value really high as a safety precaution since these objects are all statically allocated. Eddie Elizondo's proposal involves a number of other key points, including keeping the refcount from changing.
IMO we should decide on a strategy. Either we move towards immortal objects (modify Py_INCREF/Py_DECREF to not modify the ref count if an object is immortal), or we make sure that no Python is shared between two Python interpreters.
+1 The catch is that things get messier when we make some objects per-interpreter while others stay runtime-global. I'm going to write a bit more about this next week, but the best strategy will probably be to first consolidate all the global objects under _PyRuntimeState first and then move them to PyInterpreterState all at once when we can do it safely.
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs.
Is it required by your work on static strings, or is it more about removing the API which would no longer be consumed by Python itself?
It is definitely not required for that. Rather, we won't need it any more so we should benefit from getting rid of it. The only blocker is that some 3rd party modules are using it.
If it's not required, would it make sense to follow the PEP 387 deprecation (mark functions as deprecated, document the deprecation, and wait 2 releases to remove it)?
If you think it's worth it. It's a private API. I'd rather work to get 3rd party modules off it and then move on sooner. -eric
On Wed, Feb 2, 2022 at 2:48 PM Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'm planning on moving us to a simpler, more efficient alternative to _Py_IDENTIFIER(), but want to see if there are any objections first before moving ahead. Also see https://bugs.python.org/issue46541.
_Py_IDENTIFIER() was added in 2011 to replace several internal string object caches and to support cleaning up the cached objects during finalization. A number of "private" functions (each with a _Py_Identifier param) were added at that time, mostly corresponding to existing functions that take PyObject* or char*. Note that at present there are several hundred uses of _Py_IDENTIFIER(), including a number of duplicates.
My plan is to replace our use of _Py_IDENTIFIER() with statically initialized string objects (as fields under _PyRuntimeState). That involves the following:
* add a PyUnicodeObject field (not a pointer) to _PyRuntimeState for each string that currently uses _Py_IDENTIFIER() (or _Py_static_string()) * statically initialize each object as part of the initializer for _PyRuntimeState * add a macro to look up a given global string * update each location that currently uses _Py_IDENTIFIER() to use the new macro instead
Pros:
* reduces indirection (and extra calls) for C-API functions that need the strings (making the code a little easier to understand and speeding it up) * the objects are referenced from a fixed address in the static data section instead of the heap (speeding things up and allowing the C compiler to optimize better) * there is no lazy allocation (or lookup, etc.) so there are fewer possible failures when the objects get used (thus less error return checking) * saves memory (at little, at least) * if needed, the approach for per-interpreter is simpler * helps us get rid of several hundred static variables throughout the code base * allows us to get rid of _Py_IDENTIFIER() and a bunch of related C-API functions * "deep frozen" modules can use the global strings * commonly-used strings could be pre-allocated by adding _PyRuntimeState fields for them
Cons:
* a little less convenient: adding a global string requires modifying a separate file from the one where you actually want to use the string * strings can get "orphaned" (I'm planning on checking in CI) * some strings may never get used for any given ./python invocation (not that big a difference though)
I have a PR up (https://github.com/python/cpython/pull/30928) that adds the global strings and replaces use of _Py_IDENTIFIER() in our code base, except for in non-builtin stdlib extension modules. (Those will be handled separately if we proceed.) The PR also adds a CI check for "orphaned" strings. It leaves _Py_IDENTIFIER() for now, but disallows any Py_BUILD_CORE code from using it.
With that change I'm seeing a 1% improvement in performance (see https://github.com/faster-cpython/ideas/issues/230).
I'd also like to actually get rid of _Py_IDENTIFIER(), along with other related API including ~14 (private) C-API functions. Dropping all that helps reduce maintenance costs. However, at least one PyPI project (blender) is using _Py_IDENTIFIER(). So, before we could get rid of it, we'd first have to deal with that project (and any others).
datapoint: an internal code search turns up blender, multidict, and typed_ast as open source users of _Py_IDENTIFIER . Easy to clean up as PRs. There are a couple of internal uses as well, all of which are similarly easy to address and are only in code that is expected to need API cleanup tweaks between CPython versions. Overall I think addressing the broader strategy question among the performance focused folks is worthwhile though. -gps
To sum up, I wanted to see if there are any objections before I start merging anything. Thanks!
-eric _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/DNMZAMB4... Code of Conduct: http://python.org/psf/codeofconduct/
Am 03.02.22 um 22:41 schrieb Gregory P. Smith:
datapoint: an internal code search turns up blender, multidict, and typed_ast as open source users of _Py_IDENTIFIER . Easy to clean up as PRs. There are a couple of internal uses as well, all of which are similarly easy to address and are only in code that is expected to need API cleanup tweaks between CPython versions.
typed_ast is mostly legacy, to be used for parsing Python code < Python 3.8. It's end of life is scheduled when Python 3.7 end of lifes in June 2023. I believe we can easily get away with just copying the _Py_IDENTIFIER macros over to its source code in typed_ast. - Sebastian
participants (12)
-
Brett Cannon
-
dw-git@d-woods.co.uk
-
Eric Snow
-
Gregory P. Smith
-
Guido van Rossum
-
Inada Naoki
-
Ronald Oussoren
-
Sebastian Berg
-
Sebastian Rittau
-
Stefan Behnel
-
Steve Dower
-
Victor Stinner