[Python-Dev] Moving away from _Py_IDENTIFIER().

Feb. 2, 2022

I'm planning on moving us to a simpler, more efficient alternative to
_Py_IDENTIFIER(), but want to see if there are any objections first
before moving ahead.  Also see https://bugs.python.org/issue46541.

_Py_IDENTIFIER() was added in 2011 to replace several internal string
object caches and to support cleaning up the cached objects during
finalization.  A number of "private" functions (each with a
_Py_Identifier param) were added at that time, mostly corresponding to
existing functions that take PyObject* or char*.  Note that at present
there are several hundred uses of _Py_IDENTIFIER(), including a number
of duplicates.

My plan is to replace our use of _Py_IDENTIFIER() with statically
initialized string objects (as fields under _PyRuntimeState).  That
involves the following:

* add a PyUnicodeObject field (not a pointer) to _PyRuntimeState for
each string that currently uses _Py_IDENTIFIER() (or
_Py_static_string())
* statically initialize each object as part of the initializer for
_PyRuntimeState
* add a macro to look up a given global string
* update each location that currently uses _Py_IDENTIFIER() to use the
new macro instead
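
As a rough illustration of the steps above (a standalone sketch, not the
code in the PR): demo_str stands in for PyUnicodeObject,
demo_runtime_state for _PyRuntimeState, and DEMO_GLOBAL_STR() for the
lookup macro.  All of those names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PyUnicodeObject.  The real struct also
   carries a refcount, a type pointer, a cached hash, and state flags. */
typedef struct {
    size_t length;
    const char *data;
} demo_str;

/* Hypothetical stand-in for the relevant part of _PyRuntimeState:
   one embedded object (not a pointer) per global string. */
typedef struct {
    struct {
        demo_str close;
        demo_str name;
        demo_str read;
    } strings;
} demo_runtime_state;

/* Static initializer for one string; sizeof counts the trailing NUL. */
#define DEMO_STR_INIT(LITERAL) { sizeof(LITERAL) - 1, (LITERAL) }

/* The whole table lives in static data: no allocation, no lazy setup. */
static demo_runtime_state demo_runtime = {
    .strings = {
        .close = DEMO_STR_INIT("close"),
        .name = DEMO_STR_INIT("name"),
        .read = DEMO_STR_INIT("read"),
    },
};

/* Lookup macro: expands to a fixed address inside demo_runtime. */
#define DEMO_GLOBAL_STR(NAME) (&demo_runtime.strings.NAME)

/* Example caller: compares a C string against a global string. */
static int demo_is_close(const char *s)
{
    const demo_str *id = DEMO_GLOBAL_STR(close);
    assert(id->length == strlen(id->data));
    return strlen(s) == id->length && memcmp(s, id->data, id->length) == 0;
}
```

Note that adding a new string in this model means touching both the
struct definition and its initializer, which is the "separate file"
inconvenience listed under the cons.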

Pros:

* reduces indirection (and extra calls) for C-API functions that need
the strings (making the code a little easier to understand and
speeding it up)
* the objects are referenced from a fixed address in the static data
section instead of the heap (speeding things up and allowing the C
compiler to optimize better)
* there is no lazy allocation (or lookup, etc.) so there are fewer
possible failures when the objects get used (thus less error return
checking)
* saves memory (a little, at least)
* if we ever need per-interpreter strings, this approach makes that simpler
* helps us get rid of several hundred static variables throughout the code base
* allows us to get rid of _Py_IDENTIFIER() and a bunch of related
C-API functions
* "deep frozen" modules can use the global strings
* commonly-used strings could be pre-allocated by adding
_PyRuntimeState fields for them
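
To make the "no lazy allocation" and "less error return checking" points
concrete, here is a simplified model of the two styles.  It is only a
sketch: lazy_identifier and lazy_get() are hypothetical and merely mimic
the shape of a lazily created, cached string; they are not the actual
_Py_Identifier implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object. */
typedef struct {
    size_t length;
    const char *data;
} demo_str;

/* Lazy model: the object does not exist until the first lookup. */
typedef struct {
    const char *literal;
    demo_str *object;   /* NULL until first use */
} lazy_identifier;

/* Returns the cached object, creating it on first use.
   Can fail (returns NULL) if the allocation fails.  The object is
   never freed in this sketch; a real runtime would release it at
   finalization. */
static demo_str *lazy_get(lazy_identifier *id)
{
    assert(id != NULL && id->literal != NULL);
    if (id->object == NULL) {
        demo_str *s = malloc(sizeof(*s));
        if (s == NULL) {
            return NULL;
        }
        s->length = strlen(id->literal);
        s->data = id->literal;
        id->object = s;
    }
    return id->object;
}

/* Caller in the lazy style: needs an error path. */
static int lazy_length(size_t *out)
{
    static lazy_identifier id_close = { "close", NULL };
    demo_str *s = lazy_get(&id_close);
    if (s == NULL) {
        return -1;
    }
    *out = s->length;
    return 0;
}

/* Static model: the object is part of the program image. */
static const demo_str static_close = { 5, "close" };

/* Caller in the static style: nothing can fail, so no error path. */
static size_t static_length(void)
{
    return static_close.length;
}
```

The static caller has neither the extra call nor the NULL check, which is
the kind of simplification the first and third bullets describe.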

Cons:

* a little less convenient: adding a global string requires modifying
a separate file from the one where you actually want to use the string
* strings can get "orphaned" (I'm planning on checking for that in CI)
* some strings may never get used for any given ./python invocation
(not that big a difference though)

I have a PR up (https://github.com/python/cpython/pull/30928) that
adds the global strings and replaces use of _Py_IDENTIFIER() in our
code base, except for in non-builtin stdlib extension modules.  (Those
will be handled separately if we proceed.)  The PR also adds a CI
check for "orphaned" strings.  It leaves _Py_IDENTIFIER() for now, but
disallows any Py_BUILD_CORE code from using it.

With that change I'm seeing a 1% improvement in performance (see
https://github.com/faster-cpython/ideas/issues/230).

I'd also like to actually get rid of _Py_IDENTIFIER(), along with
other related API including ~14 (private) C-API functions.  Dropping
all that helps reduce maintenance costs.  However, at least one PyPI
project (blender) is using _Py_IDENTIFIER().  So, before we could get
rid of it, we'd first have to deal with that project (and any others).

To sum up, I wanted to see if there are any objections before I start
merging anything.  Thanks!

-eric

Eric Snow