static variables in CPython - duplicated _Py_IDENTIFIERs?

I just ran an analysis of static variable definitions in CPython code, using clang, on Ubuntu and Windows. The results should be available here:
As I understand it, _Py_IDENTIFIER instances are supposed to hold constant strings that are used in Python - e.g. "__class__", "__dict__" and so on. I noticed that there are numerous duplicates of these - e.g. 8 instances of __name__, 11 instances of __dict__, and so on - each instance is defined as static in its source file and so completely distinct from the others.
I realise the overall amount of memory used by these structures isn't large, but is there any particular benefit to having these multiple copies? The current situation seems a little untidy, at least. What would be the disadvantage of making them extern in the headers and allocating them once in some consts.c module? After all, they seem for the most part to be well-known constant strings that don't need to be encapsulated in any particular C compilation unit.
Regards,
Vinay Sajip

From my understanding, _Py_IDENTIFIER was designed for static usage. The overhead is quite low; keeping it as a private (static) module-level internal variable helps to decouple things. This target is very important for keeping public API as small as possible.
On Fri, Sep 20, 2019 at 10:32 PM Vinay Sajip via Python-Dev python-dev@python.org wrote:
I just ran an analysis of static variable definitions in CPython code, using clang, on Ubuntu and Windows. The results should be available here:
As I understand it, _Py_IDENTIFIER instances are supposed to hold constant strings that are used in Python - e.g. "__class__", "__dict__" and so on. I noticed that there are numerous duplicates of these - e.g. 8 instances of __name__, 11 instances of __dict__, and so on - each instance is defined as static in its source file and so completely distinct from the others.
I realise the overall amount of memory used by these structures isn't large, but is there any particular benefit to having these multiple copies? The current situation seems a little untidy, at least. What would be the disadvantage of making them extern in the headers and allocating them once in some consts.c module? After all, they seem for the most part to be well-known constant strings that don't need to be encapsulated in any particular C compilation unit.
Regards,
Vinay Sajip _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GVQOMWXU...

On Fri, Sep 20, 2019 at 1:00 PM Andrew Svetlov andrew.svetlov@gmail.com wrote:
This target is very important for keeping public API as small as possible.
Right, I'm pretty sure that right now Python doesn't have any way to share symbols between .c files without also exposing them in the C API.
This is fixable using "symbol visibility" features, and it would be nice to have the option to share stuff between our own C files without also sharing it with the world, for lots of reasons. But it might be necessary to implement that first before doing anything to share _Py_IDENTIFIERs.
-n

Right, I'm pretty sure that right now Python doesn't have any way to
share symbols between .c files without also exposing them in the C API.
On other C projects I've worked on, the public API is expressed in one set of header files, and internal APIs that need to be exposed across modules are described in a different set of internal header files, and developers who incorrectly use internal APIs by including the internal headers could see breakage when the internals change ... excuse my naïveté, as I haven't done much at Python's C level - does this discipline/approach not apply to CPython?
Regards,
Vinay Sajip

On Fri, Sep 20, 2019 at 2:58 PM Vinay Sajip via Python-Dev python-dev@python.org wrote:
Right, I'm pretty sure that right now Python doesn't have any way to
share symbols between .c files without also exposing them in the C API.
On other C projects I've worked on, the public API is expressed in one set of header files, and internal APIs that need to be exposed across modules are described in a different set of internal header files, and developers who incorrectly use internal APIs by including the internal headers could see breakage when the internals change ... excuse my naïveté, as I haven't done much at Python's C level - does this discipline/approach not apply to CPython?
Visibility in C is complicated :-). The level I'm talking about is symbol visibility, which is determined by the linker, not by headers. If a symbol is exported, then even if you hide the headers, it's still part of the library ABI, can still collide with user symbols, can still by accessed by determined users, etc. It's still fixable, but it requires some newer tools like -fvisibility=hidden that work differently across different platforms, and so far no-one's done the work to sort out the details.
-n

requires some newer tools like -fvisibility=hidden that work differently across different platforms, and so far no-one's done the work to sort out the details.
I've started looking at this, but quite apart from the specifics of applying -fvisibility=hidden, there are some things that aren't yet clear to me about the intent behind some of our symbol definitions. For example, the file Include/fileutils.h contains the definitions
PyAPI_FUNC(wchar_t *) Py_DecodeLocale(const char *arg, size_t *size);
and
PyAPI_FUNC(int) _Py_DecodeLocaleEx(const char *arg, wchar_t **wstr, size_t *wlen, const char **reason, int current_locale, _Py_error_handler errors);
However, only the first of these is documented, though the definition via PyAPI_FUNC implies that both are part of the public API. If this is the case, why aren't both documented? If _Py_DecodeLocaleEx is not part of the public API (and the leading underscore suggests so), should it be polluting the symbol space?
The comment for PyAPI_FUNC is "Declares a public Python API function and return type". Is this really the case, or has PyAPI_FUNC been coopted to provide external linkage for use by Python-internal code in different compilation units? _Py_DecodeLocaleEx is called in Modules/_testcapimodule.c and also in Objects/unicodeobject.c.
If we want to take steps to restrict symbol visibility, it will potentially affect all of the code base - so presumably, a PEP would be required, even though it's an implementation detail from the point of view of the language itself?
Regards,
Vinay Sajip

On Mon, Sep 23, 2019, 08:28 Vinay Sajip via Python-Dev < python-dev@python.org> wrote:
requires some newer tools like -fvisibility=hidden that work differently across different platforms, and so far no-one's done the work to sort out the details.
I've started looking at this, but quite apart from the specifics of applying -fvisibility=hidden, there are some things that aren't yet clear to me about the intent behind some of our symbol definitions. For example, the file Include/fileutils.h contains the definitions
PyAPI_FUNC(wchar_t *) Py_DecodeLocale(const char *arg, size_t *size);
and
PyAPI_FUNC(int) _Py_DecodeLocaleEx(const char *arg, wchar_t **wstr, size_t *wlen, const char **reason, int current_locale, _Py_error_handler errors);
However, only the first of these is documented, though the definition via PyAPI_FUNC implies that both are part of the public API. If this is the case, why aren't both documented? If _Py_DecodeLocaleEx is not part of the public API (and the leading underscore suggests so), should it be polluting the symbol space?
The comment for PyAPI_FUNC is "Declares a public Python API function and return type". Is this really the case, or has PyAPI_FUNC been coopted to provide external linkage for use by Python-internal code in different compilation units? _Py_DecodeLocaleEx is called in Modules/_testcapimodule.c and also in Objects/unicodeobject.c.
If we want to take steps to restrict symbol visibility, it will potentially affect all of the code base - so presumably, a PEP would be required, even though it's an implementation detail from the point of view of the language itself?
Windows already has working symbol visibility handling, and PyAPI_FUNC is what controls it. So adding symbol visibility handling to Linux/macOS is just about making all the platforms consistent. There might be some weird choices being made, but I don't think you need to sort all those out as part of this.
-n

Nathaniel Smith wrote:
Windows already has working symbol visibility handling, and PyAPI_FUNC is what controls it. So adding symbol visibility handling to Linux/macOS is just about making all the platforms consistent. There might be some weird choices being made, but I don't think you need to sort all those out as part of this.
Well, _Py_DecodeLocaleEx is declared with PyAPI_FUNC, so would you expect it to be exposed on Windows? I haven't a Windows machine handy right now, but I would expect it not to be exposed, as it doesn't appear in PC/python3.def. I will check when I get a chance.
Regards,
Vinay Sajip

Le lun. 23 sept. 2019 à 21:36, Vinay Sajip via Python-Dev python-dev@python.org a écrit :
Nathaniel Smith wrote:
Windows already has working symbol visibility handling, and PyAPI_FUNC is what controls it. So adding symbol visibility handling to Linux/macOS is just about making all the platforms consistent. There might be some weird choices being made, but I don't think you need to sort all those out as part of this.
Well, _Py_DecodeLocaleEx is declared with PyAPI_FUNC, so would you expect it to be exposed on Windows?
_Py_DecodeLocaleEx() should not be used outside Python. It's really a low-level function which should not be used directly.
Its definition should be moved to the internal API and it should use extern rather than PyAPI_FUNC(). It's a private API so we can make it internal.
Victor

OK - but that's just one I picked at random. There are others like it - what would be the process for deciding which ones need to be made private and moved? Should an issue be raised to track this?
Regards,
Vinay Sajip

On 2019-09-23 21:22, Vinay Sajip via Python-Dev wrote:
OK - but that's just one I picked at random. There are others like it - what would be the process for deciding which ones need to be made private and moved? Should an issue be raised to track this?
It's not really a question of which ones should be made private, but of which ones should be remain public. If it's mentioned in the docs, then it's public, otherwise it should be private. How easy would it be to search the sources and the docs for the ones that are currently public but not documented?

How easy would it be to search the sources and the docs for the ones
that are currently public but not documented?
Comparing the docs and .c files for functions that are not documented would be fairly trivial, but it's very difficult to define something as being public if it's not documented, since that's the main definition used. If it's not in the documentation, there's not a definitive way tell that it's public (AFAIK).
The presence or lack of the name starting with an underscore can sometimes be an indicator, but that's not consistently reliable.
On Mon, Sep 23, 2019 at 7:59 PM MRAB python@mrabarnett.plus.com wrote:
On 2019-09-23 21:22, Vinay Sajip via Python-Dev wrote:
OK - but that's just one I picked at random. There are others like it -
what would be the process for deciding which ones need to be made private and moved? Should an issue be raised to track this?
It's not really a question of which ones should be made private, but of which ones should be remain public. If it's mentioned in the docs, then it's public, otherwise it should be private. How easy would it be to search the sources and the docs for the ones that are currently public but not documented? _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4KUPJBNX...

On Mon, Sep 23, 2019 at 1:30 PM Vinay Sajip via Python-Dev python-dev@python.org wrote:
OK - but that's just one I picked at random. There are others like it - what would be the process for deciding which ones need to be made private and moved? Should an issue be raised to track this?
There are really two issues here:
- hiding the symbols that *aren't* marked PyAPI_*, consistently across platforms. - finding symbols that are currently marked PyAPI_*, but shouldn't be.
The first one is a pretty straightforward technical improvement. The second one is a longer-term project that could easily get bogged down in complex judgement calls. So let's worry about them separately. Even if there are too many symbols marked PyAPI_*, we can still get started on hiding all the symbols that we *know* should be hidden.
-n

Fair enough. Pull request created:
https://github.com/python/cpython/pull/16347
Regards,
Vinay Sajip

On Tue., 24 Sep. 2019, 10:29 am Nathaniel Smith, njs@pobox.com wrote:
On Mon, Sep 23, 2019 at 1:30 PM Vinay Sajip via Python-Dev python-dev@python.org wrote:
OK - but that's just one I picked at random. There are others like it -
what would be the process for deciding which ones need to be made private and moved? Should an issue be raised to track this?
There are really two issues here:
- hiding the symbols that *aren't* marked PyAPI_*, consistently across
platforms.
- finding symbols that are currently marked PyAPI_*, but shouldn't be.
The first one is a pretty straightforward technical improvement. The second one is a longer-term project that could easily get bogged down in complex judgement calls. So let's worry about them separately. Even if there are too many symbols marked PyAPI_*, we can still get started on hiding all the symbols that we *know* should be hidden.
One of the other things to keep in mind is that a lot of new symbol exports are created by copying an existing one, rather than by going & reading the C API documentation on exporting symbols correctly.
Historically, though, it wasn't especially obvious from the header files themselves that "exported for internal linkage" and "exported as part of the public C API" should be written differently, so it was easy to use the wrong style, and patch reviewers wouldn't necessarily notice.
With the structural split now in place, the creep of incorrectly published symbols should be contained, and the public header files can start being reviewed for cases like this one.
Cheers, Nick.

Vinay Sajip wrote:
Right, I'm pretty sure that right now Python doesn't have any way to share symbols between .c files without also exposing them in the C API. On other C projects I've worked on, the public API is expressed in one set
of header files, and internal APIs that need to be exposed across modules are described in a different set of internal header files, and developers who incorrectly use internal APIs by including the internal headers could see breakage when the internals change ... excuse my naïveté, as I haven't done much at Python's C level - does this discipline/approach not apply to CPython?
As of Python 3.8 we do this sort of separation: https://docs.python.org/3.8/whatsnew/3.8.html#build-and-c-api-changes.

Thanks for the pointer. According to that information, everything in Include/fileutils.h should be the portable public C API. But there are definitions in there which start with _Py - and the information relating to Include/cpython/*.h suggests that API prefixed with _Py is conventionally private. Is there a reason why (for example) _Py_DecodeLocaleEx appears in Include/*.h with a prefix conventionally suggesting it's private, and which is backed up by the fact that it's not mentioned in the C API documentation? If it's just there for the moment because no-one got around to moving it to Include/cpython/*.h, it would be useful to know that. I'm not trying to nit-pick here - if we're going to apply visibility rules, then things like this need to be absolutely clear.
Just for info, I tried configuring and compiling with -fvisibility=hidden and declared default visibility on PyAPI_FUNC, PyAPI_DATA and PyMODINIT_FUNC, as well as one or two other declarations. The resulting build passes all tests run locally, except test_tools (which is currently also failing for me on master). I haven't yet pushed these changes to my public CPython fork, as it's still work in progress - which is why I'm asking these questions.
Regards,
Vinay Sajip

From my understanding, _Py_IDENTIFIER was designed for static usage. The overhead is quite low; keeping it as a private (static) module-level internal variable helps to decouple things.
I understand the importance of decoupling in general, but meanings of __name__, __dict__ etc. aren't going to change. What practical difficulties (in terms of coupling) do you foresee if e.g. all C modules know that somewhere there is a constant value representing the constant string "__dict__", and so on? There's no suggestion that some C modules will use a different constant string value to refer to e.g. the "__dict__" attribute than other C modules, is there? I'm not suggesting that all _Py_IDENTIFIERs need to be extern, just the ones whose meanings are public and can't be changed.
I'm not coming from a worried-about-the-memory-overhead perspective - it's more about the manageability of ad hoc static variables which can be "considered harmful" in certain scenarios - such as sub-interpreters, for example. _Py_IDENTIFIER seems to account for a fair proportion of all static variable declarations.
This target is very important for keeping public API as small as possible.
I'm not suggesting that _Py_IDENTIFIERs become part of the public API. Surely it's not the case that any and every extern identifier becomes part of the public API? (I know people can misuse things and use unsupported internal APIs, but then they're on their own, right?)
Regards,
Vinay Sajip

20.09.19 22:19, Vinay Sajip via Python-Dev пише:
I just ran an analysis of static variable definitions in CPython code, using clang, on Ubuntu and Windows. The results should be available here:
As I understand it, _Py_IDENTIFIER instances are supposed to hold constant strings that are used in Python - e.g. "__class__", "__dict__" and so on. I noticed that there are numerous duplicates of these - e.g. 8 instances of __name__, 11 instances of __dict__, and so on - each instance is defined as static in its source file and so completely distinct from the others.
Note that corresponding strings are interned, so all _Py_IDENTIFIER(__name__) share the same Python object.
There is also the _PyArg_Parser structure which contains reference to the lazy initialized tuple of strings.

On Sat., 21 Sep. 2019, 5:28 am Vinay Sajip via Python-Dev, < python-dev@python.org> wrote:
I just ran an analysis of static variable definitions in CPython code, using clang, on Ubuntu and Windows. The results should be available here:
As I understand it, _Py_IDENTIFIER instances are supposed to hold constant strings that are used in Python - e.g. "__class__", "__dict__" and so on. I noticed that there are numerous duplicates of these - e.g. 8 instances of __name__, 11 instances of __dict__, and so on - each instance is defined as static in its source file and so completely distinct from the others.
I realise the overall amount of memory used by these structures isn't large, but is there any particular benefit to having these multiple copies? The current situation seems a little untidy, at least. What would be the disadvantage of making them extern in the headers and allocating them once in some consts.c module? After all, they seem for the most part to be well-known constant strings that don't need to be encapsulated in any particular C compilation unit.
Moving some of the especially common identifier references into the interpreter state struct may make sense.
Adding more process globals wouldn't be desirable though, as they're one of the more common ways of breaking encapsulation between subinterpreters (hence Eric's efforts to eliminate as many of them as he reasonably can).
Cheers, Nick.

Moving some of the especially common identifier references into the interpreter state struct may make sense. Adding more process globals wouldn't be desirable though, as they're one of the more common ways of breaking encapsulation between subinterpreters (hence Eric's efforts to eliminate as many of them as he reasonably can).
Well, another way of breaking encapsulation is by having ad hoc statics (as opposed to globals) which should really be in the interpreter state - it prevents compartmentalizing the GIL into per-subinterpreter GILs, for example. That's my motivation for looking at this area - I spent a bit of time working with Eric at the recent core dev sprint, and wanted to explore the problems in this area.
Regards,
Vinay
participants (10)
-
Andrew Svetlov
-
Brett Cannon
-
Kyle Stanley
-
MRAB
-
Nathaniel Smith
-
Nick Coghlan
-
Serhiy Storchaka
-
Victor Stinner
-
Vinay Sajip
-
vinay_sajip@yahoo.co.uk