PEP 489: Redesigning extension module loading

Hello,

On import-sig, I've agreed to continue Nick Coghlan's work on making
extension modules act more like Python ones, work well with PEP 451
(ModuleSpec), and encourage proper subinterpreter and reloading support.
Here is the resulting PEP.

I don't have a patch yet, but I'm working on it.

There's a remaining open issue: providing a tool that can be run in test
suites to check if a module behaves well with subinterpreters/reloading.
I believe it's out of scope for this PEP but speak out if you disagree.

Please discuss on import-sig.

=======================

PEP: 489
Title: Redesigning extension module loading
Version: $Revision$
Last-Modified: $Date$
Author: Petr Viktorin <encukou@gmail.com>,
        Stefan Behnel <stefan_ml@behnel.de>,
        Nick Coghlan <ncoghlan@gmail.com>
Discussions-To: import-sig@python.org
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 11-Aug-2013
Python-Version: 3.5
Post-History: 23-Aug-2013, 20-Feb-2015
Resolution:


Abstract
========

This PEP proposes a redesign of the way in which extension modules interact
with the import machinery. This was last revised for Python 3.0 in PEP 3121,
but did not solve all problems at the time. The goal is to solve them
by bringing extension modules closer to the way Python modules behave;
specifically to hook into the ModuleSpec-based loading mechanism
introduced in PEP 451.

Extensions that do not require custom memory layout for their module objects
may be executed in arbitrary pre-defined namespaces, paving the way for
extension modules being runnable with Python's ``-m`` switch.
Other extensions can use custom types for their module implementation.
Module types are no longer restricted to types.ModuleType.

This proposal makes it easy to support properties at the module
level and to safely store arbitrary global state in the module that is
covered by normal garbage collection and supports reloading and
sub-interpreters.
Extension authors are encouraged to take these issues into account
when using the new API.


Motivation
==========

Python modules and extension modules are not being set up in the same way.
For Python modules, the module is created and set up first, then the module
code is being executed (PEP 302).
A ModuleSpec object (PEP 451) is used to hold information about the module,
and passed to the relevant hooks.

For extensions, i.e. shared libraries, the module init function is executed
straight away and does both the creation and initialisation.
The initialisation function is not passed ModuleSpec information about
the loaded module, such as the __file__ or fully-qualified name.
This hinders relative imports and resource loading.

This is specifically a problem for Cython generated modules, for which it's
not uncommon that the module init code has the same level of complexity as
that of any 'regular' Python module. Also, the lack of __file__ and __name__
information hinders the compilation of __init__.py modules, i.e. packages,
especially when relative imports are being used at module init time.

The other disadvantage of the discrepancy is that existing Python programmers
learning C cannot effectively map concepts between the two domains.
As long as extension modules are fundamentally different from pure Python ones
in the way they're initialised, they are harder for people to pick up without
relying on something like cffi, SWIG or Cython to handle the actual extension
module creation.

Currently, extension modules are also not added to sys.modules until they are
fully initialized, which means that a (potentially transitive)
re-import of the module will really try to reimport it and thus run into an
infinite loop when it executes the module init function again.
Without the fully qualified module name, it is not trivial to correctly add
the module to sys.modules either.

Furthermore, the majority of currently existing extension modules have
problems with sub-interpreter support and/or reloading, and, while it is
possible with the current infrastructure to support these
features, it is neither easy nor efficient.
Addressing these issues was the goal of PEP 3121, but many extensions,
including some in the standard library, took the least-effort approach
to porting to Python 3, leaving these issues unresolved.
This PEP keeps the backwards-compatible behavior, which should reduce pressure
and give extension authors adequate time to consider these issues when porting.


The current process
===================

Currently, extension modules export an initialisation function named
"PyInit_modulename", named after the file name of the shared library. This
function is executed by the import machinery and must return either NULL in
the case of an exception, or a fully initialised module object. The
function receives no arguments, so it has no way of knowing about its
import context.

During its execution, the module init function creates a module object
based on a PyModuleDef struct. It then continues to initialise it by adding
attributes to the module dict, creating types, etc.

Behind the scenes, the shared library loader keeps a note of the fully
qualified module name of the last module that it loaded, and when a module
gets created that has a matching name, this global variable is used to
determine the fully qualified name of the module object. This is not entirely
safe as it relies on the module init function creating its own module object
first, but this assumption usually holds in practice.


The proposal
============

The current extension module initialisation will be deprecated in favour of
a new initialisation scheme. Since the current scheme will continue to be
available, existing code will continue to work unchanged, including binary
compatibility.

Extension modules that support the new initialisation scheme must export
the public symbol "PyModuleExec_modulename", and optionally
"PyModuleCreate_modulename", where "modulename" is the
name of the module. This mimics the previous naming convention for
the "PyInit_modulename" function.

If defined, these symbols must resolve to C functions with the following
signatures, respectively::

    int (*PyModuleExecFunction)(PyObject* module)
    PyObject* (*PyModuleCreateFunction)(PyObject* module_spec)


The PyModuleExec function
-------------------------

The PyModuleExec function is used to implement "loader.exec_module"
defined in PEP 451.

This function will be called to initialize a module. (Usually, this amounts
to setting the module's initial attributes.)
This happens in two situations: when the module is first initialized for
a given (sub-)interpreter, and possibly later when the module is reloaded.

When PyModuleExec is called, the module has already been added to
sys.modules, and import-related attributes (specified in
PEP 451 [#pep-0451-attributes]_) have been set on the module.

The "module" argument receives the module object to initialize.

If PyModuleCreate is defined, "module" will generally be the object
returned by it.
It is possible for a custom loader to pass any object to
PyModuleExec, so this function should check and fail with TypeError
if the module's type is unsupported.
Any other assumptions should also be checked.

If PyModuleCreate is not defined, PyModuleExec is expected to operate
on any Python object for which attributes can be added by PyObject_GetAttr*
and retrieved by PyObject_SetAttr*.
This allows loading an extension into a pre-created module, making it possible
to run it as __main__ in the future, participate in certain lazy-loading
schemes [#lazy_import_concerns]_, or enable other creative uses.

If PyModuleExec replaces the module's entry in sys.modules,
the new object will be used and returned by importlib machinery.
(This mirrors the behavior of Python modules. Note that for extensions,
implementing PyModuleCreate is usually a better solution for the use cases
this serves.)

The function must return ``0`` on success, or, on error, set an exception and
return ``-1``.


The PyModuleCreate function
---------------------------

The optional PyModuleCreate function is used to implement
"loader.create_module" defined in PEP 451.
By exporting it, an extension module indicates that it uses a custom
module object.
This prevents loading the extension in a pre-created module,
but gives greater flexibility in allowing a custom C-level layout
of the module object.
Most extensions will not need to implement this function.

The "module_spec" argument receives a "ModuleSpec" instance, as defined in
PEP 451.

When called, this function must create and return a module object,
or set an exception and return NULL.
There is no requirement for the returned object to be an instance of
types.ModuleType. Any type can be used, as long as it supports setting and
getting attributes, including at least the import-related attributes
specified in PEP 451 [#pep-0451-attributes]_.
This follows the current support for allowing arbitrary objects in sys.modules
and makes it easier for extension modules to define a type that exactly matches
their needs for holding module state.

Note that when this function is called, the module's entry in sys.modules
is not populated yet. Attempting to import the same module again
(possibly transitively) may lead to an infinite loop.
Extension authors are advised to keep PyModuleCreate minimal, and in
particular to not call user code from it.

If PyModuleCreate is not defined, the default loader will construct
a module object as if with PyModule_New.


Initialization helper functions
-------------------------------

For two initialization tasks previously done by PyModule_Create,
two functions are introduced::

    int PyModule_SetDocString(PyObject *m, const char *doc)
    int PyModule_AddFunctions(PyObject *m, PyMethodDef *functions)

These set the module docstring, and add the module functions, respectively.
Both will work on any Python object that supports setting attributes.
They return ``0`` on success, and on failure, they set the exception
and return ``-1``.


PyCapsule convenience functions
-------------------------------

Instead of custom module objects, PyCapsule will become the preferred
mechanism for storing per-module C data.
Two new convenience functions will be added to help with this.

*
    ::

        PyObject *PyModule_AddCapsule(
            PyObject *module,
            const char *module_name,
            const char *attribute_name,
            void *pointer,
            PyCapsule_Destructor destructor)

    Add a new PyCapsule to *module* as *attribute_name*.
    The capsule name is formed by joining *module_name* and *attribute_name*
    by a dot.

    This convenience function can be used from a module initialization function
    instead of separate calls to PyCapsule_New and PyModule_AddObject.

    Returns a borrowed reference to the new capsule,
    or NULL (with exception set) on failure.

*
    ::

        void *PyModule_GetCapsulePointer(
            PyObject *module,
            const char *module_name,
            const char *attribute_name)

    Returns the pointer stored in *module* as *attribute_name*, or NULL
    (with an exception set) on failure. The capsule name is formed by joining
    *module_name* and *attribute_name* by a dot.

    This convenience function can be used instead of separate calls to
    PyObject_GetAttr and PyCapsule_GetPointer.

Extension authors are encouraged to define a macro to
call PyModule_GetCapsulePointer and cast the result to an appropriate type.


Generalizing PyModule_* functions
---------------------------------

The following functions and macros will be modified to work on any object
that supports attribute access:

* PyModule_GetNameObject
* PyModule_GetName
* PyModule_GetFilenameObject
* PyModule_GetFilename
* PyModule_AddIntConstant
* PyModule_AddStringConstant
* PyModule_AddIntMacro
* PyModule_AddStringMacro
* PyModule_AddObject

The PyModule_GetDict function will continue to only work on true
module objects. This means that it should not be used on extension modules
that only define PyModuleExec.


Legacy Init
-----------

If PyModuleExec is not defined, the import machinery will try to initialize
the module using the PyInit hook, as described in PEP 3121.

If PyModuleExec is defined, PyInit will be ignored.
Modules requiring compatibility with previous versions of CPython may implement
PyInit in addition to the new hook.


Subinterpreters and Interpreter Reloading
-----------------------------------------

Extensions using the new initialization scheme are expected to support
subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
The mechanism is designed to make this easy, but care is still required
on the part of the extension author.
No user-defined functions, methods, or instances may leak to different
interpreters.
To achieve this, all module-level state should be kept in either the module
dict, or in the module object.
A simple rule of thumb is: Do not define any static data, except built-in types
with no mutable or user-settable class attributes.


Module Reloading
----------------

Reloading an extension module will re-execute its PyModuleExec function.
Similar caveats apply to reloading an extension module as to reloading
a Python module. Notably, attributes or any other state of the module
are not reset before reloading.

Additionally, due to limitations in shared library loading (both dlopen on
POSIX and LoadLibraryEx on Windows), it is not generally possible to load
a modified library after it has changed on disk.
Therefore, reloading extension modules is of limited use.


Multiple modules in one library
-------------------------------

To support multiple Python modules in one shared library, the library
must export appropriate PyModuleExec_<name> or PyModuleCreate_<name> hooks
for each exported module.
The modules are loaded using a ModuleSpec with origin set to the name of the
library file, and name set to the module name.

Note that this mechanism can currently only be used to *load* such modules,
not to *find* them.

XXX: This is an existing issue; either fix it/wait for a fix or provide
an example of how to load such modules.


Implementation
==============

XXX - not started


Open issues
===========

We should expose some kind of API in importlib.util (or a better place?) that
can be used to check that a module works with reloading and subinterpreters.


Related issues
==============

The runpy module will need to be modified to take advantage of PEP 451
and this PEP. This is out of scope for this PEP.


Previous Approaches
===================

Stefan Behnel's initial proto-PEP [#stefans_protopep]_
had a "PyInit_modulename" hook that would create a module class,
whose ``__init__`` would be then called to create the module.
This proposal did not correspond to the (then nonexistent) PEP 451,
where module creation and initialization is broken into distinct steps.
It also did not support loading an extension into pre-existing module objects.

Nick Coghlan proposed the Create and Exec hooks, and wrote a prototype
implementation [#nicks-prototype]_.
At this time PEP 451 was still not implemented, so the prototype
does not use ModuleSpec.


References
==========

.. [#lazy_import_concerns]
   https://mail.python.org/pipermail/python-dev/2013-August/128129.html

.. [#pep-0451-attributes]
   https://www.python.org/dev/peps/pep-0451/#attributes

.. [#stefans_protopep]
   https://mail.python.org/pipermail/python-dev/2013-August/128087.html

.. [#nicks-prototype]
   https://mail.python.org/pipermail/python-dev/2013-August/128101.html


Copyright
=========

This document has been placed in the public domain.
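The two-step flow that the proposed Create and Exec hooks plug into can be tried out in pure Python with the existing PEP 451 machinery. The following is only a rough sketch, not the proposed C API: ``SketchLoader``, ``sketch_ext`` and the attribute names are hypothetical, and the loader's two methods merely stand in for PyModuleCreate and PyModuleExec.

```python
import importlib.util
import sys
import types
from importlib.abc import Loader


class SketchLoader(Loader):
    """Hypothetical loader standing in for the proposed Create/Exec hooks."""

    def create_module(self, spec):
        # Stand-in for a missing PyModuleCreate: returning None asks the
        # import machinery to build a default module object.
        return None

    def exec_module(self, module):
        # Stand-in for PyModuleExec: the object already exists, so this
        # step only fills in attributes.
        module.in_sys_modules_during_exec = module.__name__ in sys.modules
        module.answer = 42


spec = importlib.util.spec_from_loader("sketch_ext", SketchLoader())
module = importlib.util.module_from_spec(spec)  # sets __name__, __spec__, ...
sys.modules[spec.name] = module  # the import system registers it before exec
spec.loader.exec_module(module)

# Executing into a pre-created object, as the PEP allows when no
# Create hook is defined.
precreated = types.ModuleType("precreated")
SketchLoader().exec_module(precreated)
```

Because the exec step only sets attributes, it works on any object that accepts them, which is the property the PEP relies on for running an extension in a pre-created namespace.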

On 16 March 2015 Petr Viktorin wrote:
If PyModuleCreate is not defined, PyModuleExec is expected to operate on any Python object for which attributes can be added by PyObject_GetAttr* and retrieved by PyObject_SetAttr*.
I assume it is the other way around (add with Set and retrieve with Get), rather than a description of the required form of magic.
PyObject *PyModule_AddCapsule( PyObject *module, const char *module_name, const char *attribute_name, void *pointer, PyCapsule_Destructor destructor)
What happens if module_name doesn't match the module's __name__? Does it become a hidden attribute? A dotted attribute? Is the result undefined? Later, there is
void *PyModule_GetCapsulePointer( PyObject *module, const char *module_name, const char *attribute_name)
with the same apparently redundant arguments, but not a PyModule_SetCapsulePointer. Are capsule pointers read-only, or can they be replaced with another call to PyModule_AddCapsule, or by a simple PyObject_SetAttr?
Subinterpreters and Interpreter Reloading ... No user-defined functions, methods, or instances may leak to different interpreters.
By "user-defined" do you mean "defined in python, as opposed to in the extension itself"? If so, what is the recommendation for modules that do want to support, say, callbacks? A dual-layer mapping that uses the interpreter as the first key? Naming it _module and only using it indirectly through module.py, which is not shared across interpreters? Not using this API at all?
To achieve this, all module-level state should be kept in either the module dict, or in the module object.
I don't see how that is related to leakage.
A simple rule of thumb is: Do not define any static data, except built-in types with no mutable or user-settable class attributes.
What about singleton instances? Should they be per-interpreter? What about constants, such as PI? Where should configuration variables (e.g., MAX_SEARCH_DEPTH) be kept? What happens if this no-leakage rule is violated? Does the module not load, or does it just maybe lead to a crash down the road?

-jJ

--
If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ

On Mon, Mar 16, 2015 at 4:42 PM, Jim J. Jewett <jimjjewett@gmail.com> wrote:
On 16 March 2015 Petr Viktorin wrote:
If PyModuleCreate is not defined, PyModuleExec is expected to operate on any Python object for which attributes can be added by PyObject_GetAttr* and retrieved by PyObject_SetAttr*.
I assume it is the other way around (add with Set and retrieve with Get), rather than a description of the required form of magic.
Right you are, I mixed that up.
PyObject *PyModule_AddCapsule( PyObject *module, const char *module_name, const char *attribute_name, void *pointer, PyCapsule_Destructor destructor)
What happens if module_name doesn't match the module's __name__? Does it become a hidden attribute? A dotted attribute? Is the result undefined?
The module_name is used to name the capsule, following the convention from PyCapsule_Import. The "module.__name__" is not used or checked. The function would do this:

    capsule_name = module_name + '.' + attribute_name
    capsule = PyCapsule_New(pointer, capsule_name, destructor)
    PyModule_AddObject(module, attribute_name, capsule)

just with error handling, and suitable C code for the "+". I will add the pseudocode to the PEP.
Later, there is
void *PyModule_GetCapsulePointer( PyObject *module, const char *module_name, const char *attribute_name)
with the same apparently redundant arguments,
Here the behavior would be:

    capsule_name = module_name + '.' + attribute_name
    capsule = PyObject_GetAttr(module, attribute_name)
    return PyCapsule_GetPointer(capsule, capsule_name)
but not a PyModule_SetCapsulePointer. Are capsule pointers read-only, or can they be replaced with another call to PyModule_AddCapsule, or by a simple PyObject_SetAttr?
You can replace the capsule using either of those two, or set the pointer using PyCapsule_SetPointer, or (most likely) change the data the pointer points to. The added functions are just simple helpers for common operations, meant to encourage keeping per-module state.
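The naming behaviour described in this exchange can be modelled in a few lines of Python. This is a toy sketch only: the ``Capsule`` class and both helper functions are hypothetical stand-ins for PyCapsule and the two proposed C functions, and the destructor argument is left out. The name check mirrors PyCapsule_GetPointer, which fails when the requested name differs from the capsule's own.

```python
import types


class Capsule:
    """Toy stand-in for PyCapsule: a named wrapper around an opaque value."""

    def __init__(self, pointer, name):
        self.pointer = pointer
        self.name = name

    def get_pointer(self, name):
        # Like PyCapsule_GetPointer, refuse a lookup under the wrong name.
        if name != self.name:
            raise ValueError("capsule name mismatch")
        return self.pointer


def add_capsule(module, module_name, attribute_name, pointer):
    # The capsule name is built from the arguments; module.__name__ is
    # neither used nor checked.
    capsule = Capsule(pointer, module_name + "." + attribute_name)
    setattr(module, attribute_name, capsule)
    return capsule


def get_capsule_pointer(module, module_name, attribute_name):
    capsule = getattr(module, attribute_name)
    return capsule.get_pointer(module_name + "." + attribute_name)


# The module's __name__ ("other") deliberately differs from module_name.
mod = types.ModuleType("other")
add_capsule(mod, "spam", "state", {"count": 0})
state = get_capsule_pointer(mod, "spam", "state")

# A second call under the same attribute name simply replaces the capsule.
add_capsule(mod, "spam", "state", {"count": 1})
replaced = get_capsule_pointer(mod, "spam", "state")

try:
    get_capsule_pointer(mod, "eggs", "state")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In this model a mismatched module_name does not create a hidden or dotted attribute; it only makes a later lookup under a different name fail.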
Subinterpreters and Interpreter Reloading ... No user-defined functions, methods, or instances may leak to different interpreters.
By "user-defined" do you mean "defined in python, as opposed to in the extension itself"?
Yes.
If so, what is the recommendation for modules that do want to support, say, callbacks? A dual-layer mapping that uses the interpreter as the first key? Naming it _module and only using it indirectly through module.py, which is not shared across interpreters? Not using this API at all?
There is a separate module object, with its own dict, for each subinterpreter (as when creating the module with "PyModuleDef.m_size == 0" today). Callbacks should be stored on the appropriate module instance. Does that answer your question? I'm not sure how you meant "callbacks".
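A small Python model may make the per-interpreter separation more concrete. It is a sketch under stated assumptions: the two ``types.ModuleType`` objects stand in for the separate module objects that two subinterpreters would each receive, and ``exec_module``, ``register_callback`` and ``fire`` are hypothetical names, not part of the proposal.

```python
import types


def exec_module(module):
    # Stand-in for PyModuleExec: all state is created on the module object,
    # never in a process-wide static variable.
    module.callbacks = []


def register_callback(module, callback):
    module.callbacks.append(callback)


def fire(module, value):
    # Only callbacks registered on this particular module object are called.
    return [callback(value) for callback in module.callbacks]


# One module object per (simulated) interpreter.
mod_a = types.ModuleType("spam")
mod_b = types.ModuleType("spam")
exec_module(mod_a)
exec_module(mod_b)

register_callback(mod_a, lambda x: x + 1)

results_a = fire(mod_a, 1)
results_b = fire(mod_b, 1)
```

The callback registered through ``mod_a`` never becomes reachable from ``mod_b``, which is the kind of leakage the rule of thumb about static data is meant to prevent.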
To achieve this, all module-level state should be kept in either the module dict, or in the module object.
I don't see how that is related to leakage.
A simple rule of thumb is: Do not define any static data, except built-in types with no mutable or user-settable class attributes.
What about singleton instances? Should they be per-interpreter?
Yes, definitely.
What about constants, such as PI?
In PyModuleExec, create the constant using PyFloat_FromDouble, and add it using PyModule_AddObject. That will do the right thing. (Float constants can be shared, since they cannot refer to user-defined code. But this PEP shields you from needing to know this for every type.)
Where should configuration variables (e.g., MAX_SEARCH_DEPTH) be kept?
On the module object.
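In Python terms, the advice for constants and configuration amounts to roughly the following. This is a hedged sketch only: ``exec_module`` and ``depth_allowed`` are hypothetical, and the default value of 10 is made up for illustration.

```python
import math
import types


def exec_module(module):
    # Stand-in for PyModuleExec: constants and configuration are set as
    # attributes of the module object being initialized.
    module.PI = math.pi
    module.MAX_SEARCH_DEPTH = 10  # illustrative default, not from the PEP


def depth_allowed(module, depth):
    # Read the setting from the module object on every call instead of
    # caching it in a global.
    return depth <= module.MAX_SEARCH_DEPTH


mod_a = types.ModuleType("spam")
mod_b = types.ModuleType("spam")
exec_module(mod_a)
exec_module(mod_b)

# Reconfiguring one module object leaves the other untouched.
mod_a.MAX_SEARCH_DEPTH = 3
```

Each module object carries its own copy of the setting, so a change made in one interpreter is not visible in another.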
What happens if this no-leakage rule is violated? Does the module not load, or does it just maybe lead to a crash down the road?
It may, as today, lead to unexpected behavior down the road. This is explained here: https://docs.python.org/3/c-api/init.html#sub-interpreter-support Unfortunately, there's no good way to detect such leakage. This PEP adds the tools, documentation, and guidelines to make it easy to do the right thing, but won't prevent you from shooting yourself in the foot in C code. Thank you for sharing your concerns! I will keep them in mind when writing the docs for this.

On 16/03/2015 12:38, Petr Viktorin wrote:
Hello,
Can you use anything from the meta issue http://bugs.python.org/issue15787 for PEP 3121 and PEP 384 or will the work that you are doing render everything done previously redundant?

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence

On 18 March 2015 at 06:41, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
On 16/03/2015 12:38, Petr Viktorin wrote:
Hello,
Can you use anything from the meta issue http://bugs.python.org/issue15787 for PEP 3121 and PEP 384 or will the work that you are doing render everything done previously redundant?
Nothing should break in relation to PEP 3121 or 384, so I think that determination would still need to be made on a case by case basis. Alternatively, it may be possible to update the abitype.py converter to also switch to the new module initialisation hooks (if we can figure out a good way of automating that).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (4): Jim J. Jewett, Mark Lawrence, Nick Coghlan, Petr Viktorin