[Import-SIG] Proto-PEP: Redesigning extension module loading

Nick Coghlan ncoghlan at gmail.com
Sat Feb 21 13:19:55 CET 2015


On 21 February 2015 at 00:56, Petr Viktorin <encukou at gmail.com> wrote:
> Hello list,
>
> I have taken Nick's challenge of extension module loading.

Thanks for tackling this!

> I've read some of the relevant discussions, and bounced my ideas off Nick
> to see if I missed anything important.
>
> The main idea I realized, which was not obvious from the discussion,
> was that in addition to playing well with PEP 451 (ModuleSpec) and supporting
> subinterpreters and multiple Py_Initialize/Py_Finalize cycles,
> Nick's Create/Exec proposal allows executing the module in a "foreign",
> externally created module object. The main use case for that would be runpy and
> __main__, but lazy-loading mechanisms were mentioned that would benefit as well.

For everyone else's reference: this actually came up in Petr's earlier
off-list discussions with me, when I realised I'd had the "running
extension modules as __main__" use case in mind myself, but never
actually written that notion down anywhere.

It's the one capability of PyModuleExec_* that simply doesn't exist today.

> As I was writing this down, I realized that once pre-created modules are
> allowed, it makes no sense to insist that they actually are module
> instances -- PyModule_Type provides little functionality above a plain object
> subclass. I'm not sure there are any use cases for this, but I don't see a
> reason to limit things artificially. Any bugs caused by allowing
> non-ModuleType modules are unlikely to be subtle, unless the custom object
> passes the "asked for it" line.
>
> Comments appreciated.

This generally looks good to me. Some more specific feedback inline below.

> PEP: XXX
> Title: Redesigning extension module loading

For the BDFL-Delegate question: Brett would you be happy tackling this one?

> Motivation
> ==========
>
> Python modules and extension modules are not being set up in the same way.
> For Python modules, the module is created and set up first, then the module
> code is being executed (PEP 302).
> A ModuleSpec object (PEP 451) is used to hole information about the module,
> and pased to the relevant hooks.

s/hole/hold/
s/pased/passed/

<snip>

> Furthermore, the majority of currently existing extension modules has
> problems with sub-interpreter support and/or reloading, and, while it is
> possible with the current infrastructure to support these
> features, doing so is neither easy nor efficient.
> Addressing these issues was the goal of PEP 3121, but many extensions
> took the least-effort approach to porting to Python 3, leaving many of these
> issues unresolved.

It's probably worth noting that some of those "least-effort" porting
approaches are in the standard library: this PEP is about solving our
own problems in addition to other people's.

> Thius PEP keeps the backwards-compatible behavior, which should reduce pressure
> and give extension authors adequate time to consider these issues when porting.

s/thius/this/

> The proposal
> ============
>
> The current extension module initialisation will be deprecated in favour of
> a new initialisation scheme. Since the current scheme will continue to be
> available, existing code will continue to work unchanged, including binary
> compatibility.
>
> Extension modules that support the new initialisation scheme must export one
> or both of the public symbols "PyModuleCreate_modulename" and
> "PyModuleExec_modulename", where "modulename" is the
> name of the shared library. This mimics the previous naming convention for
> the "PyInit_modulename" function.
>
> These symbols, if defined, must resolve to C functions with the following
> signatures, respectively::
>
>     PyObject* (*PyModuleCreateFunction)(PyObject* module_spec)
>     int (*PyModuleExecFunction)(PyObject* module)

For the Python level, the model we ended up with for 3.5 is:

1. create_module must exist, but may return None
2. exec_module must exist, but may have no effect on the module state

For the new C level API, it's probably worth drawing the more explicit
parallel to __new__ and __init__ on classes, where you can implement
both of them if you want, but in most cases, implementing only one or
the other will be sufficient.

The reason I suggest that is because I was going to ask if we should
make providing both APIs, or at least PyModuleExec_*, compulsory
(based on the Python Loader API requirements), but thinking of the
__new__/__init__ analogy made me realise that your current design
makes sense, since dealing with it is confined specifically to the
extension module loader implementation.
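
As a rough Python-level model of that analogy, the sketch below treats
the two hooks as optional callables. The names load_extension,
create_hook and exec_hook are hypothetical stand-ins for the C-level
machinery, not part of the proposal, and the int return value of the
real exec function is ignored here:

```python
import types


def load_extension(name, create_hook=None, exec_hook=None):
    """Hypothetical model of a loader where either hook may be omitted."""
    if create_hook is None and exec_hook is None:
        raise ImportError(f"{name!r} defines neither a create nor an exec hook")
    # Like __new__: build the object, or fall back to a plain module.
    if create_hook is not None:
        module = create_hook(name)
    else:
        module = types.ModuleType(name)
    # Like __init__: populate the object, if the extension provides the hook.
    if exec_hook is not None:
        exec_hook(module)
    return module


def exec_only(module):
    # Stand-in for an exec function: state lives in module attributes.
    module.answer = 42


def create_only(name):
    # Stand-in for a create function that also initialises the module.
    module = types.ModuleType(name)
    module.origin = "create hook"
    return module


exec_only_module = load_extension("exec_only_demo", exec_hook=exec_only)
create_only_module = load_extension("create_only_demo", create_hook=create_only)
```

Only the loader function has to cope with a missing hook; the hooks
themselves stay as simple as a class that defines just one of
__new__ or __init__.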

> The PyModuleCreate function
> ---------------------------

<snip>

> When called, this function must create and return a module object.
>
> If "PyModuleExec_module" is undefined, this function must also initialize
> the module; see PyModuleExec_module for details on initialization.

This should be clarified to point out that, as per PEP 451, the import
machinery will still take care of setting the import related
attributes after the loader returns the module from create_module.

> There is no requirement for the returned object to be an instance of
> types.ModuleType. Any type can be used.

The requirement for the returned object to support getting and setting
attributes (as per
https://www.python.org/dev/peps/pep-0451/#attributes) should be
defined here.
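
A small pure-Python sketch of both points, using the existing
importlib.util helpers (available since 3.5). NamespaceLoader is a
hypothetical loader invented for illustration; it returns an object
that is not a module but does support getting and setting attributes,
and the import machinery then fills in the import-related attributes:

```python
import importlib.util
import types


class NamespaceLoader:
    """Hypothetical loader whose create_module returns a non-module object."""

    def create_module(self, spec):
        # Any object that supports getting and setting attributes will do.
        return types.SimpleNamespace()

    def exec_module(self, module):
        module.initialized = True


spec = importlib.util.spec_from_loader("namespace_demo", NamespaceLoader())
# module_from_spec calls create_module, then sets __name__, __spec__, etc.
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```

An object that rejected attribute assignment would not be usable here,
which is why the requirement is worth stating explicitly.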

> This follows the current
> support for allowing arbitrary objects in sys.modules and makes it easier
> for extension modules to define a type that exactly matches their needs for
> holding module state.

+1

> The PyModuleExec function
> -------------------------
>
> This PyModuleExec function is used to implement "loader.exec_module"
> defined in PEP 451.
> It is called after ModuleSpec-related attributes such as ``__loader__``,
> ``__spec__`` and ``__name__`` are set on the module.
> (The full list is in PEP 451 [#pep-0451-attributes]_)
>
> The "PyModuleExec_modulename" function will be called to initialize a module.
> This happens in two situations: when the module is first initialized for
> a given (sub-)interpreter, and when the module is reloaded.
>
> The "module" argument receives the module object.
> If PyModuleCreate is defined, this will be the object returned by it.
> If PyModuleCreate is not defined, PyModuleExec is expected to operate
> on any Python object for which attributes can be added by PyObject_SetAttr*
> and retrieved by PyObject_GetAttr*.
> Specifically, as the module may not be a PyModule_Type subclass,
> PyModule_* functions should not be used on it, unless they explicitly support
> operating on all objects.

I think this is too permissive on the interpreter side of things, thus
making things more complicated than we'd like them to be for extension
module authors.

If PyModuleCreate_* is defined, PyModuleExec_* will receive the object
returned there, while if it isn't defined, the interpreter *will*
provide a PyModule_Type instance, as per PEP 451.

However, permitting module authors to make the PyModule_Type (or a
subclass) assumption in their implementation does introduce a subtle
requirement on the implementation of both the load_module method, and
on custom PyModuleExec_* functions that are paired with a
PyModuleCreate_* function.

Firstly, we need to enforce the following constraint in load_module:
if the underlying C module does *not* define a custom PyModuleCreate_*
function, and we're passed a module execution environment which is
*not* an instance of PyModule_Type, then we should throw TypeError.

By contrast, in the presence of a custom PyModuleCreate_* function,
the requirement for checking the type of the execution environment
(and throwing TypeError if the module can't handle it) should be
delegated to the PyModuleExec_* function, and that will need to be
documented appropriately.

That keeps things simple in the default case (extension module authors
just using PyModuleExec_* can continue to assume the use of
PyModule_Type or a subclass), while allowing more flexibility in the
"power user" case of creating your own module object.
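
The constraint can be modelled in a few lines of Python. This is only
a sketch of the rule described above: check_exec_target and its
has_create_hook flag are hypothetical names, and the real check would
live in the C implementation of the extension module loader:

```python
import types


def check_exec_target(module, has_create_hook):
    """Hypothetical check modelling the proposed load_module constraint."""
    if not has_create_hook and not isinstance(module, types.ModuleType):
        raise TypeError(
            "extension without a create hook requires a module instance, "
            f"got {type(module).__name__}"
        )
    # With a create hook, the extension's own exec function is responsible
    # for validating the object it is handed, so nothing is checked here.


# Default case: a real module object is accepted.
check_exec_target(types.ModuleType("demo"), has_create_hook=False)
# Power-user case: the check is delegated to the extension.
check_exec_target(types.SimpleNamespace(), has_create_hook=True)

try:
    check_exec_target(types.SimpleNamespace(), has_create_hook=False)
except TypeError as exc:
    rejected = str(exc)
else:
    rejected = None
```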

> Usage
> =====
>
> This PEP allows three new ways of creating modules, each with its
> advantages and disadvantages.
>
>
> Exec-only
> ---------
>
> The preferred way to create C extensions is to define "PyModuleExec_modulename"
> only. This brings the following advantages:
>
> * The extension can be loaded into a pre-created module, making it possible
>   to run them as ``__main__``, participate in certain lazy-loading schemes
>   [#lazy_import_concerns]_, or enable other creative uses.
> * The module can be reloaded in the same way as Python modules.
>
> As Exec-only extension modules do not have C-level storage,
> all module-local data must be stored in the module object's attributes,
> possibly using the PyCapsule mechanism.

With my suggested change above, this approach will also let module
authors assume PyModule_Type (or a subclass), and have the interpreter
enforce that assumption on their behalf.

> Create-only
> -----------
>
> Extensions defining only the "PyModuleCreate_modulename" hook behave similarly
> to current extensions.
>
> This is the easiest way to create modules that require custom module objects,
> or substantial per-module state at the C level (using positive
> ``PyModuleDef.m_size``).
>
> When the PyModuleCreate function is called, the module has not yet been added
> to sys.modules.
> Attempts to load the module again (possibly transitively) will result in an
> infinite loop.
> If user code needs to be called in module initialization,
> module authors are advised to do so from the PyModuleExec function.
>
> Reloading a Create-only module does nothing, except re-setting
> ModuleSpec-related attributes described in PEP 0451 [#pep-0451-attributes].

Another advantage of this approach is that you don't need to worry
about potentially being passed a module object of an arbitrary type.

> Exec and Create
> ---------------
>
> Extensions that need to create a custom module object,
> and either need to run user code during initialization or support reloading,
> should define both "PyModuleCreate_modulename" and "PyModuleExec_modulename".

This approach will have the downside of needing to check the type of
the passed in module against the module implementation's assumptions.

> Subinterpreters and Interpreter Reloading
> -----------------------------------------
>
> Extensions using the new initialization scheme are expected to support
> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
> The mechanism is designed to make this easy, but care is still required
> on the part of the extension author.
> No user-defined functions, methods, or instances may leak to different
> interpreters.
> To achieve this, all module-level state should be kept in either the module
> dict, or in the module object.
> A simple rule of thumb is: Do not define any static data, except built-in types
> with no mutable or user-settable class attributes.

Worth noting here that this is why we consider it desirable to provide
a utility somewhere in the standard library to make it easy to do
these kinds of checks.

At the very least we need it in the test.support module to do our own
tests, but it would be preferable to have it as a supported API
somewhere in the standard library.

This isn't the only area where this kind of question of making it
easier for people to test whether or not they're implementing or
emulating a protocol correctly has come up - it's applicable to
testing things like total ordering support in custom objects, operand
precedence handling, ABC compliance, code generation, exception
traceback manipulation, etc.

Perhaps we should propose a new unittest submodule for compatibility
and compliance tests that are too esoteric for the module top level,
but we also don't want to ask people to write for themselves?

> Module Reloading
> ----------------
>
> Extensions that support reloading must define PyModuleExec, which is called
> in reload() to re-initialize the module in place.
> The same caveats apply to reloading an extension module as to reloading
> a Python module.

Assuming you go with my suggestion regarding the PyModule_Type
assumption above, that would be worth reiterating here.
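
For reference, the existing Python-level behaviour that reloading an
extension would mirror can be sketched as follows. CountingLoader,
DemoFinder and the module name are hypothetical and exist only for
this demonstration; the point is that importlib.reload re-runs
exec_module on the same module object rather than creating a new one:

```python
import importlib
import importlib.util
import sys

MODULE_NAME = "reload_demo_target"  # hypothetical module name


class CountingLoader:
    """Hypothetical loader that counts how often the module is executed."""

    def create_module(self, spec):
        return None  # ask the import machinery for a default module

    def exec_module(self, module):
        # Existing attributes survive a reload, so the counter accumulates.
        module.exec_count = getattr(module, "exec_count", 0) + 1


class DemoFinder:
    """Hypothetical meta path finder that only knows MODULE_NAME."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname == MODULE_NAME:
            return importlib.util.spec_from_loader(fullname, CountingLoader())
        return None


finder = DemoFinder()
sys.meta_path.insert(0, finder)
try:
    module = importlib.import_module(MODULE_NAME)
    reloaded = importlib.reload(module)
finally:
    # Leave the import system as it was found.
    sys.meta_path.remove(finder)
    sys.modules.pop(MODULE_NAME, None)
```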

> Multiple modules in one library
> -------------------------------
>
> To support multiple Python modules in one shared library, the library
> must export all appropriate PyModuleExec_<name> or PyModuleCreate_<name> hooks
> for each exported module.
> The modules are loaded using a ModuleSpec with origin set to the name of the
> library file, and name set to the module name.
> Note that this mechanism can only be used to *load* such modules,
> not to *find* them.

If I recall correctly, Brett already updated the extension module
finder to handle locating such modules. It's either that or there's an
existing issue on the tracker for it.

> Open issues
> ===========
>
> Now that PEP 442 is implemented, it would be nice if module finalization
> did not set all attributes to None,

Antoine added that in 3.4: http://bugs.python.org/issue18214

However, it wasn't entirely effective, as several extension modules
still need to be hit with a sledgehammer to get them to drop
references properly. Asking "Why is that so?" is actually one of the
things that got me started digging into this area a couple of years
back.

> In this scheme, it is not possible to create a module with C-level state,
> which would be able to exec itself in any externally provided module object,
> short of putting PyCapsules in the module dict.

I suspect "PyCapsule in the module dict" may be the right answer here,
in which case some suitable documentation and perhaps some convenience
APIs could be a good way to go.

Relying on PyCapsule also has the advantage of potentially supporting
better collaboration between extension modules, without needing to
link them with each other directly.

> The proposal repurposes PyModule_SetDocString, PyModule_AddObject,
> PyModule_AddIntMacro et.al. to work on any object.
> Would it be better to have these in the PyObject namespace?

With my proposal above to keep the PyModule_Type assumption in most
cases, I think it may be better to leave them alone entirely. If folks
decide to allow non-module types, they can decide to handle the
consequences.

> We should expose some kind of API in importlib.util (or a better place?) that
> can be used to check that a module works with reloading and subinterpreters.

See comments above on that.

> The runpy module will need to be modified to take advantage of PEP 451
> and this PEP. This might be out of scope for this PEP.

I think it's out of scope, but runpy *does* need an internal redesign
to take full advantage of PEP 451. Currently it works by attempting to
extract the code object directly in most situations, whereas PEP 451
should let it rely almost entirely on exec_module instead (with direct
execution used only when it's actually given a path directly to a
Python source or bytecode file).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

