[Import-SIG] PEP 489: Redesigning extension module loading

Stefan Behnel stefan_ml at behnel.de
Thu Mar 19 11:31:14 CET 2015


Hi Petr,

Thanks for working on this. I've added my comments inline.

> Motivation
> ==========
> ...
> The other disadvantage of the discrepancy is that existing Python programmers
> learning C cannot effectively map concepts between the two domains.
> As long as extension modules are fundamentally different from pure Python ones
> in the way they're initialised, they are harder for people to pick up without
> relying on something like cffi, SWIG or Cython to handle the actual extension
> module creation.

I don't think cffi fits as an example of extension module creation. It's
more similar to ctypes, i.e. it tries to *avoid* third-party extension modules.


> The proposal
> ============
> ...
> Extension modules that support the new initialisation scheme must export
> the public symbol "PyModuleExec_modulename", and optionally
> "PyModuleCreate_modulename", where "modulename" is the
> name of the module. This mimics the previous naming convention for
> the "PyInit_modulename" function.

Just a minor thing, but wouldn't it be better if the two had a common
pre-underscore categorisation prefix, just like all other C-API functions?

"PyExtModule_Exec_modulename" ?

Pros: matches existing naming conventions, suggests that there's more than
one function in this API corner ("you didn't know about Create when you
copied my example?")

Cons: longer and less beautiful name

BTW, is there any way at all we can allow non-ASCII module names in this
scheme? (Might not be in scope for this PEP, but if we change the module
init scheme "for good" this time, it would be nice to have an idea if it'd
be possible to support at all in the future.)
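
To make the naming question concrete, here is a minimal sketch of how a
loader might derive the hook symbol from the module name. The helper
build_hook_symbol() is hypothetical (it is not part of CPython), and it
assumes that the exported symbol has to be a plain ASCII C identifier, which
is why a non-ASCII module name is simply rejected here rather than mapped to
anything.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only: writes "<prefix><modulename>"
 * into buf. Returns 0 on success, or -1 if the module name is empty, is not
 * a plain ASCII identifier, or the result does not fit into buf. */
int build_hook_symbol(char *buf, size_t bufsize,
                      const char *prefix, const char *modulename)
{
    size_t i;
    int n;

    if (modulename[0] == '\0')
        return -1;

    /* Assumption: only [A-Za-z_][A-Za-z0-9_]* can appear in the symbol. */
    for (i = 0; modulename[i] != '\0'; i++) {
        unsigned char c = (unsigned char)modulename[i];
        int ok = (c == '_')
              || (c >= 'a' && c <= 'z')
              || (c >= 'A' && c <= 'Z')
              || (i > 0 && c >= '0' && c <= '9');
        if (!ok)
            return -1;
    }

    n = snprintf(buf, bufsize, "%s%s", prefix, modulename);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;  /* encoding error or truncated output */
    return 0;
}
```

Either prefix goes through the same lookup; only the string passed as
"prefix" changes ("PyModuleExec_" versus "PyExtModule_Exec_").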


> The PyModuleCreate function
> ---------------------------

I'd move this section here (before Exec) to match the process order and
avoid forward references in the Exec section.

It's worth stating explicitly when this function will be called. I guess
it's always called right before Exec, also for subinterpreters and reload?


> The PyModuleExec function
> -------------------------
> ...
> If PyModuleCreate is not defined, PyModuleExec is expected to operate
> on any Python object for which attributes can be added by PyObject_GetAttr*
> and retrieved by PyObject_SetAttr*.

Good point. I think it's a valid requirement (and not a real restriction)
that PEP-489 extension modules without a Create must accept any kind of
object as "module", not just a PyModuleObject.

The main problem with these things is that, in practice, the module *will*
continue to be a PyModuleObject for the foreseeable future, so module
authors will implicitly rely on it in one way or another...

However:

> This allows loading an extension into a pre-created module, making it possible
> to run it as __main__ in the future, participate in certain lazy-loading
> schemes [#lazy_import_concerns]_, or enable other creative uses.

That sounds like a rather random bucket of potential future extensions. I'm
not sure it should be part of the PEP.

But how is this requirement related to "__main__"? Does the proposed scheme
really prevent that when Create *is* being implemented? How so?

Or does it mean that modules that provide a Create function will never be
able to be loaded lazily or in "other creative" ways? Cython modules will
almost certainly (pending an actual implementation) provide their own
Create function, for example. And others as well, given that a previous
section has warmly advertised Create as a way to implement module
properties, a feature that many extension module authors have felt a use
for at some point.


> Initialization helper functions
> -------------------------------
> 
> For two initialization tasks previously done by PyModule_Create,
> two functions are introduced::
> 
>     int PyModule_SetDocString(PyObject *m, const char *doc)
>     int PyModule_AddFunctions(PyObject *m, PyMethodDef *functions)
> 
> These set the module docstring, and add the module functions, respectively.

Are these intended to be called by Create or Exec? While it sounds most
appropriate to have Create set up the basic module object, calling these in
Exec (i.e. after letting CPython set the module name/path/etc.) gives more
freedom to the user. Should it matter? Should we suggest generally calling
them in Exec instead of Create? (If only for consistency with modules that
do not have a Create...)


> PyCapsule convenience functions
> -------------------------------
> 
> Instead of custom module objects, PyCapsule will become the preferred
> mechanism for storing per-module C data.

Why? Isn't an extension type a much simpler and substantially faster thing
to use than an indirection through a capsule? Are we really encouraging
users to let CPython do a string concatenation, Python string object
creation, module attribute lookup and pointer extraction, just to access
some value in the current module state? That sounds like a horrible amount
of overhead.

While a custom module extension type might not be entirely trivial to set
up manually, it's still mostly just copy&paste (i.e. simple enough) and
provides largely superior performance: a simple pointer indirection instead
of the entire lookup dance above.

Why not just rely on PyModule_GetState() for the time being? If we ever
need to extend that mechanism and pass a different module object type into
Exec(), that gives us a single place to support different (future) module
types as well. And code that implements and returns its own module type
from Create() can and will do its own straightforward cast anyway.


>         void *PyModule_GetCapsulePointer(
>             PyObject *module,
>             const char *module_name,
>             const char *attribute_name)
> 
>     Returns the pointer stored in *module* as *attribute_name*, or NULL
>     (with an exception set) on failure. The capsule name is formed by joining
>     *module_name* and *attribute_name* by a dot.
> 
>     This convenience function can be used instead of separate calls to
>     PyObject_GetAttr and PyCapsule_GetPointer.

But that requires the user code to know the module name in all places where
module state is needed (i.e. almost everywhere). Doesn't that run counter to
the idea of passing the module spec into the Create function?

And why is it necessary to pass the C encoded module name if the module
itself (which knows its name as a readily prepared Python string) is the
very first argument?

BTW, it's worth mentioning the expected encoding of the C encoded names.
UTF-8, I guess.


> Generalizing PyModule_* functions
> ---------------------------------
> 
> The following functions and macros will be modified to work on any object
> that supports attribute access:
> 
>     * PyModule_GetNameObject
>     * PyModule_GetName
>     * PyModule_GetFilenameObject
>     * PyModule_GetFilename
>     * PyModule_AddIntConstant
>     * PyModule_AddStringConstant
>     * PyModule_AddIntMacro
>     * PyModule_AddStringMacro
>     * PyModule_AddObject
> 
> The PyModule_GetDict function will continue to only work on true module
> objects. This means that it should not be used on extension modules that only
> define PyModuleExec.

That leads to somewhat unfortunate API naming, but I think it's acceptable.

PyModule_GetState() is also worth mentioning here, in the same way as
GetDict().


> Legacy Init
> -----------
> 
> If PyModuleExec is not defined, the import machinery will try to initialize
> the module using the PyModuleInit hook, as described in PEP 3121.

The name is "PyInit_modulename".


> If PyModuleExec is defined, PyModuleInit will be ignored.
> Modules requiring compatibility with previous versions of CPython may
> implement PyModuleInit in addition to the new hook.

I guess the idea would be to implement PyInit() by calling either Create()
or PyModule_Create(), and then Exec(), right? Should we suggest that in the
PEP?


> Subinterpreters and Interpreter Reloading
> -----------------------------------------
> 
> Extensions using the new initialization scheme are expected to support
> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
> The mechanism is designed to make this easy, but care is still required
> on the part of the extension author.

Would be nice to add a quick note that subinterpreter support basically
means that the Create/Exec dance will be repeated for each interpreter
instance, and that the module object will be garbage collected at the end
of each interpreter life cycle.


> No user-defined functions, methods, or instances may leak to different
> interpreters.
> To achieve this, all module-level state should be kept in either the module
> dict, or in the module object.
> A simple rule of thumb is: Do not define any static data, except built-in
> types with no mutable or user-settable class attributes.

I think it's also worth mentioning C-level callbacks explicitly, since they
can be quite tricky in some cases (it's one of the most frequently asked
questions from Cython users). Whatever state is passed into the callback
mechanism must also include a direct or indirect reference to the module or
state object if the callback uses module state in any way (which is not
unlikely).
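
A minimal sketch of what that means in plain C, under a few assumptions: the
callback API offers a "void *user_data" slot, visit_all() stands in for some
third-party C library, and module_state is a hypothetical per-module state
struct (in a real extension it would be reachable from the module object
rather than created by the caller). None of these names come from the PEP.

```c
#include <stddef.h>

/* Hypothetical per-module state; deliberately not a static variable. */
typedef struct {
    int call_count;
    int scale;
} module_state;

/* Assumed callback signature of the external C library. */
typedef int (*visit_fn)(int value, void *user_data);

/* Stand-in for the external library: calls fn once per item and sums
 * the results. */
int visit_all(const int *items, size_t n, visit_fn fn, void *user_data)
{
    int total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += fn(items[i], user_data);
    return total;
}

/* The context handed to the callback carries a reference to the state
 * of the module instance that registered it. */
typedef struct {
    module_state *state;
    int offset;
} callback_context;

/* The callback reaches module state only through its context, never
 * through a global. */
int scaled_visit(int value, void *user_data)
{
    callback_context *ctx = (callback_context *)user_data;
    ctx->state->call_count++;
    return value * ctx->state->scale + ctx->offset;
}
```

Two interpreters would each pass their own module_state through the context,
so a callback running on behalf of one never touches the state of the other.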


> Module Reloading
> ----------------
> 
> Reloading an extension module will re-execute its PyModuleInit function.

"Exec", as Nick already found. Worth mentioning explicitly that Create()
will not be called again and that the object that Exec() receives is the
same as returned by the original call to Create().


> Similar caveats apply to reloading an extension module as to reloading
> a Python module. Notably, attributes or any other state of the module
> are not reset before reloading.

Interesting - is Exec() allowed to take advantage of that by not resetting
some well-selected attributes? E.g. constant global caches? Although I
guess that would run counter to the idea of reloading a module...


> Additionally, due to limitations in shared library loading (both dlopen on
> POSIX and LoadModuleEx on Windows), it is not generally possible to load
> a modified library after it has changed on disk.
> Therefore, reloading extension modules is of limited use.

Well, it could potentially use a hash suffix in the file name and still
load under the same module name. See right below.


> Multiple modules in one library
> -------------------------------
> 
> To support multiple Python modules in one shared library, the library
> must export appropriate PyModuleExec_<name> or PyModuleCreate_<name> hooks
> for each exported module.
> The modules are loaded using a ModuleSpec with origin set to the name of the
> library file, and name set to the module name.
> 
> Note that this mechanism can currently only be used to *load* such modules,
> not to *find* them.
> 
> XXX: This is an existing issue; either fix it/wait for a fix or provide
> an example of how to load such modules.

I really like that idea. It's essentially an extended inittab mechanism,
also usable for executable single-file distributions (maybe even "python
-m"), non-ASCII module names and "__init__.so" packages that import as an
entire package structure of multiple modules.

Needs some kind of "import module from library" C-API mechanism, though, or
at least an explicitly exported list of modules to import from a shared
library in the right order. I'd rather go for some kind of explicit import
that creates these modules on request.

Stefan



