[Import-SIG] PEP 489: Multi-phase extension module initialization; version 5

Tue May 19 02:07:57 CEST 2015

Thanks for working on this, Petr (et al.).  Sorry I've missed the
previous discussion.  Comments are in-line.

-eric

On Mon, May 18, 2015 at 8:02 AM, Petr Viktorin <encukou at gmail.com> wrote:
> [snip]
>
> Furthermore, the majority of currently existing extension modules have
> problems with sub-interpreter support and/or interpreter reloading, and,
> while it is possible with the current infrastructure to support these
> features, it is neither easy nor efficient.
> Addressing these issues was the goal of PEP 3121, but many extensions,
> including some in the standard library, took the least-effort approach
> to porting to Python 3, leaving these issues unresolved.
> This PEP keeps backwards compatibility, which should reduce pressure and
> give extension authors adequate time to consider these issues when porting.

So just to be sure I understand, now PyModuleDef.m_slots will
unambiguously indicate whether or not an extension module is
compliant, right?

> [snip]
>
> The proposal
> ============

This section should include an indication of how the loader (and
perhaps finder) will change for builtin, frozen, and extension
modules.  It may help to describe the proposal up front by how the
loader implementation would look if it were somehow implemented in
Python code.  The subsequent sections sometimes indicate where
different things take place, but an explicit outline (as Python code)
would make the entire flow really obvious.  Putting that toward the
beginning of this section would help clearly set the stage for the
rest of the proposal.
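
To illustrate the kind of outline I mean, here is a rough pure-Python
model of the two phases.  It is only a sketch under my own assumptions:
ModuleDef, PY_MOD_CREATE, PY_MOD_EXEC, and load() are made-up stand-ins
rather than the proposed C API, and raising ImportError when an exec
function returns non-zero is my guess at how a failure would surface.

```python
import importlib.machinery
import sys
import types

# Hypothetical stand-ins for the C-level slot IDs; the values are arbitrary.
PY_MOD_CREATE = 1
PY_MOD_EXEC = 2


class ModuleDef:
    """Toy model of PyModuleDef: just a list of (slot_id, function) pairs."""

    def __init__(self, slots):
        self.slots = slots


def create_module(spec, moddef):
    """Phase 1: build the module object, using a create slot if present."""
    create = None
    for slot_id, func in moddef.slots:
        if slot_id == PY_MOD_CREATE:
            create = func
        elif slot_id != PY_MOD_EXEC:
            # Per the PEP text, unknown slot IDs make the import fail.
            raise SystemError(f"unknown slot ID {slot_id}")
    if create is not None:
        return create(spec, moddef)
    # The name is taken from the spec, not from the def's m_name.
    return types.ModuleType(spec.name)


def exec_module(module, moddef):
    """Phase 2: run every exec slot, in order, on the module."""
    for slot_id, func in moddef.slots:
        if slot_id == PY_MOD_EXEC and func(module) != 0:
            raise ImportError(f"execution of {module.__name__} failed")


def load(spec, moddef):
    """Create, register, and execute; undo the registration on failure."""
    module = create_module(spec, moddef)
    sys.modules[spec.name] = module
    try:
        exec_module(module, moddef)
    except BaseException:
        sys.modules.pop(spec.name, None)
        raise
    return sys.modules[spec.name]


# Usage, mirroring the "spam" example from the PEP.
def spam_exec(module):
    module.food = "spam"
    return 0


spec = importlib.machinery.ModuleSpec("spam", None)
mod = load(spec, ModuleDef([(PY_MOD_EXEC, spam_exec)]))
```

An outline at roughly this level of detail, but reflecting what the
loaders would really do, is what I would like to see near the top of
this section.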

> [snip]
> Unknown slot IDs will cause the import to fail with SystemError.

Was any consideration given to just ignoring unknown slot IDs?
My gut reaction is that you have it the right way, but I can still
imagine use cases for custom slots that PyModuleDef_Init wouldn't know
about.

>
> When using multi-phase initialization, the *m_name* field of PyModuleDef
> will not be used during importing; the module name will be taken from the
> ModuleSpec.

So m_name will be strictly ignored by PyModuleDef_Init?

>
> To prevent crashes when the module is loaded in older versions of Python,
> the PyModuleDef object must be initialized using the newly added
> PyModuleDef_Init function.
> For example, an extension module "example" would be exported as::
>
>     static PyModuleDef example_def = {...};
>
>     PyMODINIT_FUNC
>     PyInit_example(void)
>     {
>         return PyModuleDef_Init(&example_def);
>     }

This example is helpful. :)

>
> The PyModuleDef object must be available for the lifetime of the module
> created from it – usually, it will be declared statically.

How easily will this be a source of mysterious errors-at-a-distance?

> [snip]
> However, only ModuleType instances support module-specific functionality
> such as per-module state.

This is a pretty important point.  Presumably this constrains later
behavior and precedes all functionality related to per-module state.

> [snip]
> Extension authors are advised to keep Py_mod_create minimal, and in
> particular to not call user code from it.

This is a pretty important point as well.  We'll need to make sure
this is sufficiently clear in the documentation.  Would it make sense
to provide helpers for common cases, to encourage extension authors to
keep the create function minimal?

> [snip]
>
> If PyModuleExec replaces the module's entry in sys.modules,
> the new object will be used and returned by importlib machinery.

Just to be sure: is something like "mod = sys.modules[modname]" done
before each execution slot?  In other words, the result of the
previous execution slot should be used for the next one.
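
Concretely, the behavior I am asking about would look something like
the sketch below.  This is my assumption about the semantics, not
something the PEP text spells out; run_exec_slots, replace_self,
add_flag, and the "swap_demo" module name are all made up for the
illustration.

```python
import sys
import types


def run_exec_slots(name, exec_funcs):
    """Call each exec function on whatever sys.modules[name] currently is."""
    for func in exec_funcs:
        module = sys.modules[name]  # re-fetch, in case a slot swapped it
        func(module)
    return sys.modules[name]


# Hypothetical exec functions for a module named "swap_demo".
def replace_self(module):
    # Replace the module's entry in sys.modules with a different object.
    replacement = types.SimpleNamespace(origin=module.__name__)
    sys.modules["swap_demo"] = replacement


def add_flag(module):
    # Runs second; receives the replacement, not the original module.
    module.flag = True


sys.modules["swap_demo"] = types.ModuleType("swap_demo")
result = run_exec_slots("swap_demo", [replace_self, add_flag])
del sys.modules["swap_demo"]  # tidy up after the demonstration
```

Here the second function sees the replacement object, and that same
object is what gets returned at the end.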

> (This mirrors the behavior of Python modules. Note that implementing
> Py_mod_create is usually a better solution for the use cases this serves.)

Could you elaborate?  What are those use cases and why would
Py_mod_create be better?

> [snip]
>
> Modules that need to work unchanged on older versions of Python should not
> use multi-phase initialization, because the benefits it brings can't be
> back-ported.

Given your example below, "should not" seems a bit strong to me.  In
fact, what are the objections to encouraging the approach from the
example?

> Nevertheless, here is an example of a module that supports multi-phase
> initialization, and falls back to single-phase when compiled for an older
> version of CPython::
>
>     #include <Python.h>
>
>     static int spam_exec(PyObject *module) {
>         PyModule_AddStringConstant(module, "food", "spam");
>         return 0;
>     }
>
>     #ifdef Py_mod_exec
>     static PyModuleDef_Slot spam_slots[] = {
>         {Py_mod_exec, spam_exec},
>         {0, NULL}
>     };
>     #endif
>
>     static PyModuleDef spam_def = {
>         PyModuleDef_HEAD_INIT,                      /* m_base */
>         "spam",                                     /* m_name */
>         PyDoc_STR("Utilities for cooking spam"),    /* m_doc */
>         0,                                          /* m_size */
>         NULL,                                       /* m_methods */
>     #ifdef Py_mod_exec
>         spam_slots,                                 /* m_slots */
>     #else
>         NULL,
>     #endif
>         NULL,                                       /* m_traverse */
>         NULL,                                       /* m_clear */
>         NULL,                                       /* m_free */
>     };
>
>     PyMODINIT_FUNC
>     PyInit_spam(void) {
>     #ifdef Py_mod_exec
>         return PyModuleDef_Init(&spam_def);
>     #else
>         PyObject *module;
>         module = PyModule_Create(&spam_def);
>         if (module == NULL) return NULL;
>         if (spam_exec(module) != 0) {
>             Py_DECREF(module);
>             return NULL;
>         }
>         return module;
>     #endif
>     }
>

This example is really helpful!

> [snip]
>
> Subinterpreters and Interpreter Reloading
> -----------------------------------------
>
> Extensions using the new initialization scheme are expected to support
> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.

Presumably this support is explicitly and completely defined in the
subsequent sentences.  Is it really just keeping "hidden" module state
encapsulated on the module object?  If not then it may make sense to
enumerate the requirements better for the sake of extension module
authors.

> The mechanism is designed to make this easy, but care is still required
> on the part of the extension author.
> No user-defined functions, methods, or instances may leak to different
> interpreters.
> To achieve this, all module-level state should be kept in either the module
> dict, or in the module object's storage reachable by PyModule_GetState.

Is this programmatically enforceable?  Is there any mechanism for
easily copying module state?  How about sharing some state between
subinterpreters?  How much room is there for letting extension module
authors define how their module behaves across multiple interpreters
or across multiple Initialize/Finalize cycles?

> A simple rule of thumb is: Do not define any static data, except
> built-in types with no mutable or user-settable class attributes.

This is another one of those points that needs to be crystal clear in the docs.

> As a rule of thumb, modules that rely on PyState_FindModule are, at the
> moment, not good candidates for porting to the new mechanism.

Are there any plans for a follow-up effort to help with this case?

> [snip]
>
> Module Reloading
> ----------------
>
> Reloading an extension module using importlib.reload() will continue to
> have no effect, except re-setting import-related attributes.
>
> Due to limitations in shared library loading (both dlopen on POSIX and
> LoadLibraryEx on Windows), it is not generally possible to load
> a modified library after it has changed on disk.
>
> Use cases for reloading other than trying out a new version of the module
> are too rare to require all module authors to keep reloading in mind.
> If reload-like functionality is needed, authors can export a dedicated
> function for it.

Keep in mind the semantics of reload for pure Python modules.  The
module is executed into the existing namespace, overwriting the loaded
namespace but leaving non-colliding attributes alone.  While the
semantics for reloading an extension/builtin/frozen module are
currently basic (i.e. a no-op), there may well be room to support
reload behavior that mirrors that of pure Python modules without
needing to reload an SO file.  I would expect either the behavior of
exec to get repeated (tricky due to "hidden" module state?) or for
there to be a "reload" slot that would mirror Py_mod_exec.

At the same time, one may argue that reloading modules is not
something to encourage. :)
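
For reference, the pure-Python semantics I am describing can be seen
with a throwaway module.  The file and module name ("reload_demo") are
invented for the demonstration; the rest is just importlib.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Write a one-line module to a temporary directory and import it.
    with open(os.path.join(tmp, "reload_demo.py"), "w") as f:
        f.write("value = 'from source'\n")
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        reload_demo = importlib.import_module("reload_demo")

        # Overwrite one attribute and add another that the source lacks.
        reload_demo.value = "patched"
        reload_demo.extra = "added later"

        # Reloading re-executes the source in the existing namespace.
        same = importlib.reload(reload_demo)
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("reload_demo", None)
```

After the reload, "same" is the original module object, "value" is back
to what the source assigns, and "extra" is still there because nothing
in the source collides with it.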

>
>
> Multiple modules in one library
> -------------------------------
>
> To support multiple Python modules in one shared library, the library can
> export additional PyInit* symbols besides the one that corresponds
> to the library's filename.
>
> Note that this mechanism can currently only be used to *load* extra modules,
> but not to *find* them.

What do you mean by "currently"?

It may also be worth tying the above statement to the following
text, since the following appears to be an explanation of how to
address the "finder" caveat.

>
> Given the filesystem location of a shared library and a module name,
> a module may be loaded with::
>
>     import importlib.machinery
>     import importlib.util
>
>     def load_extension(name, path):  # illustrative helper name
>         loader = importlib.machinery.ExtensionFileLoader(name, path)
>         spec = importlib.util.spec_from_loader(name, loader)
>         module = importlib.util.module_from_spec(spec)
>         loader.exec_module(module)
>         return module
>
> On platforms that support symbolic links, these may be used to install one
> library under multiple names, exposing all exported modules to normal
> import machinery.
>
>
> Testing and initial implementations
> -----------------------------------
>
> For testing, a new built-in module ``_testmultiphase`` will be created.
> The library will export several additional modules using the mechanism
> described in "Multiple modules in one library".
>
> The ``_testcapi`` module will be unchanged, and will use single-phase
> initialization indefinitely (or until it is no longer supported).
>
> The ``array`` and ``xx*`` modules will be converted to use multi-phase
> initialization as part of the initial implementation.

What do you mean by "initial implementation"?  Will it be done
differently in a later implementation?

>
>
> Summary of API Changes and Additions
> ------------------------------------
>
> New functions:
>
> * PyModule_FromDefAndSpec (macro)
> * PyModule_FromDefAndSpec2
> * PyModule_ExecDef
> * PyModule_SetDocString
> * PyModule_AddFunctions
> * PyModuleDef_Init
>
> New macros:
>
> * Py_mod_create
> * Py_mod_exec
>
> New types:
>
> * PyModuleDef_Type will be exposed
>
> New structures:
>
> * PyModuleDef_Slot
>
> PyModuleDef.m_reload changes to PyModuleDef.m_slots.

This section is missing any explanation of the impact on
Python/import.c, on the _imp/imp module, and on the 3 finders/loaders
in Lib/importlib/_bootstrap[_external].py (builtin/frozen/extension).

>
>
> Possible Future Extensions
> ==========================
>
> The slots mechanism, inspired by PyType_Slot from PEP 384,
> allows later extensions.
>
> Some extension modules export many constants; for example _ssl has
> a long list of calls in the form::
>
>     PyModule_AddIntConstant(m, "SSL_ERROR_ZERO_RETURN",
>                             PY_SSL_ERROR_ZERO_RETURN);
>
> Converting this to a declarative list, similar to PyMethodDef,
> would reduce boilerplate, and provide free error-checking which
> is often missing.

Great idea, including as it applies to other constants and types.

>
> String constants and types can be handled similarly.
> (Note that non-default bases for types cannot be portably specified
> statically; this case would need a Py_mod_exec function that runs
> before the slots are added. The free error-checking would still be
> beneficial, though.)

This implies to me that now is the time to ensure that this PEP
appropriately accommodates that need.  It would be unfortunate if we
had to later hack in some extra API to accommodate a use case we
already know about.  Better if we made sure the currently proposed
changes could accommodate the need, even if the implementation of that
part were not part of this PEP.

>
> Another possibility is providing a "main" function that would be run
> when the module is given to Python's -m switch.
> For this to work, the runpy module will need to be modified to take
> advantage of ModuleSpec-based loading introduced in PEP 451.

I'll point out that the pure-Python equivalent has been proposed on a
number of occasions and been rejected every time.  However, in the
case of extension modules it is more justifiable.  If extension
modules gain such a mechanism then it may be a justification for doing
something similar in Python.

> Also, it will be necessary to add a mechanism for setting up a module
> according to slots it wasn't originally defined with.

What does this mean?

>
>
> Implementation
> ==============
>
> A work-in-progress implementation is available in a GitHub repository
> [#gh-repo]_; a patchset is at [#gh-patch]_.

I'll have to take a look.

> [snip]