[Import-SIG] PEP 489: Multi-phase extension module initialization; version 5

Wed May 20 01:56:34 CEST 2015

On Mon, May 18, 2015 at 9:51 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 19 May 2015 at 10:07, Eric Snow <ericsnowcurrently at gmail.com> wrote:
  [snip]
>> Was there any consideration made for just ignoring unknown slot IDs?
>> My gut reaction is that you have it the right way, but I can still
>> imagine use cases for custom slots that PyModuleDef_Init wouldn't know
>> about.
>
> The "known slots only, all other slot IDs are reserved for future use"
> slot semantics were copied directly from PyType_FromSpec in PEP 384.
> Since it's just a numeric slot ID, you'd run a high risk of conflicts
> if you allowed for custom extensions.
>
> If folks want to do more clever things, they'll need to use the create
> or exec slot to stash them on the module object, rather than storing
> them in the module definition.

Makes sense.  This does remind me of something I wanted to ask.  Would
it make sense to leverage ModuleSpec.loader_state?  If I recall
correctly, we added loader_state with extension modules in mind.
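
For illustration, here is a minimal sketch of how a loader could pass data through ModuleSpec.loader_state and stash it on the module object during the exec step. It uses a pure-Python loader as a stand-in for an extension module; StateStashingLoader, the module name, and the "settings" attribute are hypothetical names invented for this example, not anything proposed in the PEP.

```python
import importlib.abc
import importlib.machinery
import importlib.util


class StateStashingLoader(importlib.abc.Loader):
    """Hypothetical loader used only for this illustration."""

    def create_module(self, spec):
        # Returning None asks the import system for a default module object.
        return None

    def exec_module(self, module):
        # Copy loader-specific data from the spec onto the module itself,
        # rather than keeping it in any shared, definition-level storage.
        module.settings = dict(module.__spec__.loader_state)


spec = importlib.machinery.ModuleSpec(
    "loader_state_demo",
    StateStashingLoader(),
    loader_state={"mode": "fast"},
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```

Whether an extension module's create or exec slot would want to read loader_state this way is exactly the open question above; the sketch only shows that the plumbing exists on the Python side.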

>
>>> The PyModuleDef object must be available for the lifetime of the module
>>> created from it – usually, it will be declared statically.
>>
>> How easily will this be a source of mysterious errors-at-a-distance?
>
> It shouldn't be any worse than static type definitions, and normal
> reference counting semantics should keep it alive regardless.

Got it.

>
>>> [snip]
>>> Extension authors are advised to keep Py_mod_create minimal, and in
>>> particular to not call user code from it.
>>
>> This is a pretty important point as well.  We'll need to make sure
>> this is sufficiently clear in the documentation.  Would it make sense
>> to provide helpers for common cases, to encourage extension authors to
>> keep the create function minimal?
>
> The main encouragement is to not handcode your extension modules at
> all, and let something like Cython or SWIG take care of the
> boilerplate :)

Hey, I tried to make something happen over on python-ideas! :)  Some
folks just don't want to go far enough.

  [snip]
>> Could you elaborate?  What are those use cases and why would
>> Py_mod_create be better?
>
> Rather than replacing the implicitly created normal module during
> Py_mod_exec (which is the only option available to Python modules),
> PEP 489 lets you define the Py_mod_create slot to override the module
> object creation directly.
>
> Outside conversion of a Python module that manipulates sys.modules to
> an extension module with Cython, there's no real reason to use the
> "replacing yourself in sys.modules" option over using Py_mod_create
> directly.

Ah, I got it.  We just want to ensure we match Python module behavior,
where there is no module-defined create step.  This would seem even
more important with tools like Cython that convert Python modules into
C extensions, even if the appropriate solution for a C extension
module would be a different approach (e.g. Py_mod_create).
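
As a rough sketch of the "replacing yourself in sys.modules" behavior for a pure-Python module, the following writes a throwaway module to a temporary directory and imports it. The module name selfreplacing_demo and the _Replacement class are made up for the example.

```python
import importlib
import pathlib
import sys
import tempfile

# Source of a module that swaps itself out of sys.modules while executing.
SOURCE = """\
import sys
import types

class _Replacement(types.ModuleType):
    value = "replaced"

sys.modules[__name__] = _Replacement(__name__)
"""

with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "selfreplacing_demo.py").write_text(SOURCE)
    sys.path.insert(0, tmp)
    try:
        # The import system hands back whatever is in sys.modules after
        # execution, so the caller receives the replacement object.
        mod = importlib.import_module("selfreplacing_demo")
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("selfreplacing_demo", None)
```

A tool that converts such a module to C would need the exec step to preserve this, which is the compatibility concern here; a hand-written extension could instead return the custom object from its create step.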

  [snip]
>> Given your example below, "should not" seems a bit strong to me.  In
>> fact, what are the objections to encouraging the approach from the
>> example?
>
> Agreed, "should not" is probably too strong here. On the other hand,
> preserving compatibility with older Python versions in a module that
> has been updated to rely on multi-phase initialization is likely to be
> a matter of "graceful degradation", rather than being able to
> reproduce comparable functionality (which I believe may have been the
> point Petr was trying to convey).

Understood.  This section could stand to be clarified then.

>
> I expect Cython and SWIG may be able to manage that through
> appropriate use of #ifdef's in the generated code, but doing it by
> hand is likely to be painful, hence the potential benefits of just
> sticking with single-phase initialisation for the time being.

Hmm.  The example made it look relatively straightforward.
Regardless, it's not a big deal.

>
>>> [snip]
>>>
>>> Subinterpreters and Interpreter Reloading
>>> -----------------------------------------
>>>
>>> Extensions using the new initialization scheme are expected to support
>>> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
>>
>> Presumably this support is explicitly and completely defined in the
>> subsequent sentences.  Is it really just keeping "hidden" module state
>> encapsulated on the module object?  If not then it may make sense to
>> enumerate the requirements better for the sake of extension module
>> authors.
>
> I'd actually like to have a better way of doing scenario testing for
> extension modules (subinterpreters, multiple initialize/finalize
> cycles, freezing), but I'm not sure this PEP is the best place to
> define that. Perhaps we could do a PyPI project that was a tox-based
> test battery for this kind of thing?

Interesting idea.  I think that a lot of folks would find that useful.
It feels a bit like some of the work Dave Malcolm did with validating
extension modules.

>
>>> The mechanism is designed to make this easy, but care is still required
>>> on the part of the extension author.
>>> No user-defined functions, methods, or instances may leak to different
>>> interpreters.
>>> To achieve this, all module-level state should be kept in either the module
>>> dict, or in the module object's storage reachable by PyModule_GetState.
>>
>> Is this programmatically enforceable?  Is there any mechanism for
>> easily copying module state?  How about sharing some state between
>> subinterpreters?  How much room is there for letting extension module
>> authors define how their module behaves across multiple interpreters
>> or across multiple Initialize/Finalize cycles?
>
> It's not programmatically enforceable, hence the idea above of finding
> a way to make it easier for people to test their extension modules are
> importable across multiple Python versions and deployment scenarios.

That's what I figured.

>
>>> As a rule of thumb, modules that rely on PyState_FindModule are, at the
>>> moment, not good candidates for porting to the new mechanism.
>>
>> Are there any plans for a follow-up effort to help with this case?
>
> The problem here is that the PEP 3121 module state approach provides
> storage on a *per-interpreter* basis, that is then shared amongst all
> module instances created from a given module definition.

You mean a form of interpreter-local storage?  Also, the module
definition is effectively global, right?

>
> This means that when _PyImport_FindExtensionObject (see
> https://hg.python.org/cpython/file/fc2eed9fc2d0/Python/import.c#l518)
> reinitialises an extension module, the state is shared between the two
> instances. When PEP 3121 was written, this was not seen as a problem,
> since the expectation was that the behaviour would only be triggered
> by multiple interpreter level initialize/finalize cycles.
>
> One key scenario we missed at the time was "deleting an extension
> module from sys.modules and importing it a second time, while
> retaining a local reference for later restoration". Under PEP 3121,
> the two instances collide on their state storage, as we have two
> simultaneously existing module objects created in the same interpreter
> from the same module definition. PEP 489 would inherit that same
> problem if you tried to use it with the PyState_* APIs, so it simply
> doesn't allow them at all. (Earlier versions of the PEP allowed it
> with an "EXPORT_SINGLETON" slot that would disallow reimporting
> entirely, which we took out in favour of "just keep using the existing
> initialisation model in those cases for the time being")

That seems reasonable.

>
> For pure Python code, we don't have this problem, since the
> interpreter takes care of providing a properly scoped globals()
> reference to *all* functions defined in that module, regardless of
> whether they're module level functions or method definitions on a
> class. At the C level, we don't have that, as only module level
> functions get a module reference passed in - methods only get a
> reference to their class instance, without a reference to the module
> globals, and delayed callbacks can be a problem as well.

Yuck.  Is this something we could fix?  Is __module__ not set on all functions?

>
> The best improved API we could likely offer at this point is a
> convenience API for looking up a module in *sys.modules* based on a
> PyModuleDef instance, and updating PEP 489 to write the as-imported
> module name into the returned PyModuleDef structure. That's probably
> not a bad way to go, given that PEP 489 currently *ignores* the m_name
> slot - flipping it around to be a *writable* slot would be a way to
> let extension modules know dynamically how to look themselves up in
> sys.modules.

That sounds useful.

>
> The new lookup API would then be the moral equivalent of Python code
> doing "mod = sys.modules[__name__]". With this approach, actively
> *using* multiple references to a given module at the same time would
> still break (since you'll always get the module currently in
> sys.modules, even if that isn't the one you expected), but the
> "save-and-restore" model needed for certain kinds of testing and
> potentially other scenarios would work correctly.

Right, though I would expect there to be trouble if the replacement
module didn't support the module state API in the expected way.
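
To make the trade-off concrete, here is a small sketch of the "mod = sys.modules[__name__]" style of lookup under a save-and-restore sequence. It uses plain module objects rather than a real extension module; the name lookup_demo and the make_instance helper are invented for the example.

```python
import sys
import types

NAME = "lookup_demo"


def make_instance(label):
    """Build a module whose function finds 'itself' through sys.modules."""
    module = types.ModuleType(NAME)
    module.label = label
    # The moral equivalent of sys.modules[__name__] from inside the module.
    module.current_label = lambda: sys.modules[NAME].label
    return module


original = make_instance("original")
sys.modules[NAME] = original

# Save the module, then import a "fresh" copy under the same name.
saved = sys.modules.pop(NAME)
fresh = make_instance("fresh")
sys.modules[NAME] = fresh

# Using the saved reference now reaches the fresh module's state.
during = original.current_label()

# Restoring the saved module makes the lookup behave as expected again.
sys.modules[NAME] = saved
after = original.current_label()

del sys.modules[NAME]
```

The value seen while the fresh copy is installed comes from the wrong instance, but once the original is restored the lookup is correct again, matching the save-and-restore case described above.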

>
>>> Module Reloading
>>> ----------------
>>>
>>> Reloading an extension module using importlib.reload() will continue to
>>> have no effect, except re-setting import-related attributes.
>>>
>>> Due to limitations in shared library loading (both dlopen on POSIX and
>>> LoadLibraryEx on Windows), it is not generally possible to load
>>> a modified library after it has changed on disk.
>>>
>>> Use cases for reloading other than trying out a new version of the module
>>> are too rare to require all module authors to keep reloading in mind.
>>> If reload-like functionality is needed, authors can export a dedicated
>>> function for it.
>>
>> Keep in mind the semantics of reload for pure Python modules.  The
>> module is executed into the existing namespace, overwriting the loaded
>> namespace but leaving non-colliding attributes alone.  While the
>> semantics for reloading an extension/builtin/frozen module are
>> currently basic (i.e. a no-op), there may well be room to support
>> reload behavior that mirrors that of pure Python modules without
>> needing to reload an SO file.  I would expect either the behavior of
>> exec to get repeated (tricky due to "hidden" module state?) or for
>> there to be a "reload" slot that would mirror Py_mod_exec.
>
> We considered this, and decided it was fairly pointless, since you
> can't modify the extension module code. The one case I see where it
> potentially makes sense is a "transitive reload", where the extension
> module retrieves and caches attributes from another pure Python module
> at import time, and that extension module has been reloaded.

The reload approach specified in the PEP seems satisfactory at this point.
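
For reference, the pure-Python reload semantics mentioned above (re-executing into the existing namespace while leaving non-colliding attributes alone) can be seen with a throwaway module. The module name reload_demo and the attribute names are made up for this sketch.

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "reload_demo.py").write_text("x = 1\n")
    sys.path.insert(0, tmp)
    try:
        mod = importlib.import_module("reload_demo")

        # Change an attribute the module defines, and add one it does not.
        mod.x = 99
        mod.extra = "kept"

        # reload() re-executes the source in the same module object.
        reloaded = importlib.reload(mod)

        same_object = reloaded is mod
        x_after = mod.x          # reset by re-execution of "x = 1"
        extra_after = mod.extra  # untouched, since the source never sets it
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("reload_demo", None)
```

A "reload" slot mirroring Py_mod_exec would presumably aim for the same in-place behavior, with the added wrinkle of any hidden module state.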

>
> It may also make a difference in the context of utilities like
> https://docs.python.org/3/library/test.html#test.support.import_fresh_module,
> where we manipulate the import system state to control how conditional
> imports are handled.
>
>> At the same time, one may argue that reloading modules is not
>> something to encourage. :)
>
> There's a reason import_fresh_module has never made it out of test.support :)
>
>>> Multiple modules in one library
>>> -------------------------------
>>>
>>> To support multiple Python modules in one shared library, the library can
>>> export additional PyInit* symbols besides the one that corresponds
>>> to the library's filename.
>>>
>>> Note that this mechanism can currently only be used to *load* extra modules,
>>> but not to *find* them.
>>
>> What do you mean by "currently"?
>
> It's a limitation of the way the existing finders work, rather than an
> inherent limitation of the import system as a whole.

Ah.  It sounded like the PEP was leading to some future solution to
resolve that.

>
>> It may also be worth tying the above statement with the following
>> text, since the following appears to be an explanation of how to
>> address the "finder" caveat.
>
> Agreed that this could be clearer.
>
>>> Testing and initial implementations
>>> -----------------------------------
>>>
>>> For testing, a new built-in module ``_testmultiphase`` will be created.
>>> The library will export several additional modules using the mechanism
>>> described in "Multiple modules in one library".
>>>
>>> The ``_testcapi`` module will be unchanged, and will use single-phase
>>> initialization indefinitely (or until it is no longer supported).
>>>
>>> The ``array`` and ``xx*`` modules will be converted to use multi-phase
>>> initialization as part of the initial implementation.
>>
>> What do you mean by "initial implementation"?  Will it be done
>> differently in a later implementation?
>
> These modules will be converted in the reference implementation, other
> modules won't be.

That's what I thought.  The use of the word "initial" threw me off.

>
>>> String constants and types can be handled similarly.
>>> (Note that non-default bases for types cannot be portably specified
>>> statically; this case would need a Py_mod_exec function that runs
>>> before the slots are added. The free error-checking would still be
>>> beneficial, though.)
>>
>> This implies to me that now is the time to ensure that this PEP
>> appropriately accommodates that need.  It would be unfortunate if we
>> had to later hack in some extra API to accommodate a use case we
>> already know about.  Better if we made sure the currently proposed
>> changes could accommodate the need, even if the implementation of that
>> part were not part of this PEP.
>
> This would be a new kind of execution slot, so the PEP already
> accommodates these possible future extensions.

Sounds good.  The explanation made it sound like a mechanism would be
required that could not be handled via a slot.

>
>>> Another possibility is providing a "main" function that would be run
>>> when the module is given to Python's -m switch.
>>> For this to work, the runpy module will need to be modified to take
>>> advantage of ModuleSpec-based loading introduced in PEP 451.
>>
>> I'll point out that the pure-Python equivalent has been proposed on a
>> number of occasions and been rejected every time.  However, in the
>> case of extension modules it is more justifiable.  If extension
>> modules gain such a mechanism then it may be a justification for doing
>> something similar in Python.
>>
>>> Also, it will be necessary to add a mechanism for setting up a module
>>> according to slots it wasn't originally defined with.
>>
>> What does this mean?
>
> When you use the -m switch, you always run in the builtin __main__
> module namespace, and runpy fiddles with __main__.__spec__ to match
> the details of the module passed to the switch. That's not currently a
> trick we can manage when the "thing to run" is an extension module.

I see now.
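
A small sketch of what runpy does for a pure-Python module may help here. It calls runpy.run_module with run_name="__main__" on a throwaway module (runpy_demo, a made-up name) and records what the code sees. Note this runs in a fresh namespace rather than the real __main__ module, so it only approximates the -m switch.

```python
import pathlib
import runpy
import sys
import tempfile

# The module records the name and spec it observes while running.
SOURCE = """\
seen_name = __name__
seen_spec_name = __spec__.name
"""

with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "runpy_demo.py").write_text(SOURCE)
    sys.path.insert(0, tmp)
    try:
        # Returns the globals dictionary produced by executing the module.
        namespace = runpy.run_module("runpy_demo", run_name="__main__")
    finally:
        sys.path.remove(tmp)
```

The code runs with __name__ set to "__main__" while __spec__ still describes the original module. This works because runpy can get a code object and execute it in a namespace of its choosing, which is the part that has no obvious counterpart for an extension module.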

-eric