[Import-SIG] PEP 489: Multi-phase extension module initialization; version 5

Wed May 20 02:22:47 CEST 2015

On Tue, May 19, 2015 at 5:06 AM, Petr Viktorin <encukou at gmail.com> wrote:
> On 05/19/2015 05:51 AM, Nick Coghlan wrote:
>> On 19 May 2015 at 10:07, Eric Snow <ericsnowcurrently at gmail.com> wrote:
>>> On Mon, May 18, 2015 at 8:02 AM, Petr Viktorin <encukou at gmail.com> wrote:
>>>> [snip]
>>>>
>>>> Furthermore, the majority of currently existing extension modules has
>>>> problems with sub-interpreter support and/or interpreter reloading, and,
>>>> while it is possible with the current infrastructure to support these
>>>> features, it is neither easy nor efficient.
>>>> Addressing these issues was the goal of PEP 3121, but many extensions,
>>>> including some in the standard library, took the least-effort approach
>>>> to porting to Python 3, leaving these issues unresolved.
>>>> This PEP keeps backwards compatibility, which should reduce pressure and
>>>> give extension authors adequate time to consider these issues when
>>>> porting.
>>>
>>> So just to be sure I understand, now PyModuleDef.m_slots will
>>> unambiguously indicate whether or not an extension module is
>>> compliant, right?
>>
>> I'm not sure what you mean by "compliant". A non-NULL m_slots will
>> indicate usage of multi-phase initialisation, so it at least indicates
>> *intent* to correctly support subinterpreters et al. Actual delivery
>> on that promise is still a different question :)
>
> Yes, non-NULL m_slots means the module is compliant. If it's not, it's a
> bug in the *module* (i.e. compliance is not *just* a matter of
> setting m_slots).
> This will be explained in the docs.

Perfect.

>
>>>> [snip]
>>>>
>>>> The proposal
>>>> ============
>>>
>>> This section should include an indication of how the loader (and
>>> perhaps finder) will change for builtin, frozen, and extension
>>> modules.  It may help to describe the proposal up front by how the
>>> loader implementation would look if it were somehow implemented in
>>> Python code.  The subsequent sections sometimes indicate where
>>> different things take place, but an explicit outline (as Python code)
>>> would make the entire flow really obvious.  Putting that toward the
>>> beginning of this section would help clearly set the stage for the
>>> rest of the proposal.
>>
>> +1 for a pseudo-code overview of the loader implementation.
>
> OK. Along with a link to PEP 451 code [*], it should make things clearer.
> [*] https://www.python.org/dev/peps/pep-0451/#how-loading-will-work

Sounds good.
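
For reference, here is a rough sketch of the flow being discussed,
loosely adapted from the "How loading will work" outline in PEP 451.
ToyLoader and load() are made-up names for illustration only; the real
logic lives in importlib and in C, and will differ in detail.

```python
import sys
import types
from importlib.machinery import ModuleSpec


class ToyLoader:
    """Hypothetical stand-in for an extension module loader."""

    def create_module(self, spec):
        # Optional creation step; returning None asks the caller to
        # fall back to a default module object.
        return None

    def exec_module(self, module):
        # Execution step: populate the already-created module.
        module.greeting = "hello from " + module.__name__


def load(spec):
    """Simplified two-phase load: create, register, execute."""
    module = spec.loader.create_module(spec)
    if module is None:
        module = types.ModuleType(spec.name)
    # The name comes from the spec, not from the module definition.
    module.__spec__ = spec
    module.__loader__ = spec.loader

    sys.modules[spec.name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        sys.modules.pop(spec.name, None)
        raise
    # Return whatever is registered now, in case execution replaced it.
    return sys.modules[spec.name]


spec = ModuleSpec("toy_ext", ToyLoader())
mod = load(spec)
print(mod.greeting)
```

The point of the sketch is only the ordering: creation happens before
the module is registered, and execution happens after.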

>
>>>> [snip]
>>>> Unknown slot IDs will cause the import to fail with SystemError.
>>>
>>> Was there any consideration made for just ignoring unknown slot IDs?
>>> My gut reaction is that you have it the right way, but I can still
>>> imagine use cases for custom slots that PyModuleDef_Init wouldn't know
>>> about.
>>
>> The "known slots only, all other slot IDs are reserved for future use"
>> slot semantics were copied directly from PyType_FromSpec in PEP 384.
>> Since it's just a numeric slot ID, you'd run a high risk of conflicts
>> if you allowed for custom extensions.
>>
>> If folks want to do more clever things, they'll need to use the create
>> or exec slot to stash them on the module object, rather than storing
>> them in the module definition.
>
> Right, if you need custom behavior, put it in a function and use the
> provided hook. (If you need custom "slots" on PyModuleDef for some
> reason, use a PyModuleDef subclass -- but I can't see where it would be
> helpful.)
> Ignoring unknown slot IDs would mean letting errors go unnoticed.

This is reasonable.  Thanks.
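
To make sure I have the semantics right, here is a small Python model
of "known slots only". The numeric IDs, the exec_def() helper and the
(id, function) pair layout are assumptions of this sketch, not the C
API.

```python
import types

# Illustrative slot IDs; the numeric values are assumptions of this sketch.
PY_MOD_CREATE = 1
PY_MOD_EXEC = 2
KNOWN_SLOTS = {PY_MOD_CREATE, PY_MOD_EXEC}


def exec_def(module, slots):
    """Run each exec slot in order; fail on IDs we do not recognise."""
    for slot_id, func in slots:
        if slot_id not in KNOWN_SLOTS:
            # Mirrors the PEP's rule: unknown IDs are errors, not no-ops.
            raise SystemError("unknown slot ID %d" % slot_id)
        if slot_id == PY_MOD_EXEC:
            func(module)
    return module


def set_version(module):
    module.version = "1.0"


mod = exec_def(types.ModuleType("demo"), [(PY_MOD_EXEC, set_version)])

error = None
try:
    exec_def(types.ModuleType("demo"), [(99, set_version)])
except SystemError as exc:
    error = str(exc)
```

Silently skipping ID 99 here would hide a typo or a version mismatch,
which is the argument for failing loudly.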

>
> (Technicality: PyModuleDef_Init doesn't care about slots;
> PyModule_FromDefAndSpec and PyModule_ExecDef do, and they will raise the
> errors.)
>
>>> When using multi-phase initialization, the *m_name* field of PyModuleDef
>>> will not be used during importing; the module name will be taken from the
>>> ModuleSpec.
>>
>> So m_name will be strictly ignored by PyModuleDef_Init?
>
> Yes. The name is useful for introspection, but the import machinery will
> use the name provided by the ModuleSpec.

Okay.

>
> (Technicality: again, PyModuleDef_Init doesn't touch names at all.
> PyModule_FromDefAndSpec and PyModule_ExecDef do, and they will ignore
> the name from the def.)
>
>>>> The PyModuleDef object must be available for the lifetime of the module
>>>> created from it – usually, it will be declared statically.
>>>
>>> How easily will this be a source of mysterious errors-at-a-distance?
>>
>> It shouldn't be any worse than static type definitions, and normal
>> reference counting semantics should keep it alive regardless.
>
> It's the same as the current behavior (PEP 3121), where a
> PyModuleDef is stored in the module, and if you let it die,
> PyModule_GetState will give you an invalid pointer. It's just that in
> PEP 489, the import machinery itself uses def, so you actually get to
> feel the pain if you deallocate it.
> All in all, this should not be a problem in practice; the PEP specifies
> what'll happen if you go off doing exotic things. (For example, Cython
> might run into this if it tries implementing a reloading scheme we
> talked about earlier in the thread, and even then it shouldn't be a
> major source of mysterious errors.) Normal mortals will be OK.

Thanks for explaining.  I'm less concerned now.

>
>>> [snip]
>>> However, only ModuleType instances support module-specific functionality
>>> such as per-module state.
>>
>> This is a pretty important point.  Presumably this constrains later
>> behavior and precedes all functionality related to per-module state.
>
> Yes. Module objects support more module-like behavior than other
> objects. What you can and cannot use should be clear from the API. I'll
> clarify a bit more what functionality depends on using a PyModule_Type
> (or subclass) instance.
> One thing I see I forgot to add is that execution slots are looked up
> via PyModule_GetDef, so they won't be processed on non-module objects.

Okay.  That makes sense now.

>
> It's a very good idea to use a module subclass rather than a completely
> custom object. The docs will need to strongly recommend this.

Agreed.  And the docs should also be clear on how non-module objects
are basically ignored, slot-wise.

>
>>>> [snip]
>>>> Extension authors are advised to keep Py_mod_create minimal, and in
>>>> particular to not call user code from it.
>>>
>>> This is a pretty important point as well.  We'll need to make sure
>>> this is sufficiently clear in the documentation.  Would it make sense
>>> to provide helpers for common cases, to encourage extension authors to
>>> keep the create function minimal?
>>
>> The main encouragement is to not handcode your extension modules at
>> all, and let something like Cython or SWIG take care of the
>> boilerplate :)
>
> Yes, Cython should be default. For hand-written modules, the common case
> should be not defining create at all.

The docs should be explicit about this.

>
>>>> [snip]
>>>>
>>>> If PyModuleExec replaces the module's entry in sys.modules,
>>>> the new object will be used and returned by importlib machinery.
>>>
>>> Just to be sure, something like "mod = sys.modules[modname]" is done
>>> before each execution slot.  In other words, the result of the
>>> previous execution slot should be used for the next one.
>>
>> That's not the original intent of this paragraph - rather, it is
>> referring to the existing behaviour of the import machinery.
>>
>> However, I agree that now we're allowing the Py_mod_exec slot to be
>> supplied multiple times, we should also be updating the module
>> reference between slot invocations.
>
> No, that won't work. It's possible (via direct calls to the import
> machinery) to load a module without adding it to sys.modules.

What direct calls do you mean?  I would not expect any such mechanism
to work properly with extension modules.

> The behavior should be clear (when you think about it) after I include
> the loader implementation pseudocode.

Okay.
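
For the record, one way this can happen with a pure-Python loader is
sketched below, using importlib.util.module_from_spec. InlineLoader is
a hypothetical loader written for this example; whether extension
modules behave sensibly when loaded this way is exactly the open
question.

```python
import importlib.util
import sys
from importlib.abc import Loader


class InlineLoader(Loader):
    """Hypothetical loader that sets a single attribute on the module."""

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        module.value = 123


spec = importlib.util.spec_from_loader("not_registered_demo", InlineLoader())
module = importlib.util.module_from_spec(spec)  # creation step only
spec.loader.exec_module(module)                 # execution step, by hand

# The module is fully initialised but was never put in sys.modules.
print(module.value, "not_registered_demo" in sys.modules)
```

Any code that re-reads sys.modules[name] between steps would find
nothing in this scenario.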

>
>> I also think the PEP could do with a brief mention of the additional
>> modularity this approach brings at the C level - rather than having to
>> jam everything into one function, an extension module can easily break
>> up its initialisation into multiple steps, and it's technically even
>> possible to share common steps between different modules.
>
> Eh, I think it's better to create one function that calls the parts,
> which was always possible, and works just as well.
> Repeating slots is allowed because it would be an unnecessary bother to
> check for duplicates. It's not a feature to advertise, the PEP just
> specifies that in the weird edge case, the intuitive thing will happen.

Be that as it may, I think it would be a mistake to treat support for
multiple exec slots as a second-class citizen in the design.
Personally I find it an appealing feature.

>
> (I did have a useful future use case for repeated slots, but the current
> PEP allows a better and more obvious solution so I'll not even mention
> it again.)
>
> Still, the steps are processed in a loop from a single function
> (PyModule_ExecDef), and that function operates on a module object -- it
> doesn't know about sys.modules and can't easily check if you replaced
> the module somewhere.

I would consider this approach to be a mistake as well.  The approach
should stay consistent with the semantics of the whole import system,
where sys.modules is checked directly.  Unfortunately, that ship has
already sailed.

>
>>>> (This mirrors the behavior of Python modules. Note that implementing
>>>> Py_mod_create is usually a better solution for the use cases this serves.)
>>>
>>> Could you elaborate?  What are those use cases and why would
>>> Py_mod_create be better?
>>
>> Rather than replacing the implicitly created normal module during
>> Py_mod_exec (which is the only option available to Python modules),
>> PEP 489 lets you define the Py_mod_create slot to override the module
>> object creation directly.
>>
>> Outside conversion of a Python module that manipulates sys.modules to
>> an extension module with Cython, there's no real reason to use the
>> "replacing yourself in sys.modules" option over using Py_mod_create
>> directly.
>
> Yes. The workaround you need to use in Python modules is possible for
> extensions, but there's no reason to use it. I'll try to make it clearer
> that it's an unnecessary workaround.

Thank you.
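
For anyone following along, the Python-module workaround in question
looks roughly like the sketch below. The module name, the _Replacement
class and the helper function are invented for this example; it writes
a throwaway module to a temporary directory and imports it.

```python
import importlib
import os
import sys
import tempfile

# Source of a module that swaps itself out in sys.modules while it runs.
SOURCE = '''
import sys

class _Replacement:
    """Arbitrary object that takes the module's place."""
    answer = 42

sys.modules[__name__] = _Replacement()
'''


def import_self_replacing_module(name="selfreplacing_demo"):
    """Write the module to a temp dir, import it, then clean up."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, name + ".py"), "w") as f:
            f.write(SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # The import system hands back whatever sys.modules holds
            # after execution, i.e. the replacement object.
            return importlib.import_module(name)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(name, None)


result = import_self_replacing_module()
print(type(result).__name__, result.answer)
```

With a create slot, an extension can return the custom object up front
instead of swapping it in after the fact.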

>
>>>> [snip]
>>>>
>>>> Modules that need to work unchanged on older versions of Python should not
>>>> use multi-phase initialization, because the benefits it brings can't be
>>>> back-ported.
>>>
>>> Given your example below, "should not" seems a bit strong to me.  In
>>> fact, what are the objections to encouraging the approach from the
>>> example?
>>
>> Agreed, "should not" is probably too strong here. On the other hand,
>> preserving compatibility with older Python versions in a module that
>> has been updated to rely on multi-phase initialization is likely to be
>> a matter of "graceful degradation", rather than being able to
>> reproduce comparable functionality (which I believe may have been the
>> point Petr was trying to convey).
>
> My point is that if you need graceful degradation, your best bet is to
> stick with single-phase init. Then you'll have one code path that works
> the same on all versions.
> If you *need* the features of multi-phase init, you need to remove
> support for Pythons that don't have it.
> If you need both backwards compatibility and multi-phase init, you
> essentially need to create two modules (with shared contents), and make
> sure they end up in the same state after they're loaded.
>
>> I expect Cython and SWIG may be able to manage that through
>> appropriate use of #ifdef's in the generated code, but doing it by
>> hand is likely to be painful, hence the potential benefits of just
>> sticking with single-phase initialisation for the time being.
>
> Yes, code generators are in a position to create two versions of the
> module, and select one using #ifdef.
>
> The example in the PEP is helpful for other reasons than encouraging
> #ifdef: it shows what needs to change when porting. Think of it as a diff :)

It may be worth being more clear about that.

>
>>>> [snip]
>>>>
>>>> Subinterpreters and Interpreter Reloading
>>>> -----------------------------------------
>>>>
>>>> Extensions using the new initialization scheme are expected to support
>>>> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
>>>
>>> Presumably this support is explicitly and completely defined in the
>>> subsequent sentences.  Is it really just keeping "hidden" module state
>>> encapsulated on the module object?  If not then it may make sense to
>>> enumerate the requirements better for the sake of extension module
>>> authors.
>
> It is explained in the docs, see "Bugs and caveats" here:
> https://docs.python.org/3/c-api/init.html#sub-interpreter-support
> I'll add a link to that page.

Cool.

>
>> I'd actually like to have a better way of doing scenario testing for
>> extension modules (subinterpreters, multiple initialize/finalize
>> cycles, freezing), but I'm not sure this PEP is the best place to
>> define that. Perhaps we could do a PyPI project that was a tox-based
>> test battery for this kind of thing?
>
> I think that's the wrong place to start. Currently, sub-interpreter
> support is buried away in a docs chapter about Python
> initialization/finalization, so a typical extension author won't even
> notice it. We need to first make it *possible* to support
> subinterpreters easily and correctly (so that Cython can do it), and to
> document it prominently in the "writing extensions" part of the docs,
> not only in "extending Python".
> This PEP does part of the first step, and the docs for it (which aren't
> written yet) will do the second step.
> After that, it could make sense to provide a tool for testing this.

There's nothing about the docs that precludes putting testing helpers
up on PyPI though.  However, I'm definitely +1 on improving the docs.

>
>>>> The mechanism is designed to make this easy, but care is still required
>>>> on the part of the extension author.
>>>> No user-defined functions, methods, or instances may leak to different
>>>> interpreters.
>>>> To achieve this, all module-level state should be kept in either the module
>>>> dict, or in the module object's storage reachable by PyModule_GetState.
>>>
>>> Is this programmatically enforceable?
>
> No. (I believe you could even prove this formally.)
>
>>> Is there any mechanism for easily copying module state?
>
> No. This would be impossible to provide in the general case. It's the
> responsibility of your C code.
> That said, if you need to copy module state, chances are your design
> could use some rethinking.
>
>>> How about sharing some state between subinterpreters?
>
> The PyCapsule API was designed for this.

I'm simply thinking in terms of the options we have for a PEP I'm
working on that will facilitate passing objects between
subinterpreters and even possibly sharing some state between them.
Currently it will be practically necessary to exclude extension
modules from any such mechanism.  So I was wondering if there would be
a way to allow extension module authors to define how at least some of
the module's data could be shared between subinterpreters.

>
>>> How much room is there for letting extension module
>>> authors define how their module behaves across multiple interpreters
>>> or across multiple Initialize/Finalize cycles?
>
> Technically, you have all the freedom you want. But if I embed Python
> into my project/library, I'd want multiple sub-interpreters completely
> isolated by default. If I use two libraries that each embed Python into
> my app, I definitely want them isolated.
> So the PEP tries to make it easy to keep multiple interpreters isolated.

As I just noted, I'm looking at making use of subinterpreters for a
different use case where it *does* make sense to effectively share
objects between them.

  [snip]
>>> At the same time, one may argue that reloading modules is not
>>> something to encourage. :)
>>
>> There's a reason import_fresh_module has never made it out of test.support :)
>
> Right. Implementation-wise, it would actually be much easier to support
> reload rather than make it a no-op. But then C module authors would need
> to think about this edge case, which might be tricky to get right, would
> not be likely to get test coverage, and is generally not useful anyway.
>
> If it turns out to be useful, it would be very simple to add an explicit
> reload slot in the future.

Agreed.

  [snip]
>> This section is missing any explanation of the impact on
>> Python/import.c, on the _imp/imp module, and on the 3 finders/loaders
>> in Lib/importlib/_bootstrap[_external].py (builtin/frozen/extension).
>
> I'll add a summary.
>
> The internal _imp module will have backwards incompatible changes --
> functions will be added and removed as necessary. That's what the
> underscore means :)

Be careful with that assumption.  We've had plenty of experiences
where the assumption became unreliable.

> The deprecated imp module will get a backwards compatibility shim for
> anything it imported from _imp that got removed.
> importlib will stay backwards compatible.
>
> Python/import.c and Python/importdl.* will be rewritten entirely.
> See the patches (linked from the PEP) for details.
>

-eric