[Import-SIG] PEP 489: Multi-phase extension module initialization; version 5

Wed May 20 12:55:37 CEST 2015

On 05/20/2015 02:22 AM, Eric Snow wrote:
> On Tue, May 19, 2015 at 5:06 AM, Petr Viktorin <encukou at gmail.com> wrote:
>> On 05/19/2015 05:51 AM, Nick Coghlan wrote:
>>> On 19 May 2015 at 10:07, Eric Snow <ericsnowcurrently at gmail.com> wrote:
>>>> On Mon, May 18, 2015 at 8:02 AM, Petr Viktorin <encukou at gmail.com> wrote:

[snip]
>>>>>
>>>>> If PyModuleExec replaces the module's entry in sys.modules,
>>>>> the new object will be used and returned by importlib machinery.
>>>>
>>>> Just to be sure, something like "mod = sys.modules[modname]" is done
>>>> before each execution slot.  In other words, the result of the
>>>> previous execution slot should be used for the next one.
>>>
>>> That's not the original intent of this paragraph - rather, it is
>>> referring to the existing behaviour of the import machinery.
>>>
>>> However, I agree that now we're allowing the Py_mod_exec slot to be
>>> supplied multiple times, we should also be updating the module
>>> reference between slot invocations.
>>
>> No, that won't work. It's possible (via direct calls to the import
>> machinery) to load a module without adding it to sys.modules.
> 
> What direct calls do you mean?  I would not expect any such mechanism
> to work properly with extension modules.

I mean reimplementing the loading steps from
<https://www.python.org/dev/peps/pep-0451/#how-loading-will-work>
without the sys.modules parts.
The point is that exec_module doesn't a priori depend on the module
being in sys.modules, which I think is a good thing.
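
As a rough sketch of what such a direct call could look like (this is not
from the PEP; DemoLoader, load_without_sys_modules and the module name are
made up for illustration, and importlib.util.module_from_spec is assumed
to be available, i.e. Python 3.5+):

```python
import importlib.abc
import importlib.util
import sys


class DemoLoader(importlib.abc.Loader):
    """Hypothetical loader, invented for this illustration."""

    def create_module(self, spec):
        # Returning None asks the import machinery for a default module.
        return None

    def exec_module(self, module):
        # Only touches the module object it was handed, never sys.modules.
        module.answer = 42


def load_without_sys_modules(name, loader):
    """Follow the PEP 451 loading steps, minus the sys.modules bookkeeping."""
    spec = importlib.util.spec_from_loader(name, loader)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


demo = load_without_sys_modules("demo_not_registered", DemoLoader())
```

The module is fully initialized even though it never appears in
sys.modules, so anything that insists on looking it up there in the
middle of execution would fail for this kind of caller.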

>> The behavior should be clear (when you think about it) after I include
>> the loader implementation pseudocode.
> 
> Okay.
> 
>>
>>> I also think the PEP could do with a brief mention of the additional
>>> modularity this approach brings at the C level - rather than having to
>>> jam everything into one function, an extension module can easily break
>>> up its initialisation into multiple steps, and it's technically even
>>> possible to share common steps between different modules.
>>
>> Eh, I think it's better to create one function that calls the parts,
>> which was always possible, and works just as well.
>> Repeating slots is allowed because it would be an unnecessary bother to
>> check for duplicates. It's not a feature to advertise, the PEP just
>> specifies that in the weird edge case, the intuitive thing will happen.
> 
> Be that as it may, I think it would be a mistake to treat support for
> multiple exec slots as a second-class citizen in the design.
> Personally I find it an appealing feature.

It's there, but I won't advertise it too much in the docs.

>> (I did have a useful future use case for repeated slots, but the current
>> PEP allows a better and more obvious solution so I'll not even mention
>> it again.)
>>
>> Still, the steps are processed in a loop from a single function
>> (PyModule_ExecDef), and that function operates on a module object -- it
>> doesn't know about sys.modules and can't easily check if you replaced
>> the module somewhere.
> 
> I would consider this approach to be a mistake as well.  The approach
> should stay consistent with the semantics of the whole import system,
> where sys.modules is checked directly.  Unfortunately, that ship has
> already sailed.

It's the loader that checks sys.modules, *after* exec_module is called.
No other implementation of exec_module checks sys.modules in the middle
of its operation. So I think the semantics are consistent.
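
To illustrate the ordering, here is a simplified Python model (an
assumption-laden sketch, not the actual C implementation: SlotLoader,
slot_replace, slot_record and load are hypothetical names, and the
"slots" are plain Python callables standing in for Py_mod_exec
functions). Every slot receives the same module object, and sys.modules
is consulted only once, after exec_module returns:

```python
import importlib.abc
import importlib.util
import sys
import types


def slot_replace(module):
    # First slot: swap the sys.modules entry for some other object.
    sys.modules[module.__name__] = types.SimpleNamespace(replaced=True)


def slot_record(module):
    # Second slot: still gets the original module object.
    module.seen_by_second_slot = True


class SlotLoader(importlib.abc.Loader):
    """Hypothetical loader that runs its exec slots in a loop."""

    def __init__(self, slots):
        self.slots = slots

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        # Operates on the module object only; no sys.modules lookups here.
        for slot in self.slots:
            slot(module)


def load(spec):
    """Simplified model of the PEP 451 loading steps."""
    module = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        sys.modules.pop(spec.name, None)
        raise
    # The only sys.modules lookup, done after exec_module has finished.
    return module, sys.modules[spec.name]


spec = importlib.util.spec_from_loader(
    "demo_replaced", SlotLoader([slot_replace, slot_record]))
original, returned = load(spec)
sys.modules.pop("demo_replaced", None)  # tidy up after the demo
```

Here the second slot still ran against the original module, while the
caller of load() gets the replacement object back.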

[snip]
>>>>>
>>>>> Modules that need to work unchanged on older versions of Python should not
>>>>> use multi-phase initialization, because the benefits it brings can't be
>>>>> back-ported.
>>>>
>>>> Given your example below, "should not" seems a bit strong to me.  In
>>>> fact, what are the objections to encouraging the approach from the
>>>> example?
>>>
>>> Agreed, "should not" is probably too strong here. On the other hand,
>>> preserving compatibility with older Python versions in a module that
>>> has been updated to rely on multi-phase initialization is likely to be
>>> a matter of "graceful degradation", rather than being able to
>>> reproduce comparable functionality (which I believe may have been the
>>> point Petr was trying to convey).
>>
>> My point is that if you need graceful degradation, your best bet is to
>> stick with single-phase init. Then you'll have one code path that works
>> the same on all versions.
>> If you *need* the features of multi-phase init, you need to remove
>> support for Pythons that don't have it.
>> If you need both backwards compatibility and multi-phase init, you
>> essentially need to create two modules (with shared contents), and make
>> sure they end up in the same state after they're loaded.
>>
>>> I expect Cython and SWIG may be able to manage that through
>>> appropriate use of #ifdef's in the generated code, but doing it by
>>> hand is likely to be painful, hence the potential benefits of just
>>> sticking with single-phase initialisation for the time being.
>>
>> Yes, code generators are in a position to create two versions of the
>> module, and select one using #ifdef.
>>
>> The example in the PEP is helpful for other reasons than encouraging
>> #ifdef: it shows what needs to change when porting. Think of it as a diff :)
> 
> It may be worth being more clear about that.

OK

[snip]
>>>>> The mechanism is designed to make this easy, but care is still required
>>>>> on the part of the extension author.
>>>>> No user-defined functions, methods, or instances may leak to different
>>>>> interpreters.
>>>>> To achieve this, all module-level state should be kept in either the module
>>>>> dict, or in the module object's storage reachable by PyModule_GetState.
>>>>
>>>> Is this programmatically enforceable?
>>
>> No. (I believe you could even prove this formally.)
>>
>>>> Is there any mechanism for easily copying module state?
>>
>> No. This would be impossible to provide in the general case. It's the
>> responsibility of your C code.
>> That said, if you need to copy module state, chances are your design
>> could use some rethinking.
>>
>>>> How about sharing some state between subinterpreters?
>>
>> The PyCapsule API was designed for this.
> 
> I'm simply thinking in terms of the options we have for a PEP I'm
> working on that will facilitate passing objects between
> subinterpreters and even possibly sharing some state between them.
> Currently it will be practically necessary to exclude extension
> modules from any such mechanism.  So I was wondering if there would be
> a way to allow extension module authors to define how at least some of
> the module's data could be shared between subinterpreters.

You should be able to put that info in slots. It's hard to speculate
without knowing specifics, though.

>>>> How much room is there for letting extension module
>>>> authors define how their module behaves across multiple interpreters
>>>> or across multiple Initialize/Finalize cycles?
>>
>> Technically, you have all the freedom you want. But if I embed Python
>> into my project/library, I'd want multiple sub-interpreters completely
>> isolated by default. If I use two libraries that each embed Python into
>> my app, I definitely want them isolated.
>> So the PEP tries to make it easy to keep multiple interpreters isolated.
> 
> As I just noted, I'm looking at making use of subinterpreters for a
> different use case where it *does* make sense to effectively share
> objects between them.

OK. This PEP isn't designed for that, but it should offer enough
extensibility.

[snip]
>>> This section is missing any explanation of the impact on
>>> Python/import.c, on the _imp/imp module, and on the 3 finders/loaders
>>> in Lib/importlib/_bootstrap[_external].py (builtin/frozen/extension).
>>
>> I'll add a summary.
>>
>> The internal _imp module will have backwards incompatible changes --
>> functions will be added and removed as necessary. That's what the
>> underscore means :)
> 
> Be careful with that assumption.  We've had plenty of experiences
> where the assumption became unreliable.

That's why I provide backcompat shims for undocumented, deprecated
functions in "imp". But _imp is just too low-level to do that easily.