[Import-SIG] PEP 489: Redesigning extension module loading

Tue Mar 24 17:34:34 CET 2015

I'll share my notes on an API with PEP 384-style slots, before 
attempting to write it out in PEP language.

I struggled to find a good name for the "PyType_Spec" equivalent, since 
ModuleDef and ModuleSpec are both taken, but then I realized that, if 
the docstring is put in a slot, I just need an array of slots...

Does the following look reasonable?

in moduleobject.h:

typedef struct PyModule_Slot {
     int slot;
     void *pfunc;
} PyModule_Slot;

typedef struct PyModule_StateDef {
     int size;
     traverseproc m_traverse;
     inquiry m_clear;
     freefunc m_free;
} PyModule_StateDef;

#define Py_m_doc 1
#define Py_m_create 2
#define Py_m_methods 3
#define Py_m_statedef 4
#define Py_m_exec 5

in the extension:

static PyMethodDef spam_methods[] = {
     {"demo", (PyCFunction)spam_demo,  ...},
     {NULL, NULL}
};

static PyModule_StateDef spam_statedef[] = {
     sizeof(spam_state_t),
     spam_state_traverse,
     spam_state_clear,
     spam_state_free
     /* any of those three can be NULL if not needed */
};

static PyModule_Slot spam_slots[] = {
     {Py_m_doc, PyDoc_STR("A spammy module")},
     {Py_m_methods, spam_methods},
     {Py_m_statedef, spam_statedef},
     {Py_m_exec, spam_exec},
     {0, NULL}
};

PyModule_Slot *PyModuleInit_spam(void) {
     return spam_slots;
}

There are both Create and Exec slots, among others; authors can use 
whichever they need.

If you set the Py_m_create slot, then you can't also set Py_m_statedef. All 
the other items are honored (including name and doc, which will be set 
by the module machinery – but name might not match).
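
As a rough illustration of the loader side, the sketch below walks a 
zero-terminated slot array and rejects the create + statedef combination. 
The Demo_ names are stand-ins invented here so that it builds without 
Python.h; nothing below is an existing CPython API, and the real check 
could well look different.

```c
#include <stddef.h>

/* Stand-ins for the proposed names, so the sketch builds without Python.h. */
typedef struct {
    int slot;
    void *pfunc;
} Demo_Slot;

enum {
    Demo_m_doc = 1,
    Demo_m_create = 2,
    Demo_m_methods = 3,
    Demo_m_statedef = 4,
    Demo_m_exec = 5
};

/* Return the payload of the first slot with the given id, or NULL. */
static void *demo_find_slot(const Demo_Slot *slots, int id)
{
    for (; slots->slot != 0; slots++) {
        if (slots->slot == id)
            return slots->pfunc;
    }
    return NULL;
}

/* Return 0 if the combination is acceptable, -1 if both a create slot
   and a statedef slot are present (assumed rule, taken from the text). */
static int demo_check_slots(const Demo_Slot *slots)
{
    if (demo_find_slot(slots, Demo_m_create) != NULL &&
        demo_find_slot(slots, Demo_m_statedef) != NULL)
        return -1;
    return 0;
}
```

A slot that is absent simply comes back as NULL, so an author only lists 
the slots they care about.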

The exec method is tied to the module; it's only called on modules 
created from the description (or ones that look as if they were, in 
runpy's case).
It is called only once for each module; reload()ing an extension module 
will only reset import-related attributes (as it does now).

If you don't set Py_m_create, you'll be able to run the module with 
python -m.

For non-ASCII module names: the X in PyModuleGetDesc_X will be in 
punycode (s/-/_/), PyModuleDesc.name in UTF-8, and filename in the 
filesystem encoding.
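
The s/-/_/ step could be sketched as below. This assumes the caller has 
already converted the module name to punycode; demo_hook_name is a 
hypothetical helper, not an existing function, and the punycode encoding 
itself is left out.

```c
#include <stdio.h>

/* Hypothetical helper: writes "PyModuleGetDesc_<name>" into out, where
   encoded is the module name already converted to punycode by the caller.
   Returns 0 on success, -1 if the buffer is too small. */
static int demo_hook_name(const char *encoded, char *out, size_t outsize)
{
    int n = snprintf(out, outsize, "PyModuleGetDesc_%s", encoded);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    /* Hyphens are not valid in C identifiers, hence s/-/_/. */
    for (char *p = out; *p != '\0'; p++) {
        if (*p == '-')
            *p = '_';
    }
    return 0;
}
```

The resulting string is the symbol name a loader would then look up in 
the shared library.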

I've thought about supporting multiple modules per extension, but I 
don't see a clear way to do that. The standard ModuleSpec machinery 
assumes one module per file, and it's not straightforward to get around 
that. To load more modules from an extension, you'd need a custom finder 
or loader anyway. So I'm going to implement helpers needed to load a 
module given an arbitrary PyModuleDesc, and leave implementing multi-module 
support to people who need it for now.
So, an "inittab" is out for now.

Perhaps a slot for automatically adding classes (from array of 
PyType_Spec) would help PyType_Spec adoption.
And then a slot adding string/int/... constants from arrays of 
name/value would mean most modules wouldn't need an exec function.
And an "inittab" slot should be possible for package-style extensions.
I'll leave these ideas out for now, but possibilities for extending are 
there.

On 03/21/2015 06:38 PM, Stefan Behnel wrote:
> Petr Viktorin wrote on 21.03.2015 at 11:30:
>> It would be nice to extend runpy to handle Create+Exec modules. If this can
>> be pulled off, there'd be no need for Exec-only modules except the
>> convenience.
>>
>> * module reloading is useless for extension modules – a changed version
>> can't be read from the disk, and correct reload behavior is another
>> corner case for authors to think about
>
> I think even shared library reloading could be achieved by using a filename
> scheme like "modulename-HASH.so" with a SHA hash of the source file or so,
> if the original module name is used to run the right module init function(s).
>
> The files would pile up in memory, though (there's usually no "dynamic
> unlinking"), so it's not a feature for production. I generally agree that
> there is little enough of a use case for reloading that it can safely be
> ignored.
>
>
>> One thing I'm not clear about: what are the advantages of a module subclass
>> over a normal module with m_size>0?
>
> Properties and methods. In fact, you should rather ask why module objects
> have to be special in the first place.
>
> My initial idea was to implement *only* an extension type in extension
> modules, and have the library loader instantiate that. It would simply pass
> the module spec as constructor argument. However, Nick convinced me at the
> time that that's a) too inflexible and b) too cumbersome for manually
> written code. That eventually brought up the idea of splitting the
> initialisation into Create+Exec.
>
>
>>> I thought Brett actually implemented multi-module extension support a
>>> while back (which this PEP would then inherit), but I can't find any
>>> current evidence of that change, so either my recollection is wrong,
>>> or my search skills are failing me :)
>>
>> It's there, grep issue16421.
>
> Thanks. I didn't know about it.
>
>
>> Separating Create and Exec has these effects:
>> - Allowing you to implement just one and leave the rest to default
>> machinery. This is good.
>> - Allowing some time to pass between the calls to Create and Exec. This
>> might be useful for lazy loading, I guess.
>> - Allowing the loader or third-party code to modify the object between
>> the calls to Create and Exec. This is dangerous (for consenting adults who
>> don't mind the occasional segfault).
>
> Depends on what they do with the object. Setting attributes on it should be
> ok, for example. In fact, I would like to leave it to CPython to set
> attributes like "__name__" and "__file__" on it, because that simplifies
> the implementation of a Create function. From time to time, the module
> interface is extended with new attributes, so setting them externally
> avoids the need to adapt the user code each time.
>
> However, an API helper function could be provided that copies attributes
> from the module spec to the 'module' object. Calling that is simple enough,
> and it would leave the responsibility for the evolution of the "standard
> module API" in CPython.
>
>
>> - Allowing Exec to be called multiple times after Create, i.e. module
>> reloading. I don't think there is a use case (and for module-specific cases
>> it can be done in a separately exported function).
>> - Allowing Exec without the corresponding Create, i.e. loading into
>> arbitrary objects. This is cool, and it mimics what source modules can do,
>> but I'm less and less convinced that it's actually useful.
>>
>> It's a lot to think about if you want to design a module that behaves
>> correctly, and for some combinations it's not clear what "correctly" means.
>
> I agree. I think we can leave out these two "features".
>
>
>>> The API design for defining types through the stable ABI
>>> (https://www.python.org/dev/peps/pep-0384/#type-objects), which was
>>> designed with the benefit of years of experience with the old
>>> approach, is much nicer, as the NULL-terminated list of named slots
>>> lets you only worry about the slots you care about, and the
>>> interpreter takes care of everything else.
>>
>> Well, if we end up needing to extend PyModuleDef, let's use slots.
>
> That means we have to enable support for that now. And we have to integrate
> it with the way to provide the PyModuleDef in the first place (note that
> extending PyModuleDef itself is not an option due to the stable ABI).
> Meaning, users who don't want to provide a Create function will still have
> to deal with the (empty) slots, and everyone else will currently have to
> provide a one-slot "create" entry.
>
> I'm not saying it's a bad idea, but it might not be a good one either.
>
>
>> Another possible extension is hooks for resources. Imagine using Cython
>> like zipapp, to pack an entire app including extensions into one file.
>
> This can already be done. Note that there is no actual need for a native
> module to be called by "python -m". You can also just add a C main()
> function and start up an embedded CPython runtime in it. Cython can already
> generate this main() function for you.
>
> However, being able to "python -m" a native module (or package) would be
> nice for consistency and also support running it from the PYTHONPATH, which
> is a major convenience feature.
>
>
>>> With the current design of PEP 489, the idea is that if you don't
>>> really care about the module object, you just define Exec, and the
>>> interpreter gives you a standard Python level module object. All your
>>> global state still gets stored as Python objects, and you just get the
>>> "C execution model with the Python data model" development experience
>>> which is actually quite a nice environment to program in.
>>>
>>> However, if you want straightforward access to the C *data* model at
>>> runtime as well as its execution model, then you can define Create and
>>> use the existing PyModule_Create APIs, or (as a new feature) a custom
>>> module subclass or a completely custom type, to define how your module
>>> state is stored.
>>
>> The problem is that to add C data, you'd either need to define a whole
>> extra hook, or jump through inefficient PyCapsule hoops on every access. I
>> worry that module authors will just take the path of least resistance, and
>> use static data. I think it's substantially better to say "use
>> sizeof(mydata) instead of 0, and use this fast function/macro to get at
>> your data".
>
> Yes, there should be a fast default way to do that. Otherwise, people will
> just invent their own. The advantage of subinterpreter support and module
> finalisation isn't immediately obvious, the advantage of fast access to
> global state definitely is.
>
>
>>> That two level approach gives you all the same flexibility you have
>>> today by defining a custom Init hook (and more), but also lets you opt
>>> out of learning most of the details of the C data model if all you're
>>> really after is faster low level manipulation of data stored in Python
>>> objects.
>>
>> A module def array additionally gives:
>> - support for non-ASCII module names
>> - a catalog of the modules the extension contains
>> but you can't use custom module subclasses -- unless a create slot is added
>> to the module def. (Or you can replace the sys.modules entry -- I believe
>> the overhead of a wasted empty module object is negligible.)
>
> Yes, I guess it would be. However, the replacement must happen before other
> code might access the module (e.g. by importing it), i.e. right after
> putting it into sys.modules, at the very start of the Exec step.
>
> It does feel like a hack, though, to design an interface that says "here's
> your module, throw it away if you like, but make sure to clean up what I
> left behind"...
>
> Stefan
>