[Python-Dev] Pre-PEP: Redesigning extension modules

Fri Aug 23 10:50:18 CEST 2013

Hi,

this has been subject to a couple of threads on python-dev already, for
example:

http://thread.gmane.org/gmane.comp.python.devel/135764/focus=140986

http://thread.gmane.org/gmane.comp.python.devel/141037/focus=141046

It originally came out of issues 13429 and 16392.

http://bugs.python.org/issue13429

http://bugs.python.org/issue16392

Here's an initial attempt at a PEP for it. It is based on the (unfinished)
ModuleSpec PEP, which is being discussed on the import-sig mailing list.

http://mail.python.org/pipermail/import-sig/2013-August/000688.html

Stefan

PEP: 4XX
Title: Redesigning extension modules
Version: $Revision$
Last-Modified: $Date$
Author: Stefan Behnel <stefan_ml at behnel.de>
BDFL-Delegate: ???
Discussions-To: ???
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 11-Aug-2013
Python-Version: 3.4
Post-History: 23-Aug-2013
Resolution:

Abstract
========

This PEP proposes a redesign of the way in which extension modules interact
with the interpreter runtime. This was last revised for Python 3.0 in PEP
3121, but did not solve all problems at the time. The goal is to solve them
by bringing extension modules closer to the way Python modules behave.

An implication of this PEP is that extension modules can use arbitrary
types for their module implementation and are no longer restricted to
types.ModuleType. This makes it easy to support properties at the module
level and to safely store arbitrary global state in the module that is
covered by normal garbage collection and supports reloading and
sub-interpreters.

Motivation
==========

Python modules and extension modules are not being set up in the same way.
For Python modules, the module is created and set up first, then the module
code is being executed. For extensions, i.e. shared libraries, the module
init function is executed straight away and does both the creation and
initialisation. This means that it knows neither the __file__ it is being
loaded from nor its package (i.e. its fully qualified module name, FQMN).
This hinders relative imports and resource loading. In Py3, it's also not
being added to sys.modules, which means that a (potentially transitive)
re-import of the module will really try to reimport it and thus run into an
infinite loop when it executes the module init function again. And without
the FQMN, it is not trivial to correctly add the module to sys.modules either.

This is specifically a problem for Cython generated modules, for which it's
not uncommon that the module init code has the same level of complexity as
that of any 'regular' Python module. Also, the lack of a FQMN and correct
file path hinders the compilation of __init__.py modules, i.e. packages,
especially when relative imports are being used at module init time.

Furthermore, the majority of currently existing extension modules has
problems with sub-interpreter support and/or reloading and it is neither
easy nor efficient with the current infrastructure to support these
features. This PEP also addresses these issues.

The current process
===================

Currently, extension modules export an initialisation function named
"PyInit_modulename", named after the file name of the shared library. This
function is executed by the import machinery and must return either NULL in
the case of an exception, or a fully initialised module object. The
function receives no arguments, so it has no way of knowing about its
import context.

During its execution, the module init function creates a module object
based on a PyModuleDef struct. It then continues to initialise it by adding
attributes to the module dict, creating types, etc.

In the back, the shared library loader keeps a note of the fully qualified
module name of the last module that it loaded, and when a module gets
created that has a matching name, this global variable is used to determine
the FQMN of the module object. This is not entirely safe as it relies on
the module init function creating its own module object first, but this
assumption usually holds in practice.

The main problem in this process is the missing support for passing state
into the module init function, and for safely passing state through to the
module creation code.

The proposal
============

The current extension module initialisation will be deprecated in favour of
a new initialisation scheme. Since the current scheme will continue to be
available, existing code will continue to work unchanged, including binary
compatibility.

Extension modules that support the new initialisation scheme must export a
new public symbol "PyModuleCreate_modulename", where "modulename" is the
name of the shared library. This mimics the previous naming convention for
the "PyInit_modulename" function.

This symbol must resolve to a C function with the following signature::

    PyObject* (*PyModuleTypeCreateFunction)(PyObject* module_spec)

The "module_spec" argument receives a "ModuleSpec" instance, as defined in
PEP 4XX (FIXME). (All names are obviously up for debate and bike-shedding
at this point.)

When called, this function must create and return a type object, either a
Python class or an extension type that is allocated on the heap. This type
will be instantiated as module instance by the importer.

There is no requirement for this type to be exactly or a subtype of
types.ModuleType. Any type can be returned. This follows the current
support for allowing arbitrary objects in sys.modules and makes it easier
for extension modules to define a type that exactly matches their needs for
holding module state.

The constructor of this type must have the following signature::

    def __init__(self, module_spec):

The "module_spec" argument receives the same object as the one passed into
the module type creation function.

Implementation
==============

XXX - not started

Reloading and Sub-Interpreters
==============================

To "reload" an extension module, the module create function is executed
again and returns a new module type. This type is then instantiated as by
the original module loader and replaces the previous entry in sys.modules.
Once the last references to the previous module and its type are gone, both
will be subject to normal garbage collection.

Sub-interpreter support is an inherent property of the design. During
import in the sub-interpreter, the module create function is executed
and returns a new module type that is local to the sub-interpreter. Both
the type and its module instance are subject to garbage collection in the
sub-interpreter.

Open questions
==============

It is not immediately obvious how extensions should be handled that want to
register more than one module in their module init function, e.g. compiled
packages. One possibility would be to leave the setup to the user, who
would have to know all FQMNs anyway in this case (or could construct them
from the module spec of the current module), although not the import
file path. A C-API could be provided to register new module types in the
current interpreter, given a user provided ModuleSpec.

There is no inherent requirement for the module creation function to
actually return a type. It could return a arbitrary callable that creates a
'modulish' object when called. Should there be a type check in place that
makes sure that what it returns is a type? I don't currently see a need for
this.

Copyright
=========

This document has been placed in the public domain.