[Import-SIG] Round 2 for "A ModuleSpec Type for the Import System"

Sun Aug 11 15:03:00 CEST 2013

I think this is solid enough to be worth adding to the PEPs repo now.

On 9 August 2013 18:58, Eric Snow <ericsnowcurrently at gmail.com> wrote:
> Here's an updated version of the PEP for ModuleSpec which addresses the
> feedback I've gotten.  Thanks for the help.  The big open question, to me,
> is whether or not to have a separate reload() method.  I'll be looking into
> that when I get a chance.  There's also the question of a path-based
> subclass, but I'm currently not convinced it's worth it.

One piece of feedback from me (triggered by the C extension modules
discussion on python-dev): we should consider proposing a new "exec"
hook for C extension modules that could be defined instead of or in
addition to the existing PEP 3121 init hook.

Extension modules that don't rely on mutable static variables or the
PEP 3121 per-interpreter state APIs could just define the new exec
hook and get a new module instance every time they're imported. Those
that do have per-interpreter state would still get an opportunity to
run additional code after all the magic attributes have been set.

Also, to handle the extension module case, we may need to let loaders
define an optional "create_module" method that accepts the MethodSpec
object as an argument. The extension module loader would implement
this as handling the PyInit_<modulename> call. (Setting the magic
attributes according to the spec would happen automatically after the
call, so each loader wouldn't need to implement that part)

(Note: once I get back to Australia around the 22nd, I should have
time to help out more directly with this)

> -----------------------------------

> Firstly, any time the import system needs to save information about a
> module we end up with more attributes on module objects that are
> generally only meaningful to the import system and occoasionally to some

Typo: occoasionally

> people.  It would be nice to have a per-module namespace to put future
> import-related information.  Secondly, there's an API void between
> finders and loaders that causes undue complexity when encountered.
>
> Finders are strictly responsible for providing the loader which the

"are currently responsible" (since the PEP is about changing the
responsibiity of finders, this is a little unclear at present)

> Specification
> =============
>
> The goal is to address the gap between finders and loaders while
> changing as little of their semantics as possible.  Though some
> functionality and information is moved the new ``ModuleSpec`` type,

"moved to the new"

> their semantics should remain the same.  However, for the sake of
> clarity, those semantics will be explicitly identified.
>
> A High-Level View
> -----------------
>
> ...

Not sure a high level view is needed, but you can fill this in if you want :)

>
> ModuleSpec
> ----------
>
> A new class which defines the import-related values to use when loading
> the module.  It closely corresponds to the import-related attributes of
> module objects.  ``ModuleSpec`` objects may also be used by finders and
> loaders and other import-related APIs to hold extra import-related
> state about the module.  This greatly reduces the need to add any new
> new import-related attributes to module objects, and loader ``__init__``
> methods won't need to accommodate such per-module state.

To avoid conflicts as the spec attributes evolve in the future, would
it be worth having a "custom" field which is just an arbitrary object
reference used to pass info from the finder to the loader without
troubling the rest of the import system?

> Creating a ModuleSpec:
>
> ``ModuleSpec(name, loader, *, origin=None, filename=None, cached=None,
> path=None)``
>
> The parameters have the same meaning as the attributes described below.
> However, not all ``ModuleSpec`` attributes are also parameters.
>  The
> passed values are set as-is.  For calculated values use the
> ``from_loader()`` method.

This paragraph isn't particularly clear. Perhaps:

"Passed in parameter values are assigned directly to the corresponding
attributes below. Other attributes not listed as parameters (such as
``package``) are read-only properties that are automatically derived
from these values.

The ``ModuleSpec.from_loader()`` class method allows a suitable
ModuleSpec instance to be easily created from a PEP 302 loader object"

> ModuleSpec Attributes
> ---------------------
>
> Each of the following names is an attribute on ``ModuleSpec`` objects.
> A value of ``None`` indicates "not set".  This contrasts with module
> objects where the attribute simply doesn't exist.
>
> While ``package`` and ``is_package`` are read-only properties, the
> remaining attributes can be replaced after the module spec is created
> and after import is complete.  This allows for unusual cases where
> modifying the spec is the best option.  However, typical use should not
> involve changing the state of a module's spec.

I'm with Brett that "is_package" should go, to be replaced by
"spec.path is not None" wherever it matters. is_package() would then
fall through to the PEP 302 loader API via __getattr__.

> ``package``
>
> The name of the module's parent.  This is a dynamic attribute with a
> value derived from ``name`` and ``is_package``.  For packages it is the
> value of ``name``.  Otherwise it is equivalent to
> ``name.rpartition('.')[0]``.  Consequently, a top-level module will have
> give the empty string for ``package``.

s/give//

> ``is_package``
>
> Whether or not the module is a package.  This dynamic attribute is True
> if ``path`` is set (even if empty), else it is false.

As above (i.e. don't use it)

> ``origin``
>
> A string for the location from which the module originates.  If
> ``filename`` is set, ``origin`` should be set to the same value unless
> some other value is more appropriate.  ``origin`` is used in
> ``module_repr()`` if it does not match the value of ``filename``.
>
> Using ``filename`` for this meaning would be inaccurate, since not all
> modules have path-based locations.  For instance, built-in modules do
> not have ``__file__`` set.  Yet it is useful to have a descriptive
> string indicating that it originated from the interpreter as a built-in
> module.  So built-in modules will have ``origin`` set to ``"built-in"``.

How about we *just* have origin, with a separate "set_fileattr"
attribute to indicate "this is a discrete file, you should set
__file__"?

Also, we should explicitly note that we'll still set __file__ for zip
imports, due to backwards compatibility concerns, even though it
doesn't correspond to a valid filesystem path.

(Random thought: spec.origin + spec.cached + a cache directory setting
in zipimport would give a potentially clean way to do extension module
imports from zip archives)

> ``path``
>
> The list of path entries in which to search for submodules if this
> module is a package.  Otherwise it is ``None``.

Path entries don't have to correspond to filesystem locations - they
just have to make sense to at least one path hook
(e.g. a DB URI would be a valid path entry).

> .. XXX add a path-based subclass?

Nope :)

> ModuleSpec Methods
> ------------------
>
> ``from_loader(name, loader, *, is_package=None, origin=None, filename=None,
> cached=None, path=None)``
>
> .. XXX use a different name?

I'd disallow customisation on this one - if people want to customise,
they should just query the PEP 302 APIs themselves and call the
ModuleSpec constructor directly. The use case for this one should be
to make it trivial to switch from "return loader" to "return
ModuleSpec.from_loader(loader)" in a find_module implementation.

> In contrast to ``ModuleSpec.__init__()``, which takes the arguments
> as-is, ``from_loader()`` calculates missing values from the ones passed
> in, as much as possible.  This replaces the behavior that is currently
> provided the several ``importlib.util`` functions as well as the
> optional ``init_module_attrs()`` method of loaders.  Just to be clear,
> here is a more detailed description of those calculations::
>
>    If not passed in, ``filename`` is to the result of calling the
>    loader's ``get_filename()``, if available.  Otherwise it stays
>    unset (``None``).
>
>    If not passed in, ``path`` is set to an empty list if
>    ``is_package`` is true.  Then the directory from ``filename`` is
>    appended to it, if possible.  If ``is_package`` is false, ``path``
>    stays unset.
>
>    If ``cached`` is not passed in and ``filename`` is passed in,
>    ``cached`` is derived from it.  For filenames with a source suffix,
>    it set to the result of calling
>    ``importlib.util.cache_from_source()``.  For bytecode suffixes (e.g.
>    ``.pyc``), ``cached`` is set to the value of ``filename``.  If
>    ``filename`` is not passed in or ``cache_from_source()`` raises
>    ``NotImplementedError``, ``cached`` stays unset.
>
>    If not passed in, ``origin`` is set to ``filename``.  Thus if
>    ``filename`` is unset, ``origin`` stays unset.

Hmm, is there a reason this can't be the default constructor
behaviour? What's the value of *not* having the sensible fallbacks,
given they can always be overridden by passing in explicit values when
you want something different?

A separate "from_module(m)" constructor would probably make sense, though.

> ``module_repr()``
>
> Returns a repr string for the module if ``origin`` is set and
> ``filename`` is not set.  The string refers to the value of ``origin``.
> Otherwise ``module_repr()`` returns None.  This indicates to the module
> type's ``__repr__()`` that it should fall back to the default repr.
>
> We could also have ``module_repr()`` produce the repr for the case where
> ``filename`` is set or where ``origin`` is not set, mirroring the repr
> that the module type produces directly.  However, the repr string is
> derived from the import-related module attributes, which might be out of
> sync with the spec.
>
> .. XXX Is using the spec close enough?  Probably not.

I think it makes sense to always return the expected repr based on the
spec attributes, but allow a custom origin to be passed in to handle
the case where the module __file__ attribute differs from
__spec__.origin (keeping in mind I think __spec__.filename should be
replaced with __spec__.set_fileattr)

> The implementation of the module type's ``__repr__()`` will change to
> accommodate this PEP.  However, the current functionality will remain to
> handle the case where a module does not have a ``__spec__`` attribute.

Experience tells us that the import system should ensure the __spec__
attribute always exists (even if it has to be filled in from the
module attributes after calling load_module)

> ``load(module=None, *, is_reload=False)``

Yep, definitely needs to be a separate method. "is_reload" would
almost always be set to a boolean, which means a separate API is
likely to be better.

However, I think the separate method should be "exec()" rather than
"reload()" and require that the module always be passed in.

We could also expose a "create" method that just creates and returns
the new module object, and replace importlib.util.module_to_load with
a context manager that accepted the module as a parameter. Say
"add_to_sys", which fails if the module is already present in
sys.modules.

load() would then look something like:

    def load(self):
        m = self.create()
        with importlib.util.add_to_sys(m):
            self.exec(m)
        return sys.modules[self.name]

We could also provide reload() if we wanted to:

    def reload(self):
        self.exec(sys.modules[self.name])
        return sys.modules[self.name]

> Subclassing
> -----------
>
> Subclasses of ModuleSpec are allowed, but should not be necessary.
> Adding functionality to a custom finder or loader will likely be a
> better fit and should be tried first.  However, as long as a subclass
> still fulfills the requirements of the import system, objects of that
> type are completely fine as the return value of ``find_module()``.

We may need to do subclasses for the ABC registration backwards
compatibility hack.

>
> Module Objects
> --------------
>
> Module objects will now have a ``__spec__`` attribute to which the
> module's spec will be bound.  None of the other import-related module
> attributes will be changed or deprecated, though some of them could be;
> any such deprecation can wait until Python 4.
>
> ``ModuleSpec`` objects will not be kept in sync with the corresponding
> module object's import-related attributes.  Though they may differ, in
> practice they will typically be the same.

Worth mentioning that __main__.__spec__.name will give the real name
of module's executed with -m here rather than delaying that until the
notes at the end.

> Finders
> -------
>
> Finders will now return ModuleSpec objects when ``find_module()`` is
> called rather than loaders.  For backward compatility, ``Modulespec``
> objects proxy the attributes of their ``loader`` attribute.
>
> Adding another similar method to avoid backward-compatibility issues
> is undersireable if avoidable.  The import APIs have suffered enough,
> especially considering ``PathEntryFinder.find_loader()`` was just
> added in Python 3.3.  The approach taken by this PEP should be
> sufficient to address backward-compatibility issues for
> ``find_module()``.
>
> The change to ``find_module()`` applies to both ``MetaPathFinder`` and
> ``PathEntryFinder``.  ``PathEntryFinder.find_loader()`` will be
> deprecated and, for backward compatibility, implicitly special-cased if
> the method exists on a finder.

Actually, we don't currently have anything on ModuleSpec to indicate
"this is complete, stop scanning for more path fragments" or how we
will compose multiple module specs for the individual fragments into a
combined spec for the namespace package.

> Finders are still responsible for creating the loader.  That loader will
> now be stored in the module spec returned by ``find_module()`` rather
> than returned directly.  As is currently the case without the PEP, if a
> loader would be costly to create, that loader can be designed to defer
> the cost until later.
>
> Loaders
> -------
>
> Loaders will have a new method, ``exec_module(module)``.  Its only job
> is to "exec" the module and consequently populate the module's
> namespace.  It is not responsible for creating or preparing the module
> object, nor for any cleanup afterward.  It has no return value.
>
> The ``load_module()`` of loaders will still work and be an active part
> of the loader API.  It is still useful for cases where the default
> module creation/prepartion/cleanup is not appropriate for the loader.
>
> For example, the C API for extension modules only supports the full
> control of ``load_module()``.  As such, ``ExtensionFileLoader`` will not
> implement ``exec_module()``.  In the future it may be appropriate to
> produce a second C API that would support an ``exec_module()``
> implementation for ``ExtensionFileLoader``.  Such a change is outside
> the scope of this PEP.

As above, I think it may worth tackling this. It shouldn't be *that*
hard given the higher level changes and will solve some hard problems
at the lower level.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia