[Import-SIG] PEP 451 (ModuleSpec) round 3

Thu Aug 29 03:57:40 CEST 2013

On Wed, Aug 28, 2013 at 11:22 AM, Brett Cannon <brett at python.org> wrote:

> On Wed, Aug 28, 2013 at 4:50 AM, Eric Snow <ericsnowcurrently at gmail.com>wrote:
>
>> This PEP proposes to add a new class to ``importlib.machinery`` called
>> ``ModuleSpec``.  It will be authoritative for all the import-related
>> information about a module, and will be available without needing to
>> load the module first.  Finders will provide a module's spec instead of
>> a loader.
>>
>
> Don't you mean finders will return a ModuleSpec? Since 'loader' is still
> defined in the ModuleSpec to know what loader to use that statement that
> finders don't provide a loader is misleading.
>

Yeah, finders will still create the loaders.  I'll make that more clear.

importlib.machinery.ModuleSpec (new)
>>
> ------------------------------------
>>
>
> For this entire section you need to provide the call signatures as you
> start talking semantics later w/o making clear what is being passed and
> returned before going into detail of the individual methods. Otherwise move
> the detailed discussion of the methods up to before the semantics overview.
>

I'll try moving the detailed descriptions up.

>
>
>>
>> Attributes:
>>
>> * name - a string for the name of the module.
>> * loader - the loader to use for loading and for module data.
>> * origin - a string for the location from which the module is loaded.
>>
>
> I would give an "e.g." here to help explain what you mean. As previous
> comments have shown, the name alone is not enough to understand what value
> should go here. =)
>

Good point. :)

>
>
>> * submodule_search_locations - strings for where to find submodules,
>>   if a package.
>> * loading_info - a container of data for use during loading (or None).
>> * cached (property) - a string for where the compiled module will be
>>   stored.
>> * is_location (RO-property) - the module's origin refers to a location.
>>
>> .. XXX Find a better name than loading_info?
>>
>
> loading_data is all that I can think of
>
>
>>  .. XXX Add ``submodules`` (RO-property) - returns possible submodules
>>    relative to spec (or None)?
>>
>
> Actual use-case or are you just guessing there will be a use? Don't add
> any fields that we have not seen an actual need for.
>

I was thinking of what Nick said about downplaying the module/package
distinction, since a package is just a module with possible submodules.  So
then I thought, what if there were an easy way to see what submodules a
module has available?  Non-packages would always have 0 and packages would
have 0 or more.  I'd use that if it existed.

This was mostly a passing idea that would need more thought (the
implementation might be tricky).  I agree it doesn't need to be in the PEP.

>
>> .. XXX Add ``loaded`` (RO-property) - the module in sys.modules, if any?
>>
>
> Too easy to figure out with ``name in sys.modules`` and can go stale
> (unless you make this a property).
>

This was a light attempt at lowering the barrier to entry with regards to
the import system.  I was thinking of how you have to know to look in
sys.modules to see if the module is loaded.  Providing a property
effectively hides sys.modules as an implementation detail.  I was also
thinking of this as "installed".

Regardless, that I left this as a comment reflects my uncertainty of its
utility.

> Python users will be able to inspect a module's ``__spec__`` to get
>> import-related information about the object.  Generally, they will not
>> be using the ``ModuleSpec`` factory methods nor the instance methods.
>>
>
> As of right now no one is using the instance methods based on the wording
> in this section. =)
>

Yeah, that would read better as something like "Generally, Python
applications and interactive users will not...".

>
>> However, each spec has methods named ``create``, ``exec``, ``load``, and
>> ``reload``.  Since they are so easy to access (and misunderstand/abuse),
>> their function and availability require explicit consideration in this
>> proposal.
>>
>>
>> What Will Existing Finders and Loaders Have to Do Differently?
>> ==============================================================
>>
>> Immediately?  Nothing.  The status quo will be deprecated, but will
>> continue working.  However, here are the things that the authors of
>> finders and loaders should change relative to this PEP:
>>
>> * Implement ``find_spec()`` on finders.
>> * Implement ``exec_module()`` on loaders, if possible.
>>
>> The factory methods of ``ModuleSpec`` are intended to be helpful for
>> converting existing finders.  ``from_loader()`` and
>> ``from_file_location()`` are both straight-forward utilities in this
>> regard.
>>
>
> If this holds to be true then they should go into importlib.util and kept
> out of the general object since dir(module_spec) shouldn't need to show the
> methods indefinitely.
>

I've actually been vacillating for days between classmethods and
importlib.util, and at one point I even made a meta class and put the
factories there (to keep them out of instances).  At the moment
importlib.util is making more sense.

>
>
>>  In the case where loaders already expose methods for creating
>> and preparing modules, a finder may use ``ModuleSpec.from_module()`` on
>> a throw-away module to create the appropriate spec.
>>
>
> Why is the module a throw-away one? And why would loaders need to
> construct a ModuleSpec?
>

This is something the Nick pointed out.  Some loaders may already have the
API to create a module and populate its attributes.  In that case, the
finder could use that API to get the module and then use
ModuleSpec.from_module() to create the spec that find_spec() would return.
 This is very explicit and direct way to map the existing import-related
info for the module onto a spec.

>
>
>>
>> As for loaders,
>>
>
> You were just talking about loader, so this is a bad transition.
>

Yeah, that did get muddled.  I was talking about how finders could use the
existing capabilities of loaders to build a spec.  I'll make that less
awkward.

> This is an outline of what happens in ``ModuleSpec.load()``.
>>
>> 1. A new module is created by calling ``spec.create()``.
>>
>>    a. If the loader has a ``create_module()`` method, it gets called.
>>       Otherwise a new module gets created.
>>    b. The import-related module attributes are set.
>>
>
> So it seems step (b) happens even if step (a) does. If that's the case
> then are attributes overridden blindly, or conditionally set? If (b)
> doesn't happen if (a) did then you need to make that clear.
>

Yeah, (b) always happens.  I was planning on having them be overridden
blindly to match the spec.  Loader.create_module() would not be responsible
for setting the import-related attributes.

>
>
>>
>> 2. The module is added to sys.modules.
>>
>
> I would add a note that there is a separate method for handling reloads
> and thus blindly setting sys.modules is acceptable.
>

Good point.  Furthermore, if the module exists in sys.modules when load()
gets called and it fails, the module will be removed from sys.modules.  Do
you think it would make sense to stick the original module back into
sys.modules in that case?  I don't because calling load() may have
side-effects on the state of that original module.

>
>
>> 3. ``spec.exec(module)`` gets called.
>>
>>    a. If the loader has an ``exec_module()`` method, it gets called.
>>       Otherwise ``load_module()`` gets called for backward-compatibility
>>       and the resulting module is updated to match the spec.
>>
>
> "resulting module found in sys.modules is".
>
> And I think you meant to make step (b) be the fallback to load_module().
>

I was thinking of it as one step with two possible paths, rather than 2
steps.

>
>
>>
>> 4. If there were any errors the module is removed from sys.modules.
>> 5. If the module was replaced in sys.modules during ``exec()``, the one
>>    in sys.modules is updated to match the spec.
>>
>
> This doesn't make sense. You just said the module got updated to match the
> spec in step 3.a. Are you saying you're going to overwrite values that
> exec_module() set? And once again, blindly updating or conditionally? And
> how are these attributes being set?
>

As with create_module(), Loader.exec_module() isn't in charge of setting
the import-related attributes (as opposed to Loader.load_module(), which
is).  However, if the module was replaced in sys.modules during
exec_module(), then we assume that the one in sys.modules has not been
touched by the spec.  So we set any of the import-related attributes on it
that aren't already set (respecting the ones that are), with the exception
of __spec__, which we always override.

Thinking about it, it may make sense in that case to create a new spec
based on the current one but deferring to any of the existing
 import-related attributes of the module.  Then that spec can be set to
__spec__.

> Since exec_module() is going to need to set these anyway for proper exec()
> use during loading then why are you setting them *again* later on? Should
> you set these first and then let the methods reset them as they see fit? I
> thought exec_module() took in a filled-in module anyway, so didn't you have
> to set all the attributes prior to passing it in anyway in step 1.a? In
> that case this is a reset which seems wrong if code explicitly chose to
> change the values.
>

It's not that code chose to change the values.  It's that code chose to
stick some other object in sys.modules and we're going to return it and we
want to be sure the spec and import-related attributes are all properly set.

>
>
>> 6. The module in sys.modules is returned.
>>
>
> Or you can just provide the pseudo-code and skip all of this explanation
> and be easier to follow =) You can leave comments with step numbers if you
> want to expound upon any specific step outside of the pseudo-code:
>

That is a really good idea.  It will make more sense.

>
> class ModuleSpec:
>
>   def load(self):
>     module = self.create()
>     sys.modules[self.name] = module
>
>      try:
>       self.exec(module)
>     except:
>       try:
>         del sys.modules[self.name]
>       except KeyError:
>         pass
>     else:
>       # XXX different from proposal: didn't reset attributes
>       return sys.modules[self.name]
>
>   def create(self):
>     if hasattr(self.loader, 'create_module'):
>       module = self.loader.create_module(self)
>     else:
>       module = types.ModuleType(self.name)
>        # XXX different from proposal: didn't do it blindly after
> create_module()
>       self.init_module_attrs(module)
>     return module
>
>   def exec(self, module):
>     if hasattr(self.loader, 'exec_module'):
>       self.loader.exec_module(module)
>     elif hasattr(self.loader, 'load_module'):
>       self.loader.load_module(self.name)
>       module = sys.modules[self.name]
>     else:
>       raise TypeError('{!r} loader does not have an ' +
>                       'exec_module or load_module
> method'.format(self.loader))
>     return module
>
> ...
>
>
>>  ========================== ===========
>>
> On ModuleSpec              On Modules
>> ========================== ===========
>> name                       __name__
>> loader                     __loader__
>> package                    __package__
>> origin                     __file__*
>> cached                     __cached__*
>>
>
> This shouldn't be set on extension modules, so this is another asterisk of
> has_location *and* is not None (right?).
>

Correct.  Good point.

>
>
>> submodule_search_locations __path__**
>> loading_info                \-
>> has_location (RO-property)  \-
>> ========================== ===========
>>
>> \* Only if ``is_location`` is true.
>>
>
> Should that be has_location?
>

Yep. :)

> **has_location**
>>
>> .. container::
>>
>>    Some modules can be loaded by reference to a location, e.g. a
>> filesystem
>>    path or a URL or something of the sort.  Having the location lets you
>>    load the module, but in theory you could load that module under various
>>    names.
>>
>>    In contrast, non-located modules can't be loaded in this fashion, e.g.
>>    builtin modules and modules dynamically created in code.  For these,
>> the
>>    name is the only way to access them, so they have an "origin" but not a
>>    "location".
>>
>>    This attribute reflects whether or not the module is locatable.  If it
>>    is, ``origin`` must be set to the module's location and ``__file__``
>>    will be set on the module.  Furthermore, a locatable module is also
>>    cacheable and so ``__cached__`` is tied to ``has_location``.
>>
>
> That statement about __cached__ is not true for extension modules. You're
> going to need to tweak how you define 'cached' based on this. Either that
> or you can try to use this as a justification for loader.create_module() as
> you can override these semantics there as a pure Python module is more
> common than extension modules (although this doesn't help with the
> ModuleSpec having the wrong information when returned from the finder
> unless the finder itself resets it on the ModuleSpec before returning it).
>

Yeah, I'll need to rework that.

> **create()**
>>
>> .. container::
>>
>>    A new module is created relative to the spec and its import-related
>>    attributes are set accordingly.  If the spec's loader has a
>>    ``create_module()`` method, that gets called to create the module.
>>  This
>>    give the loader a chance to do any pre-loading initialization that
>> can't
>>    otherwise be accomplished elsewhere.  Otherwise a bare module object is
>>    created.  In both cases ``init_module_attrs()`` is called on the module
>>    before it gets returned.
>>
>
> As stated earlier, I don't like the idea of blindly resetting attributes
> if set by create_module().
>

Well, create_module() shouldn't be setting them.  Are you suggesting that
there is a use case for that?

>
>
>>
>> **exec(module)**
>>
>> .. container::
>>
>>    The spec's loader is used to execute the module.  If the loader has
>>    ``exec_module()`` defined, the namespace of ``module`` is the target of
>>    execution.
>>
>
> Wait, what? You suggest it's the module in the signature but
> module.__dict__ in the explanation.
>

That's right.  The module is passed in and then exec_module() does its
thing with module.__dict__.  Do you think exec_module() should directly
take a dict instead?

>
>>  Otherwise the loader's ``load_module()`` is called, which
>>    ignores ``module`` and returns the module that was the actual
>>    execution target.
>>
>
> Are you pulling from sys.modules? Otherwise how are you getting the module
>  from load_module()?
>

load_module() returns the module.  For loaders that don't follow that rule
(and return None), we'll grab the module out of sys.modules.

> And you don't mention that in one case the module is not put into
> sys.modules while in the other case it is (exec_module vs. load_module).
> That dichotomy is going to be messy.
>

That difference should definitely be clear.  What is messy about it?

> Does this need to be separate from load()? If you merge it in then the
> sys.modules semantics are unified within load(). Otherwise you need to make
> this set sys.modules in either case and return from sys.modules.
>

Both load() and reload() call exec().  In the case of load(), it wraps the
exec() call with the requisite sys.modules handling.  In the case of
reload(), the module should already be in sys.modules.  Regardless,
Loader.load_module() is already required to do all the sys.modules handling
so that base should be covered.  If we deprecate that requirement, which we
could, then we have a different story.  If someone calls exec() directly
before a module is ever loaded then the sys.modules handling shouldn't
matter anyway; and if someone does that it means the spec did not come from
module.__spec__ so they probably aren't a casual user.

>
>>  In that case the import-related attributes of that
>>    module are updated to reflect the spec.
>>
>
> Why? If you already set the attributes in the module and inserted it into
> sys.modules previously then you already took care of this. Else you now are
> setting the attributes potentially *three* times (twice in create() from
> loader.create_module() + an explicit call to init_module_attr() and then
> here).
>

Loader.load_module() is still responsible for setting those attributes.
 However, it may have missed one or more (including __spec__).  We want to
make sure all the appropriate import-related attributes get set.
 Furthermore, load_module() may have overridden the values we set
previously.

Given its authority, it may make sense to update module.__spec__ to reflect
the attributes set by the loader.  That way __spec__ indicates the values
used during loading.  On the other hand, by not updating the spec, the
difference between the module attributes and the spec will reflect the ways
in which the loader did not follow the spec.  I've been following that
former line of thinking, but now I'm wondering if the latter would be
better.  Regardless, the pathological case where the module attributes set
by load_module() and the spec don't match should be pretty rare.

As to setting it multiple times, in the worst case the attributes will be
set twice.  Loader.create_module() shouldn't be setting them.

>
>>  In both cases the targeted
>>    module is the one that gets returned.
>>
>
> Huh? What exactly are you returning? You say "actual execution target"
> above for load_module() but "in both cases the target module" here. That
> seems to contradictory.
>

In the load_module() case we return the result of calling load_module() (or
the module in sys.modules if load_module() returns None).  Otherwise we
return the module that was passed in.  In their respective cases both are
the actual execution targets.

I'll reword that.

>
>
>>
>> **load()**
>>
>> .. container::
>>
>>    This method captures the current functionality of and requirements on
>>    ``Loader.load_module()`` without any semantic changes.  It is
>>    essentially a wrapper around ``create()`` and ``exec()`` with some
>>    extra functionality regarding ``sys.modules``.
>>
>>    itself in ``sys.modules`` while executing.  Consequently, the module in
>>    ``sys.modules`` is the one that gets returned by ``load()``.
>>
>>    Right before ``exec()`` is called, the module is added to
>>    ``sys.modules``.  In the case of error during loading the module is
>>    removed from ``sys.modules``.  The module in ``sys.modules`` when
>>    ``load()`` finishes is the one that gets returned.  Returning the
>> module
>>    from ``sys.modules`` accommodates the ability of the module to replace
>>    itself there while it is executing (during load).
>>
>>    As already noted, this is what already happens in the import system.
>>    ``load()`` is not meant to change any of this behavior.
>>
>>    If ``loader`` is not set (``None``), ``load()`` raises a ValueError.
>>
>
> Since the loader is required by the initializer for ModuleSpec I don't
> know if this specific check is necessary: EAFP.
>

Yeah, the check will almost always pass.  And if someone does something
they shouldn't they'll get an AttributeError really quickly anyway.

> Open Issues
>> ==============
>>
>> \* The impact of this change on pkgutil (and setuptools) needs looking
>> into.  It has some generic function-based extensions to PEP 302.  These
>> may break if importlib starts wrapping loaders without the tools'
>> knowledge.
>>
>> \* Other modules to look at: runpy (and pythonrun.c), pickle, pydoc,
>> inspect.
>>
>> \* Add ``ModuleSpec.data`` as a descriptor that wraps the data API of the
>> spec's loader?
>>
>
> No. This starts to move this away from ModuleSpec modules being a data
> storage object and more or a level of indirection around loaders.
>

Agreed.  It may be nice to have the easier access to the loader data APIs,
but doesn't quite fit.  ModuleSpec.data as a wrapper was the best I could
think of.

>
>
>>
>> \* How to limit possible end-user confusion/abuses relative to spec
>> attributes (since __spec__ will make them really accessible)?
>>
>>
>> References
>> ==========
>>
>> [1] http://mail.python.org/pipermail/import-sig/2013-August/000658.html
>>
>>
>> Copyright
>> =========
>>
>> This document has been placed in the public domain.
>>
>> ..
>>    Local Variables:
>>    mode: indented-text
>>    indent-tabs-mode: nil
>>    sentence-end-double-space: t
>>    fill-column: 70
>>    coding: utf-8
>>    End:
>>
>>
>> _______________________________________________
>> Import-SIG mailing list
>> Import-SIG at python.org
>> http://mail.python.org/mailman/listinfo/import-sig
>>
>>
>
Thanks for that review.  It helps a lot.  I'll update the PEP when I get a
chance.  From your feedback I've gathered a few things:

1. The PEP needs to be more clear on what the Loader methods (both existing
and new) are supposed to accomplish and their responsibilities regarding
sys.modules and module attributes.
2. I need to slide more toward "do less" than I have been in the balance
between keeping things simple and covering all the corner cases.
3. Whether or not the spec should be updated to reflect the attributes set
by load_module() still needs some consideration.
4. The purpose of exec() vs. load() needs some clarification (pseudo-code
should help).
5. I need more sleep. :)

-eric
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/import-sig/attachments/20130828/ab0e547a/attachment-0001.html>