[Import-sig] Re: Long-awaited imputil comments

Wed, 16 Feb 2000 06:01:32 -0800 (PST)

On Sun, 13 Feb 2000, Guido van Rossum wrote:
> I've finally got to a full close-reading of imputil.py. There sure is
> a lot of good stuff there.

Excellent! Thanx for taking the time and providing the great feedback.

> At the same time, I think it's far from
> ready for prime time in the distribution.

Understood. I think some of your concerns are based on its historical
cruftiness, rather than where it "should" be. I'll prep a new version with
the backwards compat pulled out. People that relied on the old version can
continue to use their older version (which is Public Domain, so they're
quite free to do so :-)

>...
> class ImportManager:
> 
>     General comment:
> 
>         I would like to use/subclass the ImportManager in rexec (in
>         order to emulate its behavior), but for that to work, I would
>         need to change all references to sys.path and sys.modules (and
>         sys.whatever) to self.whatever, but that would currently
>         require rewriting large pieces of code. It would be nice if
>         the sys module were somehow passed in.  I have a feeling the
>         same is true for all references to the os module (_os_stat
>         etc.) because the rexec module also wants to have control over
>         what pieces of the filesystem you have access to.  This
>         explains some of the complexity of ihooks (which is currently
> 	used by rexec).

All right. Shouldn't be a problem.

>     def install():
> 
>         The __chain_* code seems in transition (e.g. some functions
>         end with both raise and return)

The raise/return thingy is simply because I wanted to catch all cases
where it didn't import the module and was about to head towards the
default importer via the chain.

>         The hook mechanism for 1.6 hasn't been designed yet; what
>         should it be?

Part of this depends on the policy that you would like to proscribe. There
are two issues that I can think of:

1) are import hooks per-builtin-namespace, or per-interpreter? The former
   is the current model, and may be necessary for rexec types of
   functionality. The latter would shift the hook to some functions in the
   sys module, much like the profiler/trace hooks.

2) currently, Python has a single hook, with no provisions for multiple
   mechanisms to be used simultaneously. This was one of the primary
   advantages of imputil and its policies/chaining -- it would allow
   multiple import mechanisms. I believe that we want to continue the
   current policy: the core interpreter sees a single hook function.

   [ and Standard Operating Procedure is to install an ImportManager in
     there, or a subclass ]

For a while, I had thought about the "sys" approach, but just realized
that may be too limited for rexec types of environments (because we may
need a couple ImportManagers to be operating within an interpreter)

BUT: we may be able to design an RExecImportManager that remembers
restricted modules and uses different behavior when it determines the
import context is one of those modules. I haven't thought on this to
determine is true viability, tho...

>     def add_suffix():
> 
>         It seems the suffixes variable is only used by the
>         _FilesystemImporter. Since it is shared, calls to add_suffix()
>         will have an effect on the _FilesystemImporter instance. I
>         think it would make more sense if the suffixes table was
>         initialized and managed by the _FilesystemImporter; the
>         add_suffix method on the ImportManager could then simply pass
>         its arguments on to the _FilesystemImporter.

Agreed. ImportManager had the suffix list for a while because it needed
it. The code was shifted to _FilesystemImporter, I didn't revisit the
placement of the suffixes. I think that I was also leaving it there with
an intent that it is part of the public interface, subject to more complex
manipulations than the simple add_suffix() would allow. I believe we can
solve this latter problem, though, just by adding a get_suffixes() method
that fetches the list from fs_imp.

>     def _import_hook():
> 
>         I think we need a hook here so that Marc-Andre can implement
>         walk-me-up-Scotty; or Tim Peters could implement a policy that
>         requires all imports to use the full module name, even from
>         within the same package.

I've been thinking of something along the lines of
_determine_import_context() returning a list of things to try. Default is
to return something like [current-context,] + sys.path (not exactly that,
but you get the idea). The _import_hook would then operate as a simple
scan over that list of places to attempt an import from. MAL could
override _determine_import_context() to return the walk-me-up, intervening
packages. Tim could just always return sys.path (and never bother trying
to determine the current context).

>         top_module = sys.modules[parts[0]]:
>             There's an undocumented convention that sys.modules[x] may
>             be None, to indicate that module x doesn't exist and that
>...

Yes, I'm familiar with that mechanism, and it would be a good addition.

>...
>             Anyway, I think that your logic here doesn't expect that
>             to happen.  A fix could be to put "if not top_module:
>             raise KeyError" inside the try/except.

I've attempted to avoid it, but this recent rewrite may have introduced
things like this -- where ImportManager operates as if it the *only* thing
performing imports. It's certainly friendly when it doesn't see __ispkg__
or __importer__, but this top_module thing is a valid bug. Quite fixable.

>     def _determine_import_context():
> 
>         Possible feature: the package could set a flag here to force
>         all imports to go to the top-level (i.e., the flag would mean
>         "no relative imports").

Ah. Neat optimization. I'll insert some comments / prototype code to do
this.

>     def _import_top_module():
> 
>         Instead of type(item)..., use isinstance().

Can do.

>         Looking forward to _FilesystemImporter: I want to be able to
>         have the *name* of a zip file (or other archive, whatever is
>         supported) in sys.path; that seems more convenient for the
> 	user and simplifies $PYTHONPATH processing.

Hmm. This will complicate things quite a bit, but is doable. It will also
increase the processing time for sys.path elements.

I'll think of a design and maybe prototype something up after the current 
round of changes.

>     def _reload_hook():
> 
>         Note that reload() does NOT blast the module's dict; for
> 	better or for worse.  (Occasionally modules know this and save
> 	important global data.)

All righty.

> class Importer:
> 
>     def install():
> 
>         This should be a method on the manager.  (That makes it easier
> 	to change references to sys.path etc.; see my rexec notes above.)

This is here for backwards compat. I'll remove it.

>     def import_top():
> 
>         This appears a hook that a base class can override; but why?

It is for use by clients of the Importer. In particular, by the
ImportManager. It is not intended to be overridden.

>     "PRIVATE METHODS":
> 
>         These aren't really private to the class; some are used from
> 	the ImportManager class.

Private to the system, then :-)

>         I note that none of these use any state of the class (it
>         doesn''t *have* any state) and they are essentially an
>         implementation of the (cumbersome) package import policy.
>         I wonder if the whole package import policy shouldn't be
>         implemented in the ImportManager class -- or even in a
> 	separate policy class.

Nope. They do use a piece of state: self. Subclasses may add state and
they need to refer to that state via self. We store Importer instances
into modules as __importer__ and then use it later. The example Importers
(FuncImporter, PackageImporter, DirectoryImporter, and PathImporter) all
use instance variables. _FilesystemImporter definitely requires it.

Once we locate the Importer responsible for a package and its contained
modules, then we pass off control to that Importer. It is then responsible
for completing the import (including the notions of a Python package). The
completion of the import can be moved out of Importer, and we would
replace "self" by the Importer in question.

I attempted to minimize code in ImportManager because a typical execution
will only have a single instance of it -- there is no opportunity to
implement different policies and mechanisms. By shifting as much as
possible out to the Importer class, there is more opportunity for altering
behavior.

>     def get_code():
> 
>         On the 2/3-tuple return value: a proposal for code to be
>         included in 1.6 shouldn't be encumbered by backwards
>         compatibility issues to previous versions of the proposal.

No problem. Consider the 2-tuple form to be gone.

>         I'm still worried about get_code() returning the code object
>         (or even a module, in the case of extensions).  This means
>         that something like freeze might have to re-implement the
>         whole module searching -- first, it might want the source code
>         instead of the code object, and second, it might not want the
>         extension to be loaded at all!

You're asking for something that is impossible in practice. Consider the
following code fragment:

------------------------
import my_imp_tools
my_imp_tools.HTTPImporter("http://www.lyra.org/greg/python/")

import qp_xml
------------------------

There is no way that you can create a freeze tool that will be able to
manage this kind of scenario.

I've frozen software for three different, shipping products. In all cases,
I had to custom-build a freeze mechanism. Each time, there would be a list
of "root" scripts, a set of false-positives to ignore, a set of missed
modules to force inclusion, and a set of binary dependencies that couldn't
be determined otherwise. I think the desire for a fully-automated module
finder and freezer isn't fulfillable.

That said: I wouldn't be opposed to adding a get_source() method to an
Importer. If the source is available, then it can return it (it may not be
available in an archive!).

>         I've noticed that all implementations of get_code() start with
>         a test whether parent is None or not, and branch to completely
>         different code.  This suggests that we might have two separate
>         methods???

Interesting point. I hadn't noticed that. They don't always branch to
different points, however: consider DirectoryImporter and FuncImporter.

Essentially, we have two forms of calls:

    get_code(None, modname, modname)     # look for a top-level module
    get_code(parent, modname, fqname)    # look in a package for a module

imputil has been quite nice because you only had to worry about one hook,
but separating these might be a good thing to do. My preference is to
leave it as one hook, but let's see what others have to say...

> "Some handy stuff for the Importers":

Consider all this torched.

I'll also move the Importer subclasses to a new file for placement under
Demo/. That should trim imputil.py down quite a lot.

>...
>     def _compile():
> 
>         You will have to strip CRLF from the code string read from the
>         file; this is for Unix where opening the file in text mode
>         doesn't do that, but users have come to expect this because
>         the current implementation explicitly strips them.

I'm not sure that I follow what you mean here. The existing code seems to
work fine. We *append* a newline; are you suggesting stripping inside the
codestring, at the end, ??

>         I've recently been notified of a race condition in the code
>         here, when two processes are writing a .pyc file and a third
>         is reading it.  On Unix we will have to remove the .pyc file
>         and then use os.open() with O_EXCL in order to avoid the race
>         condition.  I don't know how to avoid it on Windows.

The O_EXCL would be on writing. It would be nice if there was a "shared"
mode for reading. How are two writers and a reader different from a single
writer and a single reader?

>     def _os_bootstrap():
> 
>         This is ugly and contains a dangerous repetition of code from
>         os.py.  I know why you want it but for the "real" version we
>         need a different approach here.

I agree :-)

But short of some refactoring of os.py and the per-platform modules, this
is the best that I could do. Importing "os" loads a lot of stuff; I wanted
to ensure that we deferred that until we had the ImportManager and
associated Importers in place (so the import could occur under the
direction of imputil).

>     def _fs_import():
> 
>         I think this is unused now?  Anyway, it hardcodes the suffix
>         sequence.

Unused. It can be ignored.

> class SuffixImporter:
> 
>     Why is this a class at all?  It would seem to be sufficient to
>     have a table of functions instead of a table of instances.  These
>     importers have no state and only one method that is always
>     overridden.

DynLoadSuffixImporter has state. We could still deal with that, however,
by just storing a bound method into the table.

I used instances because I wasn't sure what all we might want in there. If
we don't add any other methods or attributes to the public interface, then
yah: we could switch to a function-based approach.

I'll release a new imputil later this week, incorporating these changes,
MAL's feedback, and Finn Bock's feedback.

Thanx!
-g

-- 
Greg Stein, http://www.lyra.org/