[Import-sig] Re: Long-awaited imputil comments

Wed, 16 Feb 2000 12:27:14 -0500

> Excellent! Thanx for taking the time and providing the great feedback.

And thanks for the replies.  Quick responses:

> > At the same time, I think it's far from
> > ready for prime time in the distribution.
> 
> Understood. I think some of your concerns are based on its historical
> cruftiness, rather than where it "should" be. I'll prep a new version with
> the backwards compat pulled out. People that relied on the old version can
> continue to use their older version (which is Public Domain, so they're
> quite free to do so :-)

OK, I'm awaiting a new imputil announcement.  (You should really have
a webpage for it pointing both to the old and the new version.)

> >         The hook mechanism for 1.6 hasn't been designed yet; what
> >         should it be?
> 
> Part of this depends on the policy that you would like to proscribe. There
> are two issues that I can think of:
> 
> 1) are import hooks per-builtin-namespace, or per-interpreter? The former
>    is the current model, and may be necessary for rexec types of
>    functionality. The latter would shift the hook to some functions in the
>    sys module, much like the profiler/trace hooks.

Yes, per-builtin-namespace is necessary because of rexec -- the system
must translate an import statement into a call to a hook that depends
on the builtin namespace, because that's how rexec environments (there
may be many per interpreter) are denoted.

> 2) currently, Python has a single hook, with no provisions for multiple
>    mechanisms to be used simultaneously. This was one of the primary
>    advantages of imputil and its policies/chaining -- it would allow
>    multiple import mechanisms. I believe that we want to continue the
>    current policy: the core interpreter sees a single hook function.

Yes.

>    [ and Standard Operating Procedure is to install an ImportManager in
>      there, or a subclass ]

Yes.

> For a while, I had thought about the "sys" approach, but just realized
> that may be too limited for rexec types of environments (because we may
> need a couple ImportManagers to be operating within an interpreter)

Agreed.  It sounds like we're stuck with overriding
__builtin__.__import__, like before.

> BUT: we may be able to design an RExecImportManager that remembers
> restricted modules and uses different behavior when it determines the
> import context is one of those modules. I haven't thought on this to
> determine is true viability, tho...

Sounds tricky -- *all* modules imported in restricted mode must be
treated differently.

> >     def _import_hook():
> > 
> >         I think we need a hook here so that Marc-Andre can implement
> >         walk-me-up-Scotty; or Tim Peters could implement a policy that
> >         requires all imports to use the full module name, even from
> >         within the same package.
> 
> I've been thinking of something along the lines of
> _determine_import_context() returning a list of things to try. Default is
> to return something like [current-context,] + sys.path (not exactly that,
> but you get the idea). The _import_hook would then operate as a simple
> scan over that list of places to attempt an import from. MAL could
> override _determine_import_context() to return the walk-me-up, intervening
> packages. Tim could just always return sys.path (and never bother trying
> to determine the current context).

Yes.  In ni (remember ni?) we had this mechanism; it was called
"domain" (not a great name for it).  The domain was a list of packages
where relative imports were sought.  A package could set its domain by
setting a variable __domain__.  The current policy (current package,
then toplevel) corresponds to a 2-item domain: [<packagename>, ""]
<(where "" stands for the unnamed toplevel package).
Walk-me-up-Scotty corresponds to a domain containing the current
package, its parent, its grandparent, and so on, ending with "".  The
"no relative imports" policy is represented [""].

If we let __domain__ be initialized by the importer but overridden by
the package, we can do everything we need.

> >     def _determine_import_context():
> > 
> >         Possible feature: the package could set a flag here to force
> >         all imports to go to the top-level (i.e., the flag would mean
> >         "no relative imports").
> 
> Ah. Neat optimization. I'll insert some comments / prototype code to do
> this.

See above -- the package could simply set its domain to [""].

> >         Looking forward to _FilesystemImporter: I want to be able to
> >         have the *name* of a zip file (or other archive, whatever is
> >         supported) in sys.path; that seems more convenient for the
> > 	user and simplifies $PYTHONPATH processing.
> 
> Hmm. This will complicate things quite a bit, but is doable. It will also
> increase the processing time for sys.path elements.

We could cache this in a dictionary: the ImportManager can have a
cache dict mapping pathnames to importer objects, and a separate
method for coming up with an importer given a pathname that's not yet
in the cache.  The method should do a stat and/or look at the
extension to decide which importer class to use; you can register new
importer classes by registering a suffix or a Boolean function, plus a
class.  If you register a new importer class, the cache is zapped.
The cache is independent from sys.path (but maintained per
ImportManager instance) so that rearrangements of sys.path do the
right thing.  If a path is dropped from sys.path the corresponding
cache entry is simply no longer used.

> >     def import_top():
> > 
> >         This appears a hook that a base class can override; but why?
> 
> It is for use by clients of the Importer. In particular, by the
> ImportManager. It is not intended to be overridden.

Are there any other clients of the Importer class?  Since import_top()
simply calls _import_one(), do we even need it?

> >     "PRIVATE METHODS":
> > 
> >         These aren't really private to the class; some are used from
> > 	the ImportManager class.
> 
> Private to the system, then :-)

Then use a different word -- "private" has a well-defined meaning for
C++ and Java programmers.  To me, "internal" sounds better.

> >         I note that none of these use any state of the class (it
> >         doesn''t *have* any state) and they are essentially an
> >         implementation of the (cumbersome) package import policy.
> >         I wonder if the whole package import policy shouldn't be
> >         implemented in the ImportManager class -- or even in a
> > 	separate policy class.
> 
> Nope. They do use a piece of state: self. Subclasses may add state and
> they need to refer to that state via self. We store Importer instances
> into modules as __importer__ and then use it later. The example Importers
> (FuncImporter, PackageImporter, DirectoryImporter, and PathImporter) all
> use instance variables. _FilesystemImporter definitely requires it.
> 
> Once we locate the Importer responsible for a package and its contained
> modules, then we pass off control to that Importer. It is then responsible
> for completing the import (including the notions of a Python package). The
> completion of the import can be moved out of Importer, and we would
> replace "self" by the Importer in question.
> 
> I attempted to minimize code in ImportManager because a typical execution
> will only have a single instance of it -- there is no opportunity to
> implement different policies and mechanisms. By shifting as much as
> possible out to the Importer class, there is more opportunity for altering
> behavior.

But the importer is the wrong place to change the policy globally.
The example is walk-me-up-Scotty: to implement that (or, more
generally, to implement the __domain__ hook) without editing
imputil.py, Marc-Andre would have to have to subclass all the
importers that are used.  If the policy was embodied in the
ImportManager class, he could subclass the ImportManager, install it
instead of the default one, but continue to use the existing importers
(e.g. to import from zip files).

> >         I'm still worried about get_code() returning the code object
> >         (or even a module, in the case of extensions).  This means
> >         that something like freeze might have to re-implement the
> >         whole module searching -- first, it might want the source code
> >         instead of the code object, and second, it might not want the
> >         extension to be loaded at all!
> 
> You're asking for something that is impossible in practice. Consider the
> following code fragment:
> 
> ------------------------
> import my_imp_tools
> my_imp_tools.HTTPImporter("http://www.lyra.org/greg/python/")
> 
> import qp_xml
> ------------------------
> 
> There is no way that you can create a freeze tool that will be able to
> manage this kind of scenario.
> 
> I've frozen software for three different, shipping products. In all cases,
> I had to custom-build a freeze mechanism. Each time, there would be a list
> of "root" scripts, a set of false-positives to ignore, a set of missed
> modules to force inclusion, and a set of binary dependencies that couldn't
> be determined otherwise. I think the desire for a fully-automated module
> finder and freezer isn't fulfillable.

I know you can't do it fully automatically.  But I still want to be
able to reuse as much of existing importer and importmanager classes
as possible.  Currently, Tools/freeze/modulefinder.py contains a lot
of code that reimplements the entire package resolution mechanism.
It should really be able to reuse all that -- so that e.g. the
addition of a __domain__ feature doesn't require changes both to
imputil.py and to modulefinder.py.

> That said: I wouldn't be opposed to adding a get_source() method to an
> Importer. If the source is available, then it can return it (it may not be
> available in an archive!).

How would you invoke it?  I would need to have a different call into
the ImportManager that would invoke get_source() rather than
get_code().  My desire is that freeze's modulefinder should be able to
instantiate another ImportManager (maybe a slight subclass) which
would find the source code for modules for it.  For this, I need to
have an API that say "in this context, what module does 'import X'
return?".  It's okay to return some kind of descriptor object that has
methods to (1) get the code, (2) get the source, (3) describe itself.
The description should return whether this is a Python or an extension
module, whether source is available, and the filename if available.
If it came from an archive, the descriptor could be archive-aware, and
have additional methods to find out the filename of the archive and
the name of the module inside the archive, as well as the type of
archive.  It it was an extension, there would be extra calls to load
the library and to initialize the module, and a way to get the actual
filename.  This way, freeze could issue reasonable errors if it found
a module but couldn't find the source.  Freeze also needs to deal
specially with extension (and differently with built-in extensions and
shared library ones).

> >         I've noticed that all implementations of get_code() start with
> >         a test whether parent is None or not, and branch to completely
> >         different code.  This suggests that we might have two separate
> >         methods???
> 
> Interesting point. I hadn't noticed that. They don't always branch to
> different points, however: consider DirectoryImporter and FuncImporter.

But those are the most trivial examples.

> Essentially, we have two forms of calls:
> 
>     get_code(None, modname, modname)     # look for a top-level module
>     get_code(parent, modname, fqname)    # look in a package for a module
> 
> imputil has been quite nice because you only had to worry about one hook,
> but separating these might be a good thing to do. My preference is to
> leave it as one hook, but let's see what others have to say...

You might provide a default implementation for get_code() that calls
either get_subcode() or get_topcode() depending on whether parent is
None; then subclasses can separately override those, or override
get_code() when it's more convenient.

I noticed there's only one call to get_code(), from _import_one().
Not sure where this leads though.

> I'll also move the Importer subclasses to a new file for placement under
> Demo/. That should trim imputil.py down quite a lot.

Except for the _FilesystemImporter class, I presume, which is needed
in normal use.

> >...
> >     def _compile():
> > 
> >         You will have to strip CRLF from the code string read from the
> >         file; this is for Unix where opening the file in text mode
> >         doesn't do that, but users have come to expect this because
> >         the current implementation explicitly strips them.
> 
> I'm not sure that I follow what you mean here. The existing code seems to
> work fine. We *append* a newline; are you suggesting stripping inside the
> codestring, at the end, ??

I mean that you have to do

  codestring = re.sub(r"\r\n", r"\n", codestring)

on the code string after reading it.  This has nothing to do with the
trailing newline.  It is needed because the tokenizer chokes on \r\n
when it finds it in a string, but not when it reads it from a file --
this has been reported a few times as a bug because it means that

  exec compile(open(fn).read(), fn, "exec")

is not completely equivalent to execfile(fn) -- the compile() may
choke if the read() returns lines ending in \r\n, as it may when a
Windows file was transplanted to a Unix system.  Again, when the
parser itself reads the file, this is dealt with correctly.  This is a
feature: many people share filesystems between Unix and Windows, and
just like most Windows compilers don't insist on the \r being there,
Unix tools shouldn't insist on it being absent.  Fixing compile() is
hard, unfortunately, hence this request for a workaround.

> >         I've recently been notified of a race condition in the code
> >         here, when two processes are writing a .pyc file and a third
> >         is reading it.  On Unix we will have to remove the .pyc file
> >         and then use os.open() with O_EXCL in order to avoid the race
> >         condition.  I don't know how to avoid it on Windows.
> 
> The O_EXCL would be on writing.

Yes of course, sorry for not clarifying that.

> It would be nice if there was a "shared"
> mode for reading. How are two writers and a reader different from a single
> writer and a single reader?

OK, I'll explain the problem.  (Cut from a mail explaining it to
Jeremy:)

| Unfortunately, there's still a race condition, involving three or more
| processes:
| 
|   A sees no .pyc file and starts writing it
| 
|   B sees an invalid .pyc file and decides to go write it later
| 
|   A finishes writing and fills in the mtime
| 
|   C sees the valid magic and mtime and decides to go read the .pyc file
| 
|   B overwrites the .pyc file, truncating it at first
| 
|   C continues to read the .pyc file, but sees a truncated file
| 
|   B finishes writing and fills in the mtime
| 
| At this point, the .pyc file is valid, but process C has probably
| crashed in the unmarshalling code.
| 
| I have devised the following solution (which may even work on
| Windows) but not yet implemented it:
| 
| when writing the .pyc file, use unlink() to remove the .pyc file and
| then use low-level open() with the proper flags to require that the
| file doesn't yet exist (O_EXC:?); then use fdopen().  If the open()
| fails, don't write (treat it the same as a failing fopen() now.)
| 
| (You'd think that you could use a temporary file, but it's hard to
| come up with a temp filename that's unique -- and if it's not unique,
| the same race condition could still happen.)

> >     def _os_bootstrap():
> > 
> >         This is ugly and contains a dangerous repetition of code from
> >         os.py.  I know why you want it but for the "real" version we
> >         need a different approach here.
> 
> I agree :-)
> 
> But short of some refactoring of os.py and the per-platform modules, this
> is the best that I could do. Importing "os" loads a lot of stuff; I wanted
> to ensure that we deferred that until we had the ImportManager and
> associated Importers in place (so the import could occur under the
> direction of imputil).

OK.  Let's table this one until we feel we know how to refactor os.py.
(Maybe a platform-specific os.py could be frozen into the
interpreter.)

> > class SuffixImporter:
> > 
> >     Why is this a class at all?  It would seem to be sufficient to
> >     have a table of functions instead of a table of instances.  These
> >     importers have no state and only one method that is always
> >     overridden.
> 
> DynLoadSuffixImporter has state. We could still deal with that, however,
> by just storing a bound method into the table.

Exactly.

> I used instances because I wasn't sure what all we might want in there. If
> we don't add any other methods or attributes to the public interface, then
> yah: we could switch to a function-based approach.

See http://c2.com/cgi-bin/wiki?YouArentGonnaNeedIt (and the
rest of this wikiweb on refactoring, patterns etc.) for why you
shouldn't plan ahead this far.

> I'll release a new imputil later this week, incorporating these changes,
> MAL's feedback, and Finn Bock's feedback.

Great!

--Guido van Rossum (home page: http://www.python.org/~guido/)