Here's a new patch. Changes include:
- pyc time stamp checking (thanks Skip and </F>!)
- better python -v output
- also works when zipimport is built dynamically
I've written a note about various aspects of the patch (pasted below) but I'm
not sure it's PEP-ready yet. Comments are more than welcome!
Just
---------------------
This note is in a way an addendum to PEP 273. I fully agree with the points and goals
of the PEP, except for the sections "Efficiency", "Directory Imports" and
"Custom Imports". However, I disagree strongly with much of the design and
implementation of PEP 273. This note presents an alternative, with a matching
implementation.
A brief history of import.c -- digging through its cvs log.
When Python was brand new, there were only builtin modules and .py files. Then
.pyc support was added and about half a year later support for dynamically
loaded C extensions was implemented. Then came frozen modules. Then Guido rewrote
much of import.c, introducing the filedescr struct {suffix, mode, type},
allowing for some level of (builtin) extension of the import mechanism. This was
just before Python 1.1 was released. Since then, the only big change has been
package support (in 1997), which added a lot of complexity. (The __import__ hook
was quietly added in 1995; it's not even mentioned in the log entry of ceval.c
r2.69, and I had to do a manual binary search to find it...)
All later import extensions were implemented using the filedescr
mechanism and/or hardcoded in find_module and load_module. This ranges from
reading byte code from Macintosh resources to Windows registry-based imports.
Every single time this involved another test in find_module() and another branch
in the load_module() switch. "This has to stop."
The PEP 273 implementation.
Obviously the PEP 273 implementation has to add *something* to import.c, but it
makes a few mistakes:
- it's badly factored (for example, it adds the full implementation of
  reading zip files to import.c).
- it adds a questionable new feature: directory caching. The original
  author claimed this was needed for zip imports, but instead of solving
  the problem locally for zip files the feature is added to the builtin
  import mechanism as a whole. Although this causes some speedup
  (especially for network file system imports), this is bad, for several
  reasons:
  - it's not strictly *needed* for builtin import
  - it's not a frequent feature request from users (as far as I know)
  - it makes import.c even more complicated than it already is (note that
    I say "complicated", not "complex")
  - it changes semantics: if a module is added to the file system *after*
    the directory contents have been cached, it will not be found. This
    might only be a real issue for an IDE that runs code inside the IDE
    process, but still.
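
To make the semantics change concrete, here is a minimal sketch of the stale-cache effect. It is not taken from either patch: make_directory_cache() and cached_lookup() are hypothetical helpers that only mimic "list the directory once, then answer from the snapshot".

```python
import os
import tempfile

def make_directory_cache(directory):
    """Snapshot the directory listing once, as a directory cache would."""
    return set(os.listdir(directory))

def cached_lookup(cache, module_name):
    """Answer 'is there a .py file for this module?' from the snapshot only."""
    return module_name + ".py" in cache

def demo():
    with tempfile.TemporaryDirectory() as d:
        cache = make_directory_cache(d)
        # A module is added *after* the listing was cached.
        with open(os.path.join(d, "late.py"), "w") as f:
            f.write("x = 1\n")
        on_disk = os.path.exists(os.path.join(d, "late.py"))
        in_cache = cached_lookup(cache, "late")
        return on_disk, in_cache
```

demo() reports that the file exists on disk while the cached lookup still misses it, which is exactly the situation an in-process IDE could run into.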
A different approach.
An __import__ hook is close to useless for the problem at hand, as it needs to
reimplement much of import.c from scratch. This can be witnessed in Guido's old
ihooks.py, Greg Stein's imputils.py and Gordon McMillan's iu.py, each of which
is fairly complex, and not compatible with any of the others. So we either have to add
just another import type to import.c for zip archives, or we can add a more
general import hook. Let's assume for a moment we want to do the *former*.
The most important goal is for zip file names on sys.path and PYTHONPATH to
"just work" -- as if a zip archive is just another directory. So when traversing
sys.path, each item must be checked for zip-file-ness, and if it is, the zip
file's file index needs to be read so we can determine whether the module being
imported is in there.
I went for an OO approach, and represent a zip file with an instance of the
zipimporter class. Obviously it's quite expensive to read the zip file index
again and again, so we have to maintain a cache of zipimporter objects. The most
Pythonic approach would be to use a dict, using the sys.path item as the key.
This cache could be private to the zip import mechanism, but it makes sense to
also cache the fact that a sys.path item is *not* a zip file. A simple solution
is to map such a path item to None. By now it makes more sense to have this
cache available in import.c.
The zipimporter protocol.
The zipimporter's constructor takes one argument: a path to a zip archive. It
will raise an exception if the file is not found or if it's not a zip file.
The import mechanism works in two steps: 1) find the module, 2) if found, load
the module. The zipimporter object follows this pattern; it has two methods:
    find_module(name):
        Returns None if the module wasn't found, or the
        zipimporter object itself if it was.

    load_module(fullname):
        Load the module, return it (or propagate an exception).
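
The two-step protocol can be illustrated with a toy object that has the same two methods. This is only a sketch: DictImporter and import_with() are hypothetical names, the modules come from an in-memory dict rather than a zip file, and bookkeeping such as sys.modules is deliberately left out.

```python
import types

class DictImporter:
    """Toy importer serving modules from a {name: source} dict (hypothetical)."""
    def __init__(self, sources):
        self.sources = sources

    def find_module(self, name):
        # Return the importer itself if it can load the module, else None.
        return self if name in self.sources else None

    def load_module(self, fullname):
        # Load the module and return it; any exception simply propagates.
        module = types.ModuleType(fullname)
        exec(self.sources[fullname], module.__dict__)
        return module

def import_with(importer, name):
    """Drive the protocol by hand: step 1 find, step 2 load."""
    loader = importer.find_module(name)
    if loader is None:
        raise ImportError("no module named %s" % name)
    return loader.load_module(name)
```

For example, import_with(DictImporter({"spam": "x = 42\n"}), "spam") returns a module object whose x attribute is 42.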
The main path traversing loop in import.c will then look like this (omitting the
caching mechanics for brevity):
    def find_module(name, path):
        if isbuiltin(name):
            return builtin_filedescr
        if isfrozen(name):
            return frozen_filedescr
        if path is None:
            path = sys.path
        for p in path:
            try:
                v = zipimporter(p)
            except ZipImportError:
                pass
            else:
                w = v.find_module(name)
                if w is not None:
                    return w
            ...handle builtin file system import...
Packages.
Paths to subdirectories of the zip archive must also work, on sys.path for one,
but most importantly for pkg.__path__. For example: "Archive.zip/distutils/".
Such a path will most likely be added *after* "Archive.zip" has been read (after
all, the parent package is *always* loaded before any submodules), so all I need
to do is strip the sub path, and look up the bare .zip path in the cache. A
*new* zipimporter instance is then created, which references the same (internal,
but visible) file directory info as the "bare" zipimporter object. A .prefix
attribute contains the sub path:
    >>> from zipimport import zipimporter
    >>> z = zipimporter("Archive.zip/distutils")
    >>> z.archive
    'Archive.zip'
    >>> z.prefix
    'distutils/'
    >>>
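
The "strip the sub path, look up the bare path in the cache" step might look roughly like this. It is a sketch only: split_archive_path() is a hypothetical helper, known_archives stands in for the cache of bare archive paths, and "/" is assumed as the separator.

```python
def split_archive_path(path, known_archives):
    """Return (archive, prefix) for a path that points into a known archive.

    known_archives stands in for the cache of bare archive paths.
    """
    parts = path.rstrip("/").split("/")
    # Strip trailing components until the bare archive path is found.
    for i in range(len(parts), 0, -1):
        archive = "/".join(parts[:i])
        if archive in known_archives:
            sub = "/".join(parts[i:])
            prefix = sub + "/" if sub else ""
            return archive, prefix
    raise ImportError("not inside a known archive: %r" % (path,))
```

With {"Archive.zip"} as the set of known archives, "Archive.zip/distutils" splits into the same archive and prefix values shown in the session above.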
Beyond zipimport.
So there we are, zipimport works, with just a relatively minor impact on
import.c. The obvious next question is: what about other import types, whether
future needs for the core, or third party needs? It turns out the above approach
is *easily* generalized to handle *arbitrary* path-based import hooks. Instead
of just checking for zip-ness, it can check a list of candidates (again, caching
cruft omitted):
    def find_module(name, path):
        if isbuiltin(name):
            return builtin_filedescr
        if isfrozen(name):
            return frozen_filedescr
        if path is None:
            path = sys.path
        for p in path:
            v = None
            for hook in sys.import_hooks:
                try:
                    v = hook(p)
                except ImportError:
                    pass
                else:
                    break
            if v is not None:
                w = v.find_module(name)
                if w is not None:
                    return w
            ...handle builtin file system import...
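
Here is a runnable toy version of that loop, this time including the caching that was left out above. Everything in it is illustrative: PrefixImporter and its "toy:" path items are made up, and the module-level import_hooks list and path_importers dict merely stand in for the proposed sys attributes.

```python
class PrefixImporter:
    """Hypothetical importer handling path items of the form 'toy:<names>'."""
    def __init__(self, path_item):
        if not path_item.startswith("toy:"):
            raise ImportError("not a toy path item")
        self.names = set(path_item[4:].split(","))

    def find_module(self, name):
        return self if name in self.names else None

import_hooks = [PrefixImporter]  # stands in for the proposed sys.import_hooks
path_importers = {}              # path item -> importer, or None if no hook wants it

def importer_for(path_item):
    """Ask each hook in turn; cache the result, including the 'nobody' case."""
    if path_item not in path_importers:
        importer = None
        for hook in import_hooks:
            try:
                importer = hook(path_item)
            except ImportError:
                continue
            break
        path_importers[path_item] = importer
    return path_importers[path_item]

def find_module(name, path):
    for p in path:
        importer = importer_for(p)
        if importer is not None:
            found = importer.find_module(name)
            if found is not None:
                return found
        # the real loop would handle builtin file system import for p here
    return None
```

A hook is just a callable that either returns an importer for a path item or raises ImportError, so a class such as zipimporter can serve as its own hook.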
Now, one tiny step further, and we have something that fairly closely mimics
Gordon McMillan's iu.py. That tiny step is what Gordon calls the "metapath". It
works like this:
    def find_module(name, path):
        for v in sys.meta_path:
            w = v.find_module(name, path)
            if w is not None:
                return w
        # fall through to builtin behavior
        if isbuiltin(name):
            return builtin_filedescr
        [ rest same as above ]
An item on sys.meta_path can override *anything*, and does not need an item on
sys.path to get invoked. The find_module() method of such an object has an extra
argument: path. It is None or parent.__path__. If it's None, sys.path is
implied.
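
As an illustration of the extra path argument, here is a small sketch of a meta_path-style object. OverrideFinder is a hypothetical class, and the meta_path list and fallback callable are passed in explicitly so the example does not touch the real import machinery.

```python
class OverrideFinder:
    """Hypothetical meta_path item that claims a fixed set of module names."""
    def __init__(self, names):
        self.names = set(names)
        self.seen_paths = []

    def find_module(self, name, path=None):
        # path is None for a top-level import, else the parent's __path__.
        self.seen_paths.append(path)
        return self if name in self.names else None

def find_module(name, path, meta_path, fallback):
    """Consult every meta_path item first, then fall through to the fallback."""
    for finder in meta_path:
        found = finder.find_module(name, path)
        if found is not None:
            return found
    return fallback(name, path)
```

Because the meta_path items are consulted before anything else, such an object gets first refusal on every import, whether or not a matching item is on sys.path.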
The Patch.
I've modified import.c to support all of the above. Even the path handler cache
is exposed: sys.path_importers. (This is what Gordon calls "shadowpath"; I'm not
happy with either name, I'm open to suggestions.) The sys.meta_path addition
isn't strictly necessary, but it's a useful feature and I think generalizes and
exposes the import mechanism to the maximum of what is possible with the current
state of import.c.
The protocol details are open to discussion. They are partly based on what's
relatively easily doable in import.c. Other than that I've tried to follow
common sense as to what is practical for writing import hooks.
The patch is not yet complete, especially regarding integration with the imp
module: you can't currently use the imp module to invoke any import hook. I have
some ideas on how to do this, but I'd like to focus on the basics first. Also:
the reload() function is currently not supported. This will be easy to fix
later.
I *thought* about allowing objects on sys.path (which would then work as an
importer object) but for now I've not done it as sys.meta_path makes it somewhat
redundant. It would be easy to do, though: it would add another 10 lines or so
to import.c.
I've tested the zipimport module both as a builtin and as a shared lib: it
works for me in both configurations. But when it is built dynamically, it _has_
to be available on sys.path *before* site.py is run. When running from the build
dir on unix: add the appropriate build/lib.* dir to your PYTHONPATH and it
should work.