[Patches] [ python-Patches-652586 ] New import hooks + Import from Zip files

Mon, 23 Dec 2002 14:31:36 -0800

Patches item #652586, was opened at 2002-12-12 11:34
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=652586&group_id=5470

Category: Core (C code)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Just van Rossum (jvr)
Assigned to: Nobody/Anonymous (nobody)
Summary: New import hooks + Import from Zip files

Initial Comment:
This patch implements two things:
- a new set of import hooks, modelled after iu.py
- builtin support for imports from Zip archives (a
competing implementation for PEP 273)

The new set of hooks probably need a better document
explaining them (perhaps a PEP). My motivations have
been posted to python-dev.

Here's a brief description.

Three new objects are added to the sys module:
- path_hooks
- path_importer_cache
- meta_path

sys.path_hooks is a list of callable objects that take
a string as their only argument. A hook will be called
with a sys.path or pkg.__path__ item. It should return
an "importer" object (see below), or raise ImportError
or return None if it can't deal with the path item. By
default, sys.path_hooks only contains the zipimporter
type, if the zipimport module is available.

sys.path_importer_cache is a dict that caches the
results of sys.path_hooks to avoid repeated hook lookups.

sys.meta_path is a list of importer objects that are
invoked *before* the builtin import mechanism kicks in.
This allows overriding of builtin module and frozen
module import, but the main feature is that it allows
importer objects *without* a corresponding sys.path
item (just like builtin and frozen modules).

Importer objects must conform to the following protocol:

i.find_module(fullname) -> None or an importer object
i.load_module(fullname) -> the imported module (or
raise ImportError)

The 'fullname' is always the fully qualified module
name, ie. a dotted name for a submodule.

This patch adds one more feature: a sys.path item may
*itself* be an importer object. This is convenient for
experimentation, but using it may break third-party
code that assumes sys.path contains only strings.

----------------------------------------------------------------------

>Comment By: Just van Rossum (jvr)
Date: 2002-12-23 23:31

Message:
Logged In: YES 
user_id=92689

Several changes/fixes.

general:
- incorporated patch from Paul Moore that adds a default zip
archive path to sys.path on Windows (it was already there on
unix). Thanks Paul!

import.c:
- matches latest version of PEP 302 (rev 1.3), regarding the
new 'path' argument of importer.find_module(fullname, path=None)

zipimporter.c
- removed the subdir feature, which allowed the path to the
zip archive to be extended with a subdirectory. PEP 273
stated this was needed for package support (and only for
that). However, with the new import hooks this is no longer
true: a path item containing the plain zip archive path can
also deal with submodules (find_module receives the full
module name after all). Therefore a pkg.__path__ from a
package loaded from a zip archive will contain the *plain*
zip archive path.
- as a consequence I could simplify and clean up lots of
things (esp. zipimporter_init: eg. it no longer needs to
check sys.path_importer_cache; yay). Getting rid of the
zipimporter.prefix attribute altogether helped a lot in
other places.
- this change additionally enabled me to get rid of the
restriction that zip paths must end in .ZIP or .zip; any
extension (or even no extension) will now work.
- implemented all the (optional) extensions of the Importer
Protocol that the latest version of PEP 302 defines.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-19 22:29

Message:
Logged In: YES 
user_id=92689

I added the default sys.path zip file item for unix builds,
as PEP 273 prescribes. I've not done the refactoring that
the PEP 273 implementation does (I really don't see why),
but this means I can't use Jim Ahlstrom's PC/getpathp.c
patch as is. I don't develop on Windows, so I wouldn't be
able to test. Paul, do you think you can provide a minimal
patch for PC/getpathp.c that adds the right thing to the
default path? No rush.

Other changes:
- zipimporter.get_data() now raises IOError instead of
ZipImportError if the "file" wasn't found.
- removed the (dubious) mechanism that would *not* add a
None value on sys.path_importer_cache for path items that
have no hook and are not directories. Needed for the default
zip item change: if the default zipfile isn't available it
would otherwise keep trying to find a hook for it. This
change also means that if you install a hook on
sys.path_hooks and you want that hook to handle items that
are *already* on sys.path, you must clear
sys.path_importer_cache.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-18 15:54

Message:
Logged In: YES 
user_id=92689

Std zip entry on sys.path: agreed, I've been meaning to look
at that for a while now...

Refactoring: would be nice (but needs a third item:
PathImporter), but it's quite a bit of extra work *and*
makes it harder (I think) to keep
imp.find_module/load_module compatible. I will have a closer
look, but I tend to think this is a better project for 2.4.
I currently prefer working on a PEP, if only to increase the
chance my work hasn't been a waste of time ;-)

Btw. zi.get_data() currently raises ZipImporterError if a
file is not found. IOError is probably better. Will also
consider list_dir() & list_modules(). (Hm, I really dislike
those underscores there, but I also don't want to remove
them from find/load_module...)

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-12-18 14:31

Message:
Logged In: YES 
user_id=6380

OTOH I will certainly be able to work around the lacking
stat()ability; but I *need* to access the data.

I'm occupied with urgent Zope Corp work this week, so I
don't have time to review more of this code, alas. But I
recommend that Just incorporate the changes that Jim
Ahlstrom suggested (from Jim's patch) to allow the standard
library to be a zip archive (adding a standard zip file on
the path).

If after that you still have time and energy, I'd like the
search for builtins and for frozen modules to be turned into
meta-importers (or whatever you call them) that are placed
on the meta-importers hook list. That would be a nice
refactoring of import.c!

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-18 10:47

Message:
Logged In: YES 
user_id=92689

I *totally* agree that get_data() should be an optional part
of the protocol. I consider it an experimental addition
right now.

Quick thoughts:
__importer__ should be set to the object with the
load_module() method.
Yes, load_module() is responsible for setting it. It has to
be, as it is load_module() that actually runs the code, and
__import__ must be available to the running code.

PEP: Yes. I started on it last night, doing some work on it
as we speak.

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2002-12-18 10:26

Message:
Logged In: YES 
user_id=113328

Be aware that this doesn't solve all of the issues (although it 
does solve a large class of them). Some applications require 
a real filename (wxPython did until recently for image data) 
and some need to stat the file (Guido quoted Zope).

At the very least it should be made clear that this is an 
*optional* part of the interface. Application code should 
always be prepared for an importer not to support get_data (if 
only by saying it can't work from a data store managed by 
this sort of hook).

My preference is for keeping the importer interface minimal 
and clean. At the moment 2 functions (find_module and 
get_module) are all that is required - and these can be in 
separate classes. Which class should have the get_data 
method? Which class gets assigned to __importer__? Is it 
the hook's responsibility to set __importer__ (yet another 
responsibility)?

I think it's time for a PEP. Not for zipimporter, but purely for 
the import hook interface. That would be the place to argue 
such issues as I raised in the paragraph above. We need to 
remain clear on what the API requires vs what zipimporter 
does, to give writers of new hooks a decent chance.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-17 23:28

Message:
Logged In: YES 
user_id=92689

Fresh patch.
Added zipimporter.get_data(path) method: returns the
(uncompressed) data for a file as a string.

Implemented Wiktor Sadowski's suggestion: a module loaded
with zipimport will have an __importer__ attribute, which is
set to the zipimporter instance that did the load. Possible
quirk: for a package __init__.py file, this is the
zipimporter that handles the directory containing the
package dir, so __importer__.get_data(filename) looks for
filename in that directory, not the package directory. An
__importer__ of an actual submodule *does* look in the
package directory itself.

(Btw. I did this without adding a new
PyImport_ExecCodeModuleEx2 API.)

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-16 19:41

Message:
Logged In: YES 
user_id=92689

Uploaded new patch; no code changes, but a new test script:
test_importhooks.py (uses unittest). Rewrote
test_zipimport.py to also use unittest. test_importhooks.py
contains some examples of trivial importer objects as well
as an importer wrapper for imp.find_module/imp.load_module
calls. Also contains some notes about the role and
constraints of pkg.__path__.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-15 15:28

Message:
Logged In: YES 
user_id=92689

Uploaded a new version. Many changes, but most are fairly
small. Addressed a fairly long list of comments from Guido.

General:
- lots more comments, especially for functions.
- wrapped long lines
- site.py: doesn't check for existence in the file system at
all anymore

zipimport.c:
- fixed a really dumb bug that prevented zipimporter objects
be found in the cache; this improved performance a bit in
the presence of packages. (On my box, zipimports are now
always faster than file system imports, but the difference
is much better for uncompressed zip files.)
- removed zipimporter_new (PyType_GenericNew is all we need)
- replaced all PyMapping_* calls with PyDict_* calls
- reworked module/package finding code: now really doesn't
depend on dir entries to be present in the archive.
Basically went back to how ahlstromjc did it, which makes
much more sense to me now.
- fixed some refcount bugs
- fixed a potential buffer overrun
- reduced the number of fseek() calls somewhat (XXX there's
an unused and unfinished replacement for read_directory() in
zipimport.c that reads the entire directory into memory in
one gulp. This avoids more fseek() calls and avoids many
PyMarshal_Read* calls. Nevertheless I only managed to shave
off about 10% of dir-reading time, which I think makes it
not worth the hassle. If you think this may have more effect
on your platform: please play!)

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-14 23:12

Message:
Logged In: YES 
user_id=92689

If you install a hook on sys.path_hooks and want it to
affect sys.path items that may already have been used you'd
need to clear sys.path_importer_cache. Say if you install a
directory caching hook after site.py has been imported.
Tricky, but possibly not uncommon. But at the very least the
test script needs it <wink>.

I think it's on a similar level as sys.modules: you normally
don't need it, except when you're mucking with custom import
hooks... (I know, that's not an entirely fair comparison,
but still.)

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2002-12-14 22:52

Message:
Logged In: YES 
user_id=113328

Thanks for the explanation of find_module. That sounds very 
useful.

On a separate note, why do you expose 
sys.path_importer_cache? I can't imagine any reason why 
the user would ever touch it manually (other than for 
interactive experimentation).

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-14 21:46

Message:
Logged In: YES 
user_id=92689

Btw. it would be nicer if I indeed provided an "importer"
wrapper for the builtin mechanism and simply placed it on
and invoke it through sys.meta_path. The reason I haven't
done that yet is that A) this would change import.c even
more (although it would be better factored then) and B) it
would have _some_ impact on the performance of the builtin
import mechanism, as *all* imports would then pass through
yet another Python call. But I hope to play with this a bit
later next week. It could also be a project for 2.4.

Note that this is exactly what Gordon does in iu.py, except
he goes even further: he has separate importer objects on
the metapath for builtin modules, frozen modules and file
system imports. I think that's fantastic, but it would be
more controversial as it might affect performance and might
be harder to integrate with the current imp module.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-14 21:17

Message:
Logged In: YES 
user_id=92689

Paul, this is the reason why the find_module() method
returns an object (and I guess now is a good time to explain
this <wink>; it also answers a question from Guido he asked
in private mail). zipimporter returns self, but you don't
have to do that: you can return anything with a
load_module() method. So all state you gather during
find_module() you can simply pack in a different object and
return that. Something like this

class MyHook:
    ...
    def find_module(self, fullname):
         info = self._find_module(fullname)
         if info is None:
             return None
         return Loader(info)

class Loader:
    def __init__(self, info):
        self.info = info
    def load_module(self, fullname):
        ...load module using self.info...

A good showcase for this pattern would be a wrapper for the
builtin import mechanism: it could stuff the (file,
filename, (suffix, mode, type)) tuple that imp.find_module()
returns in a loader object, whose load_module() method would
simply call imp.load_module() with that stuff. (I think such
a wrapper is needed in import.c if we want to properly
expose the new hooks in the imp module.)

You can see the importer protocol as a further abstraction
(and OO-ification) of the current
imp.find_module()/imp.load_module() calls.

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2002-12-14 20:39

Message:
Logged In: YES 
user_id=113328

I have some concerns about a design issue.

I've been playing about writing my own hooks in Python, and I 
have found that for most of my tests, the find_module method 
is simply not useful. I understand that it can be cheaper 
(sometimes a lot cheaper) to just check for the existence of a 
module than actually loading up the code, etc, etc. So in 
those cases, find_module() offers a shortcut if the module 
isn't found.

But if the module *is* found, and find_module has to do 
significant work, then all this work has to be repeated at 
load_module time. There's no mechanism for the result of 
find_module (say, a pointer to the module in a zip directory, 
or whatever) to be passed to load_module. So load_module 
has no alternative than to redo the work (short of the hook 
implementing some form of internal cache mechanism, which 
could be quite messy to do).

I don't know what the likely uses of hooks will be. So I've no 
idea if find_module will turn out in practice to be a cheap 
operation. But I can certainly imagine cases (the 
hypthetical "load a module from a URL" hook) where 
find_module could be far from cheap.

Is it possible to let find_module be optional? If the hook 
implements it, let it be used as a quick check for existence. If 
the hook doesn't implement it, just go for it - call load_module 
and be prepared for a "can't find it" response.

This probably means that more extensive changes to import.c 
will be needed.  But once we expose the design, we're going 
to be stuck with backward compatibility issues forever, so it's 
important we get it right *now*, rather than later. (Although 
having find_module mandatory initially, but made optional 
later, is probably less problematic than most other forms of 
design change...)

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-13 23:44

Message:
Logged In: YES 
user_id=92689

The previous version requires it and failed if they weren't,
the present version still requires it, but missing entries
are added in the read_directory() function (around line
542). So it's guaranteed they're there if needed.

I could have solved it differently at import time, but I
thought it be better to do slightly more work at
zipdirectory-read time so we have to do less work later. I
doubt it matters much in reality, this was just the simplest
way to solve the problem...

(I'll update the test to reflect the solution of this problem.)

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2002-12-13 22:20

Message:
Logged In: YES 
user_id=113328

I'd say it's pretty much guaranteed *not* to be a good idea. I 
am fairly certain that at least some Windows zip utilities fail 
to have a separate zip entry for the directory itself.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-12-13 21:02

Message:
Logged In: YES 
user_id=6380

Just: it looks like your latest version *requires* that the
entries for package directories are present. Is that really
a good idea?

----------------------------------------------------------------------

Comment By: James C. Ahlstrom (ahlstromjc)
Date: 2002-12-13 20:55

Message:
Logged In: YES 
user_id=64929

The PEP 273 addition of the standard zip archive name
/usr/local/lib/python23.zip (or Windows name) to sys.path to
support import of the Python standard lib (including
site.py) from a zip archive is missing.  That was the point
of my patch to site.py, getpath.c, getpathp.c, main.c,
pythonrun.c, and sysmodule.c.  I believe those patches can
be used with this import.c if you want the feature.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-13 20:29

Message:
Logged In: YES 
user_id=92689

I've attached a new version that fixes a problem that Thomas
Heller noticed: package imports failed when the zip archive
didn't contain a separate entry for the package directory.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-12-13 18:52

Message:
Logged In: YES 
user_id=6380

Thanks -- I'm reviewing this now (if not too many
distracting email arrives :-).

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-13 00:19

Message:
Logged In: YES 
user_id=92689

Yet another new version:
- zipimport is a builtin module now; includes patches for
PC\config.c and PBbuild\pythoncore.dsp, contributed and
tested by Paul Moore.
- tweaked doc strings in zipimport.
I'll quit for today...

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 22:59

Message:
Logged In: YES 
user_id=92689

I've uploaded a slightly updated version: the potential zlib
recursion has been fixed (I was in fact able to trigger it);
I've added a test for it, although it's tricky to do. Also
cleaned up a few exception messages in zipimport.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 20:34

Message:
Logged In: YES 
user_id=92689

I've uploaded a fresh version, with GC support added. Tested
to the extent that it doesn't crash ;-)

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 17:34

Message:
Logged In: YES 
user_id=92689

I didn't know, thanks! I actually did find -N but it "didn't
work" ;-)

I've attached a new patch (slightly updated as well) as one
file. Also included is Lib/test/output/test_zipimport.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2002-12-12 15:28

Message:
Logged In: YES 
user_id=33168

Just, in case you didn't know, you can do a cvs diff -N to
include new/removed files in a patch.  I think you need to
do a cvs add/remove before -N works though.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 15:05

Message:
Logged In: YES 
user_id=92689

I've attached test_zipimport.py, a simple test script, to be
placed in Lib/test/.

----------------------------------------------------------------------

Comment By: Martin v. L�wis (loewis)
Date: 2002-12-12 14:16

Message:
Logged In: YES 
user_id=21627

recursive imports: having foo.zip as the first thing in sys.path, 
and having a compressed zlib.py in there was indeed the 
case I was thinking of (without actually trying).

GC: It is absolutely necessary. If this is not done now, 
somebody will spend days of research five years from now, 
because somebody else thought that invoking .update on this 
files attribute was a good idea. This could be reduced to C-
level documentation if the dictionary was not exposed to 
Python.

builtin: I think it ought to be builtin. It's a small module, it 
does not rely on additional libraries, it is not maintained 
externally, and it reduces bootstrap dependencies to have it 
builtin. OTOH, zlib can't be builtin, as it relies on an 
additional library which may not always be present.

get_long: you are right; the 0xff does make the values 
positive, again. I somehow thought the result might still be 
negative.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 13:39

Message:
Logged In: YES 
user_id=92689

Hm, regarding gc again: zipimporter objects can only
*theoretically* be involved in cycles, and only if people
muck with the "files" attribute. As it is, the "files" dict
only contains strings (keys) and tuples (values) which
contain strings and ints. So is it really neccesary?

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 13:35

Message:
Logged In: YES 
user_id=92689

Martin:
- Have you actually seen a recursive import of zlib? I don't
see how it's possible to cause a stackoverflow. Hm, maybe if
someone stuffs a zlib.py in a zip archive? I'll add a guard.
- Are you really saying it *should* be a builtin module, or
was that comment supposed to start with "if"? See also
Paul's remark about zlib: if zlib remains a shared lib,
having zipimport as a builtin only helps for non-compressed
archives.
- gc: I'll give this a go (never done it before!)
- get_long() -> it's called with a signed char *, and the
buf[i] & 0xff should take care of the rest, so I'm not sure
I understand. But trust your judgement and changed it in my
working copy.

Thank you very much for the quick reply!

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2002-12-12 12:23

Message:
Logged In: YES 
user_id=113328

I can provide details on how to patch the Windows build 
process - it's not hard. The question of whether to have 
zipimport as builtin or dynamic should be resolved first (the 
processes are different in the 2 cases).

It should be pointed out that if zipimport is builtin, it still relies 
on a dynamic module (zlib) for importing from compressed 
zips. I don't know whether this affects the decision, though...

----------------------------------------------------------------------

Comment By: Martin v. L�wis (loewis)
Date: 2002-12-12 12:15

Message:
Logged In: YES 
user_id=21627

Some comments:
- zipimport should normally be built as a builtin module, there
  should also be a patch for the Windows build procedure
- zipimporters need to support garbage collection, as they 
can occur in cycles
- there should be a mechanism to prevent a stack overflow in 
case of recursive import of zlib.
- get_long needs to accept an unsigned char*

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2002-12-12 11:37

Message:
Logged In: YES 
user_id=92689

btw. the attached file contains a patch for various files as
well as a new file: zipimporter.c. Place the latter in the
Modules/ directory.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=652586&group_id=5470