[Python-Dev] PEP 273: Import Modules from Zip Archives

Guido van Rossum guido@python.org
Fri, 26 Oct 2001 16:34:15 -0400


> The PEP for zip import is 273.  Please take a look and comment.
> 
>         http://python.sourceforge.net/peps/pep-0273.html

OK, I'll shoot.  But I expect Gordon McMillan and Greg Stein to
provide more useful feedback.

|     Currently, sys.path is a list of directory names as strings.  If
|     this PEP is implemented, an item of sys.path can be a string
|     naming a zip file archive.  The zip archive can contain a
|     subdirectory structure to support package imports.  The zip
|     archive satisfies imports exactly as a subdirectory would.

I like this.

|     The implementation is in C code in the Python core and works on
|     all supported Python platforms.

This is good too, as it provides a bootstrap.  OTOH I also would like
to see a prototype in Python, using either ihooks or imputil.

|     Any files may be present in the zip archive, but only files *.pyc,
|     *.pyo and __init__.py[co] are available for import.  Zip import of
|     *.py and dynamic modules (*.pyd, *.so) is disallowed.
| 
|     Just as sys.path currently has default directory names, default
|     zip archive names are added too.  Otherwise there is no way to
|     import all Python library files from an archive.

More bootstrap goodness.

|     Reading compressed zip archives requires the zlib module.  An
|     import of zlib will be attempted prior to any other imports.  If
|     zlib is not available at that time, only uncompressed archives
|     will be readable, even if zlib subsequently becomes available.

Hm, I wonder if we couldn't just link with the libz.a C library and
use the C interface, if you're implementing this in C anyway.

| Subdirectory Equivalence
| 
|     The zip archive must be treated exactly as a subdirectory tree so
|     we can support package imports based on current and future rules.
|     Zip archive files must be created with relative path names.  That
|     is, archive file names are of the form: file1, file2, dir1/file3,
|     dir2/dir3/file4.
| 
|     Suppose sys.path contains "/A/B/SubDir" and "/C/D/E/Archive.zip",
|     and we are trying to import modfoo from the Q package.  Then
|     import.c will generate a list of paths and extensions and will
|     look for the file.  The list of generated paths does not change
|     for zip imports.

(Very clever.)

|                       Suppose import.c generates the path
|     "/A/B/SubDir/Q/R/modfoo.pyc".  Then it will also generate the path
|     "/C/D/E/Archive.zip/Q/R/modfoo.pyc".  Finding the SubDir path is
|     exactly equivalent to finding "Q/R/modfoo.pyc" in the archive.

Nice.
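
To make the path equivalence concrete, here is a minimal sketch in Python.  It is only an illustration, not the PEP's C code; the name split_zip_path and the idea of passing in the list of known archives are my assumptions.

```python
def split_zip_path(path, archives):
    """Split a generated import path into (archive, member).

    Returns None if the path does not lie inside any known archive.
    The function name and signature are hypothetical.
    """
    for archive in archives:
        prefix = archive + "/"
        if path.startswith(prefix):
            # Everything after "Archive.zip/" is the name inside the archive.
            return archive, path[len(prefix):]
    return None


archives = ["/C/D/E/Archive.zip"]
inside = split_zip_path("/C/D/E/Archive.zip/Q/R/modfoo.pyc", archives)
outside = split_zip_path("/A/B/SubDir/Q/R/modfoo.pyc", archives)
```

Here `inside` holds the archive name and the member name "Q/R/modfoo.pyc", while `outside` is None because that path names an ordinary directory.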

|     Suppose you zip up /A/B/SubDir/* and all its subdirectories.  Then
|     your zip file will satisfy imports just as your subdirectory did.
| 
|     Well, not quite.  You can't satisfy dynamic modules from a zip
|     file.  Dynamic modules have extensions like .dll, .pyd, and .so.
|     They are operating system dependent, and probably can't be loaded
|     except from a file.  It might be possible to extract the dynamic
|     module from the zip file, write it to a plain file and load it.
|     But that would mean creating temporary files, and dealing with all
|     the dynload_*.c, and that's probably not a good idea.

Agreed.

|     You also can't import source files *.py from a zip archive.  The
|     problem here is what to do with the compiled files.  Python would
|     normally write these to the same directory as *.py, but surely we
|     don't want to write to the zip file.  We could write to the
|     directory of the zip archive, but that would clutter it up, not
|     good if it is /usr/bin for example.  We could just fail to write
|     the compiled files, but that makes zip imports very slow, and the
|     user would probably not figure out what is wrong.  It is probably
|     best for users to put *.pyc into zip archives in the first place,
|     and this PEP enforces that rule.

I agree.  But it would still be good if the .py files were also in the
zip file, so the source can be used in tracebacks etc.  A C API to get
a source line from a filename might be a good idea (plus a Python API).
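
A rough sketch of what the Python side of such an API might look like, using the zipfile module.  The name get_source_line is made up for illustration, and returning an empty string for a missing file or line is an assumption borrowed from linecache.getline.

```python
import io
import zipfile


def get_source_line(zf, member, lineno):
    """Return line `lineno` (1-based) of the source file `member` in `zf`.

    Returns "" if the member or the line does not exist.  Hypothetical
    helper, not an existing API.
    """
    try:
        data = zf.read(member)
    except KeyError:
        # ZipFile.read raises KeyError when the member is absent.
        return ""
    lines = data.decode("utf-8", errors="replace").splitlines()
    if 1 <= lineno <= len(lines):
        return lines[lineno - 1]
    return ""


# Usage: an in-memory archive that ships the .py next to the bytecode.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Q/modfoo.py", "import os\nx = 1\n")
with zipfile.ZipFile(buf) as zf:
    second_line = get_source_line(zf, "Q/modfoo.py", 2)
    missing = get_source_line(zf, "Q/other.py", 1)
```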

|     So the only imports zip archives support are *.pyc and *.pyo, plus
|     the import of __init__.py[co] for packages, and the search of the
|     subdirectory structure for the same.

I wonder if we need to make an additional rule that allows a .pyc file
to satisfy a module request even if we're in optimized mode (where
normally only .pyo files are searched).  Otherwise, if someone ships a
zipfile with only .pyc files, their modules can't be imported at all
when python -O is used.
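
The extra rule could look roughly like this.  It is a sketch under my own assumptions: find_bytecode is a hypothetical name, and the archive contents are represented as a plain set of member names.

```python
def find_bytecode(member_names, module_path, optimize):
    """Return the archive member that satisfies `module_path`, or None.

    In optimized mode, prefer .pyo but fall back to .pyc, so an archive
    that ships only .pyc files stays importable under python -O.
    """
    suffixes = [".pyo", ".pyc"] if optimize else [".pyc"]
    for suffix in suffixes:
        name = module_path + suffix
        if name in member_names:
            return name
    return None


only_pyc = {"Q/R/modfoo.pyc"}
found = find_bytecode(only_pyc, "Q/R/modfoo", optimize=True)
```

With only a .pyc file in the archive, `found` is "Q/R/modfoo.pyc" even though optimize is true.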


| Efficiency
| 
|     The only way to find files in a zip archive is linear search.

But there's an index record at the end that provides quick access.

|                                                                    So
|     for each zip file in sys.path, we search for its names once, and
|     put the names plus other relevant data into a static Python
|     dictionary.  The key is the archive name from sys.path joined with
|     the file name (including any subdirectories) within the archive.
|     This is exactly the name generated by import.c, and makes lookup
|     easy.
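
As an illustration of such a dictionary, here is a minimal Python sketch built on the zipfile module, which reads the index record at the end of the archive.  The function name build_zip_index and the choice of stored fields are assumptions, not what the C code does.

```python
import io
import zipfile


def build_zip_index(archive_name, fileobj=None):
    """Map 'archive_name/member' to (header_offset, compress_type, file_size).

    zipfile reads the central directory at the end of the archive, so the
    member data itself is never scanned.  Hypothetical helper.
    """
    source = fileobj if fileobj is not None else archive_name
    index = {}
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries are not importable files
            key = archive_name + "/" + info.filename
            index[key] = (info.header_offset, info.compress_type, info.file_size)
    return index


# Usage: build a small archive in memory and index it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Q/R/modfoo.pyc", b"fake bytecode")
index = build_zip_index("/C/D/E/Archive.zip", buf)
```

The keys are then exactly the path strings generated for lookup, such as "/C/D/E/Archive.zip/Q/R/modfoo.pyc".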

We could do this kind of pre-scanning for regular directories on
sys.path too.  I found out very long ago (around '93 or '94) that this
saves a *lot* of startup time; I presume it still does.  (And even
more if the info can be cached in a file.)  The only problem is how to
detect when the cache becomes out of date.  Of course, you could say
"if you want faster startup time, put all your files in a zip
archive", and I couldn't really argue with that. :-)
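
For what it's worth, a sketch of directory pre-scanning with one possible staleness check.  Using the directory's modification time to detect an outdated cache is purely my assumption here; it can miss changes on file systems with coarse timestamps, and listdir_cached is a made-up name.

```python
import os

# path -> (directory mtime in ns, frozenset of entry names)
_dir_cache = {}


def listdir_cached(path):
    """Return the names in `path`, rescanning only if its mtime changed."""
    mtime = os.stat(path).st_mtime_ns
    cached = _dir_cache.get(path)
    if cached is None or cached[0] != mtime:
        # Cache miss or stale entry: scan the directory once and remember it.
        cached = (mtime, frozenset(os.listdir(path)))
        _dir_cache[path] = cached
    return cached[1]


def has_file(directory, filename):
    """Answer 'does this file exist?' from the cached listing."""
    return filename in listdir_cached(directory)
```

Each import probe then costs one stat of the directory plus a set lookup, instead of one failed open per candidate file name.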


| zlib
| 
|     Compressed zip archives require zlib for decompression.  Prior to
|     any other imports, we attempt an import of zlib, and set a flag if
|     it is available.  All compressed files are invisible unless this
|     flag is true.

Do we get a "module not found" error or something better, like
"compressed module found as <filename> but zlib unavailable"?

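A sketch of the friendlier diagnostic, in Python rather than C.  The function name check_readable and the zlib_available flag are assumptions for illustration; only the message text comes from the question above.

```python
import zipfile


def check_readable(info, zlib_available):
    """Raise ImportError with a specific message for an unreadable member.

    `info` is a zipfile.ZipInfo; `zlib_available` stands in for the flag
    set at startup.  Hypothetical helper.
    """
    if info.compress_type == zipfile.ZIP_DEFLATED and not zlib_available:
        raise ImportError(
            "compressed module found as %s but zlib unavailable" % info.filename
        )


# Usage: a deflated member with the zlib flag off.
info = zipfile.ZipInfo("Q/modfoo.pyc")
info.compress_type = zipfile.ZIP_DEFLATED
try:
    check_readable(info, zlib_available=False)
    message = None
except ImportError as exc:
    message = str(exc)
```

This requires that compressed members stay visible in the index even when zlib is missing, instead of being hidden as the PEP proposes.
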
|     It could happen that zlib was available later.  For example, the
|     import of site.py might add the correct directory to sys.path so a
|     dynamic load succeeds.  But compressed files will still be
|     invisible.  It is unknown if it can happen that importing site.py
|     can cause zlib to appear, so maybe we're worrying about nothing.
|     On Windows and Linux, the early import of zlib succeeds without
|     site.py.

Yes, site.py isn't needed to make standard library modules available;
it's intended to make non-standard library modules available. :-)

|     The problem here is the confusion caused by the reverse.  Either a
|     zip file satisfies imports or it doesn't.  It is silly to say that
|     site.py needs to be uncompressed, and that maybe imports will
|     succeed later.  If you don't like this, create uncompressed zip
|     archives or make sure zlib is available, for example, as a
|     built-in module.  Or we can write special search logic during zip
|     initialization.

I don't think we need anything special here.  site.py shouldn't be
needed.


| Booting
| 
|     Python imports site.py itself, and this imports os, nt, ntpath,
|     stat, and UserDict.  It also imports sitecustomize.py which may
|     import more modules.  Zip imports must be available before site.py
|     is imported.
| 
|     Just as there are default directories in sys.path, there must be
|     one or more default zip archives too.
| 
|     The problem is what the name should be.  The name should be linked
|     with the Python version, so the Python executable can correctly
|     find its corresponding libraries even when there are multiple
|     Python versions on the same machine.
| 
|     This PEP suggests a zip archive name equal to the Python
|     interpreter path with extension ".zip" (eg, /usr/bin/python.zip)
|     which is always prepended to sys.path.  So a directory with python
|     and python.zip is complete.  This would work fine on Windows, as
|     it is common to put supporting files in the directory of the
|     executable.  But it may offend Unix fans, who dislike bin
|     directories being used for libraries.  It might be fine to
|     generate different defaults for Windows and Unix if necessary, but
|     the code will be in C, and there is no sense getting complicated.


Well, this is the domain of getpath.c, and that's got a different
implementation for Unix and Windows anyway (Windows has PC/getpathp.c).
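
In Python terms, the per-platform default might be computed roughly as follows.  This is only a sketch: default_zip_name is a made-up name, and stripping a trailing ".exe" on Windows is my assumption, since the PEP only gives the /usr/bin/python.zip example.

```python
import os


def default_zip_name(executable):
    """Return the default archive name for a given interpreter path."""
    root, ext = os.path.splitext(executable)
    if ext.lower() == ".exe":
        # Assumed Windows behavior: python.exe -> python.zip
        return root + ".zip"
    # Unix example from the PEP: /usr/bin/python -> /usr/bin/python.zip
    return executable + ".zip"


unix_name = default_zip_name("/usr/bin/python")
```

Only ".exe" is treated as an extension, so a versioned name such as /usr/bin/python2.2 keeps its version number and simply gains ".zip".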


| Implementation
| 
|     A C implementation exists which works, but which can be made better.

Upload as a patch please?

--Guido van Rossum (home page: http://www.python.org/~guido/)