
On Thu, 2 Dec 1999, Guido van Rossum wrote:
... Sometime, Greg Stein wrote: ...
On Thu, 18 Nov 1999, Guido van Rossum wrote: ...
Agreed. I like some of imputil's features, but I think the API needs to be redesigned.
In what ways? It sounds like you've applied some thought. Do you have any concrete ideas yet, or is it "just a feeling"? :-) I'm working through some changes from JimA right now, and would welcome other suggestions. I think there may be some outstanding stuff from MAL, but I'm not sure (Marc?)
I actually think that the way the PVM (Python VM) calls the importer ought to be changed. Assigning to __builtin__.__import__ is a crock. The API for __import__ is a crock.
Something like sys.set_import_hook()? The other alternative that I see would be to have the C code scan sys.importers, assuming each entry is a callable object, and call them with the appropriate params (e.g. the module name). Of course, to move this scanning into Python would require something like sys.set_import_hook(), unless Python looks for a hard-coded module and entrypoint.
...
Which APIs are you referring to? The "imp" module? The C functions? The __import__ and reload builtins?
I'm guessing some of imp, the two builtins, and only one or two C functions.
All of those.
We can provide Python code to provide compatibility for "imp" and the two hooks. Nothing we can do to the C code, though. I'm not sure what the import API looks like from C, and whether they could all stay. A brief glance looks like most could stay. [ removing any would change Python's API version, which might be "okay" ]
...
- load .py/.pyc/.pyo files and shared libraries from files
No problem. Again, a function is needed for platform-specific loading of shared libraries.
Is it useful to expose the platform differences? The current imp.load_dynamic() should suffice.
This comes up several times throughout this message, and in some off-list mail Guido and I have exchanged. Namely, "should dynamic loading be part of the core, or performed via a module?"

I would rather see it become a module, rather than inside the core (despite the fact that the module would have to be compiled into the interpreter). I believe this provides more flexibility for people looking to replace/augment/update/fix dynamic loading on various architectures. Rather than changing the core, a person can just drop in another module. The isolation between the core and modules is nicer, aesthetically, to me.

The modules would also be exposing Just Another Importer Function, rather than a specialized API in the builtin imp module. Also note that it is easier to keep a module *out* of a Python-based application, than it is to yank functions out of the core of Python. Frozen apps, embedded apps, etc could easily leave out dynamic loading.

Are there strict advantages? Not any that I can think of right now (beyond a bit of ease-of-use mentioned above). It just feels better to me.
...
- sys.path and sys.modules should still exist; sys.path might have a slightly different meaning
I would suggest that both retain their *exact* meaning. We introduce sys.importers -- a list of importers to check, in sequence. The first importer on that list uses sys.path to look for and load modules. The second importer loads builtins and frozen code (i.e. modules not on sys.path).
This is looking like the redesign I was looking for. (Note that imputil's current chaining is not good since it's impossible to remove or reorder importers, which I think is a required feature; an explicit list would solve this.)
The chaining is an aspect of the current, singular import hook that Python uses. In the past, I've suggested the installation of a "manager" that maintains a list. sys.importers is similar in practice. Note that this Manager would be present with the sys.set_import_hook() scheme, while the Manager is implied if the core scans sys.importers.
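To make the Manager idea concrete, here is a minimal sketch (modern Python, names invented -- `manager_import` and the callable-importer protocol are my assumptions, not imputil's actual API) of a single installed hook scanning sys.importers in order:

```python
import sys

# Hypothetical sketch: the "Manager" is the one hook the core installs;
# it scans sys.importers in sequence.  Each importer here is a callable
# taking a module name and returning a module object or None.

def manager_import(name):
    # Honor the module cache first, as the real machinery does.
    if name in sys.modules:
        return sys.modules[name]
    for importer in getattr(sys, 'importers', []):
        module = importer(name)
        if module is not None:
            sys.modules[name] = module
            return module
    raise ImportError('no importer could load %r' % name)
```

The point of the explicit list is that removing or reordering importers is just list manipulation, which the chained-hook scheme cannot offer.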
Actually, the order is the other way around, but by now you should know that. It makes sense to have separate ones for builtin and frozen modules -- these have nothing in common.
Yes, JimA pointed this out. The latest imputil has corrected this. I combined the builtin and frozen Importers because they were just so similar. I didn't want to iterate over two Importers when a single one sufficed quite well. *shrug* Could go either way, really.
There's another issue, which isn't directly addressed by imputil, although with clever use of inheritance it might be doable. I'd like more support for this however. Quite orthogonally to the issue of having separate importers, I might want to recognize new extensions.
Correct: while imputil doesn't address this, the standard/default Importer classes *definitely* can.
... the directory/directories with .isl files are placed.) This requires an ugly modification to the _fs_import() function. (Which should have been a method, by the way, to make overriding it in a subclass of PathImporter easier!)
I yanked that code out of the DirectoryImporter so that the PathImporter could use it. I could see a reorg that creates a FileSystemImporter that defines the method, and the other two just subclass from that.
I've been thinking here along the lines of a strategy where the standard importer (the one that walks sys.path) has a set of hooks that define various things it could look for, e.g. .py files, .pyc files, .so or .dll files. This list of hooks could be changed to support looking for .isl files.
Agreed. It should be easy to have a mapping of extension to handler. One issue: should there be an ordering to the extensions? Exercise for the reader to alter the data structures...
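A rough sketch of what such a mapping could look like -- an ordered list of (suffix, handler) pairs, so the precedence of extensions is explicit. The handler functions here are stand-ins for illustration, not real imputil code:

```python
# Hypothetical sketch: an ordered suffix-to-handler table.  Real handlers
# would map a file to a code object; these just return a description.

def handle_py(path):
    return 'compile source: ' + path

def handle_pyc(path):
    return 'load bytecode: ' + path

suffix_handlers = [
    ('.py', handle_py),
    ('.pyc', handle_pyc),
]

def find_handler(filename):
    # First matching suffix wins, so list order answers the
    # "should there be an ordering?" question directly.
    for suffix, handler in suffix_handlers:
        if filename.endswith(suffix):
            return handler
    return None
```

New extensions (say, .isl) would then be a matter of inserting a pair at the desired position.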
There's an old, subtle issue that could be solved through this as well: whether or not a .pyc file without a .py file should be accepted or not. Long ago (in Python 0.9.8) a .pyc file alone would never be loaded. This was changed at the request of a small but vocal minority of Python developers who wanted to distribute .pyc files without .py files. It has occasionally caused frustration because sometimes developers move .py files around but forget to remove the .pyc files, and then the .pyc file is silently picked up if it occurs on sys.path earlier than where the .py was moved to.
I think, "too bad for them." :-) Having just a .pyc is a very nice feature. But how can you tell whether it was meant to be a plain .pyc or a mis-ordered one? To truly resolve that, you would need to scan the whole path, looking for a .py. However, maybe somebody put the .pyc there on purpose, to override the .py!

--- begin slightly-off-topic ---

Here is a neat little Bash script that allows you to use a .pyc as a CGI (to avoid parse overhead). Normally, you can't just drop a .pyc into the cgi-bin directory because the OS doesn't know how to execute it. Not a problem, I say... just append your .pyc to the following Bash script and execute! :-)

#!/bin/bash
exec - 3< $0 ; exec python -c 'import os,marshal ; f = os.fdopen(3, "rb") ; f.readline() ; f.readline() ; f.seek(8, 1) ; _c = marshal.load(f) ; del os, marshal, f ; exec _c' $@

(the script should be two lines; and no... you can't use readlines(2))

The above script will preserve stdin, stdout, and stderr. If the caller also uses 3< ... well, that got overridden :-) The script doesn't work on Windows for two reasons, though: 1) Bash, 2) the "rb" mode followed by readline(). Detailed info at the bottom of http://www.lyra.org/greg/python/

--- end of off-topic ---
Having a set of hooks for various extensions would make it possible to have a default where lone .pyc files are ignored, but where one can insert a .pyc importer in the list of hooks that does the right thing here. (Of course, it may be possible that this whole feature of lone .pyc files should be replaced since the same need is easily taken care of by zip importers.
Maybe. I'd still like to see plain .pyc files, but I know I can work around any change you might make here :-) (i.e. whatever you'd like to do... go for it)
I also want to support (Jim A notwithstanding :-) a feature whereby different things besides directories can live on sys.path, as long as they are strings -- these could be added from the PYTHONPATH env variable. Every piece of code that I've ever seen that uses sys.path doesn't care if a directory named in sys.path doesn't exist -- it may try to stat various files in it, which also don't exist, and as far as it is concerned that is just an indication that the requested module doesn't live there.
I'm not in favor of this, but it is more-than-doable. Again: your discretion...
Again, we would have to dissect imputil to support various hooks that deal with different kind of entities in sys.path. The default hook list would consist of a single item that interprets the name as a directory name; other hooks could support zip files or URLs. Jack's "magic cookies" could also be supported nicely through such a mechanism.
Specifically, the PathImporter would get "dissected" :-). No problem.
Users can insert/append new importers or alter sys.path as before.
sys.modules continues to record name:module mappings.
Yes.
Note that the interpretation of __file__ could be problematic. To what value do you set __file__ for a module loaded from a zip archive?
You don't (certainly in a way that is nice/compatible for modules that refer to it). This is why I don't like __file__ and __path__. They just don't make sense in archives or frozen code. Python code that relies on them will create problems when that code is placed into different packaging mechanisms.
...
(I wouldn't mind a splitting up of importdl.c into several platform-specific files, one of which is chosen by the configure script; but that's a bit of a separate issue.)
Easy enough. The standard importer can select the appropriate platform-specific module/function to perform the load. i.e. these can move to Modules/ and be split into a module-per-platform.
Again: what's the advantage of exposing the platform specificity?
See above.
... Probably more support is required from the other end: once it's common for modules to be imported from zip files, the distutil code needs to support the creation and installation of such zip files. Also, there is a need for the install phase of distutil to communicate the location of the zip file to the Python installation.
I'm quite confident that something can be designed that would satisfy the needs here. Something akin to .pth files that a zip importer could read.
...
- Standard import from zip or jar files, in two ways:
(1) an entry on sys.path can be a zip/jar file instead of a directory; its contents will be searched for modules or packages
Note that this is what I mention above for distutil support.
While this could easily be done, I might argue against it. Old apps/modules that process sys.path might get confused.
Above I argued that this shouldn't be a problem.
For most code, no, but as Fred mentioned (and I surmise), there are things out there assuming that sys.path contains strings which specify directories. Sure, we can do this (your discretion), but my feeling is to avoid it.
If compatibility is not an issue, then "No problem."
An alternative would be an Importer instance added to sys.importers that is configured for a specific archive (in other words, don't add the zip file to sys.path, add ZipImporter(file) to sys.importers).
This would be harder for distutil: where does Python get the initial list of importers?
Default is just the two: BuiltinImporter and PathImporter. Adding ZipImporters (or anything else) at startup is TBD, but shouldn't pose a problem.
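For illustration, the startup arrangement being described might look like the following. None of these class names are a real API -- they just follow the discussion, with `ZipImporter` and the list construction entirely hypothetical:

```python
# Hypothetical sketch of the default configuration plus a per-installation
# archive importer (added e.g. by site.py, instead of putting the archive
# on sys.path).

class BuiltinImporter:
    """Loads builtin (and frozen) modules."""

class PathImporter:
    """Walks sys.path, which keeps its existing meaning."""

class ZipImporter:
    """Configured for one specific archive."""
    def __init__(self, archive):
        self.archive = archive

importers = [BuiltinImporter(), PathImporter()]              # the default pair
importers.append(ZipImporter('/usr/lib/python/stdlib.zip'))  # added at startup
```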
...
(2) a file in a directory that's on sys.path can be a zip/jar file; its contents will be considered as a package (note that this is different from (1)!)
No problem. This will slow things down, as a stat() for *.zip and/or *.jar must be done, in addition to *.py, *.pyc, and *.pyo.
Fine, this is where the caching comes in handy.
IFF caching is enabled for the particular platform and installation.
...
The Importer class is already designed for subclassing (and its interface is very narrow, which means delegation is also *very* easy; see imputil.FuncImporter).
But maybe it's *too* narrow; some of the hooks I suggest above seem to require extra interfaces -- at least in some of the subclasses of the Importer base class.
Correct -- the *subclasses*. I still maintain the imputil design of a single hook (get_code) is Right. I'll make a swipe at PathImporter in the next few weeks to add the capability for new extensions.
Note: I looked at the doc string for get_code() and I don't understand what the difference is between the modname and fqname arguments. If I write "import foo.bar", what are modname and fqname? Why are both present? Also, while you claim that the API is narrow, the multiple return values (also the different types for the second item) make it complicated.
Gordon detailed this in another note... Yes, the multiple return values make it a bit more complicated, but I can't think of any reasonable alternatives. A bit more doc should do the trick, I'd guess.
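As an illustration of the intent (my reading of it, not a quote from the imputil docs): for "import foo.bar" the importer is consulted once per dotted component, with modname being the single trailing component and fqname the fully qualified name so far (a parent package object, or None, accompanies them in the real signature). A small sketch of the pairs an importer would see:

```python
# Illustrative only -- invented helper, not imputil code.

def calls_for(dotted_name):
    # Yield the (modname, fqname) pairs, one per dotted component.
    parts = dotted_name.split('.')
    for i, part in enumerate(parts):
        yield part, '.'.join(parts[:i + 1])
```

So "import foo.bar" produces ('foo', 'foo') and then ('bar', 'foo.bar'); both arguments are needed because the importer searches for the single component but records the module under its full name.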
...
- a hook to auto-generate .py files from other filename extensions (as currently implemented by ILU)
No problem at all.
See above -- I think this should be more integrated with sys.path than you are thinking of. The more I think about it, the more I see that the problem is that for you, the importer that uses sys.path is a final subclass of Importer (i.e. it is itself not further subclassed). Several of the hooks I want seem to require additional hooks in the PathImporter rather than new importers.
Correct -- I've currently designed/implemented PathImporter as "final". I don't foresee a problem turning it into something that can be hooked at run-time, or subclassed at code-time. A detailing of the features needed would be handy:

* allow alternative file suffixes, with functions or subclasses to map the file into a code/module object.
...
- Note that different kinds of hooks should (ideally, and within reason) properly combine, as follows: if I write a hook to recognize .spam files and automatically translate them into .py files, and you write a hook to support a new archive format, then if both hooks are installed together, it should be possible to find a .spam file in an archive and do the right thing, without any extra action. Right?
Ack. Very, very difficult.
Actually, I take most of this back. Importers that deal with new extension types often have to go through a file system to transform their data to .py files, and this is just too complicated. However, it would still be nice if there was code sharing between the code that looks for .py and .pyc files in a zip archive and the code that does the same in a filesystem. Hm, maybe even that shouldn't be necessary; the zip file probably should contain only .pyc files...
Gordon replies to this... All of the archives that myself, Gordon, and JimA have been using only store .pyc files. I don't see much code sharing between the filesystem and archive import code.
...
All is not lost, however. I can easily envision the get_code() hook as allowing any kind of return type. If it isn't a code or module object, then another hook is called to transform it. [ actually, I'd design it similarly: a *series* of hooks would be called until somebody transforms the foo.spam into a code/module object. ]
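A sketch of that "series of hooks" idea: the raw get_code() result is fed through transformers until one produces a final object. `FinalModule` and `run_transforms` are invented stand-ins for real code/module objects and the dispatch logic:

```python
# Invented names throughout -- a stand-in demonstration of chaining
# transform hooks over a get_code() result.

class FinalModule:
    """Stand-in for a finished code/module object."""
    def __init__(self, source):
        self.source = source

def run_transforms(result, hooks):
    # Feed the result through each hook in series; stop as soon as one
    # of them produces a final object.
    for hook in hooks:
        result = hook(result)
        if isinstance(result, FinalModule):
            return result
    raise ImportError('no hook could transform the import result')
```

A .spam-to-.py translator would simply be one more hook in the series, upstream of whatever finally builds the module.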
OK. This could be a feature of a subclass of Importer.
That would be my preference, rather than loading more into the Importer base class itself.
...
- It should be possible to write hooks in C/C++ as well as Python
Use FuncImporter to delegate to an extension module.
Maybe not so great, since it sounds like the C code can't benefit from any of the infrastructure that imputil offers. I'm not sure about this one though.
There isn't any infrastructure that needs to be accessed. get_code() is the call-point, and there is no mechanism provided to the callee to call back into the imputil system.
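The delegation pattern is small enough to sketch outright. This mirrors the FuncImporter idea rather than reproducing imputil's actual class: the wrapped callable (which could just as well live in a C extension module) receives the full get_code() arguments and owes nothing back to the framework:

```python
# Sketch of FuncImporter-style delegation; not the real imputil class.

class FuncImporter:
    def __init__(self, func):
        self.func = func

    def get_code(self, parent, modname, fqname):
        # All the work is forwarded to the wrapped function; there is no
        # callback into the imputil system for it to worry about.
        return self.func(parent, modname, fqname)
```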
This is one of the benefits of imputil's single/narrow interface.
Plus its vague specs? :-)
Ouch. I thought I was actually doing quite a bit better than normal with that long doc-string on get_code :-(
...
For a restricted execution app, it might install an Importer that loads files from *one* directory only which is configured from a specific Win32 Registry entry. That importer could also refuse to load shared modules. The BuiltinImporter would still be present (although the app would certainly omit all but the necessary builtins from the build). Frozen modules could be excluded.
Actually there's little reason to exclude frozen modules or any .py/.pyc modules -- by definition, bytecode can't be dangerous. It's the builtins and extensions that need to be censored.
We currently do this by subclassing ihooks, where we mask the test for builtins with a comparison to a predefined list of names.
True. My concern is an invader misusing one "type" of module for another. For example, let's say you've provided a selection of modules each exporting function FOO, and the user can configure which module to use. Can they do damage if some unrelated, frozen module also exports FOO? Minor issue, anyhow. All the functionality is there.
...
I posited once before that the cost of import is mostly I/O rather than CPU, so using Python should not be an issue. MAL demonstrated that a good design for the Importer classes is also required. Based on this, I'm a *strong* advocate of moving as much as possible into Python (to get Python's ease-of-coding with little relative cost).
Agreed. However, how do you explain the slowdown (from 9 to 13 seconds I recall) though? Are you a lousy coder? :-)
Heh :-) I have not spent *any* time working on optimization. Currently, each Importer in the chain redoes some work of the prior Importer. A bit of restructuring would split the common work out to a Manager, which then calls a method in the Importer (and passes all the computed work). Of course, a bit of profiling wouldn't hurt either. Some of the "imp" interfaces could possibly be refined to better support the BuiltinImporter or the dynamic load features. The question is still valid, though -- at the moment, I can't explain it because I haven't looked into it.
The (core) C code should be able to search a path for a module and import it. It does not require dynamic loading or packages. This will be used to import exceptions.py, then imputil.py, then site.py.
Note: after writing this, I realized there is really no need for the core to do the imputil import. site.py can easily do that.
It does, however, need to import builtin modules. imputil currently
Correct.
imports imp, sys, strop and __builtin__, struct and marshal; note that struct can easily be a dynamic loadable module, and so could strop in theory. (Note that strop will be unnecessary in 1.6 if you use string methods.)
I knew about strop, but imputil would be harder to use today if it relied on the string methods. So... I've delayed that change. The struct module is used in a couple teeny cases, dealing with constructing a network-order, 4-byte, binary integer value. It would be easy enough to just do that with a bit of Python code instead.
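For example, the struct usage could be replaced with a few lines of plain Python (modern syntax shown here; the function names are mine) that pack and unpack a network-order, 4-byte integer:

```python
# Sketch of doing the struct module's job by hand for this one case:
# a 4-byte big-endian ("network order") integer.

def pack_net_long(value):
    return bytes([(value >> 24) & 0xFF,
                  (value >> 16) & 0xFF,
                  (value >> 8) & 0xFF,
                  value & 0xFF])

def unpack_net_long(data):
    return (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]
```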
I don't think that this chicken-or-egg problem is particularly problematic though.
Right. In my ideal world, the core couldn't do a dynamic load, so that would need to be considered within the bootstrap process.
...
site.py can complete the bootstrap by setting up sys.importers with the appropriate Importer instances (this is where an application can define its own policy). sys.path was initially set by the import.c bootstrap code (from the compiled-in path and environment variables).
I think that algorithm (currently in getpath.c / getpathp.c) might also be moved to Python code -- imported frozen. Sadly, rebuilding with a new version of a frozen module might be more complicated than rebuilding with a new version of a C module, but writing and maintaining this code in Python would be *sooooooo* much easier that I think it's worth it.
I think we can find a better way to freeze modules and to use them. Especially for the cases where we have specific "core" functions implemented in Python. (e.g. freezing parsers, compilers, and/or the read-eval loop) I don't foresee an issue that the build process becomes more complicated. If we nuke "makesetup" in favor of a Python script, then we could create a stub Python executable which runs the build script which writes the Setup file and the getpath*.c file(s).
Note that imputil.py would not install any hooks when it is loaded. That is up to site.py. This implies the core C code will import a total of three modules using its builtin system. After that, the imputil mechanism would be importing everything (site.py would .install() an Importer which then takes over the __import__ hook).
(Three not counting the builtin modules.)
Correct, although I'll modify my statement to "two plus the builtins".
Further note that the "import" Python statement could be simplified to use only the hook. However, this would require the core importer to inject some module names into the imputil module's namespace (since it couldn't use an import statement until a hook was installed). While this simplification is "neat", it complicates the run-time system (the import statement is broken until a hook is installed).
Same chicken-or-egg. We can be pragmatic.
For a developer, I'd like a bit of robustness (all this makes it rather hard to debug a broken imputil, and that's a fair amount of code!).
True. I threw that out as an alternative, and then presented the counter argument :-)
...
Therefore, the core C code must also support importing builtins. "sys" and "imp" are needed by imputil to bootstrap.
The core importer should not need to deal with dynamic-load modules.
Same question. Since that all has to be coded in C anyway, why not?
It simplifies the core's import code to not deal with that stuff at all.
To support frozen apps, the core importer would need to support loading the three modules as frozen modules.
I'd like to see a description of how someone like Jim A would build a single-file application using the new mechanism. This could completely replace freeze. (Freeze currently requires a C compiler; that's bad.)
The portable mechanism for freezing will always need a compiler. Platform specific mechanisms (e.g. append to the .EXE, or use the linker to create a new ELF section) can optimize the freeze process in different ways. I don't have a design in my head for the freeze issues -- I've been considering that the mechanism would remain about the same. However, I can easily see that different platforms may want to use different freeze processes... hmm...
...
Yes. I don't see this as a requirement, though. We wouldn't start to use these by default, would we? Or insist on zlib being present? I see this as more along the lines of "we have provided a standardized Importer to do this, *provided* you have zlib support."
Agreed. Zlib support is easy to get, but there are probably platforms where it's not. (E.g. maybe the Mac? I suppose that on the Mac, there would be some importer classes to import from a resource fork.)
Exactly. And importer classes to load from Win32 resources (modifying a .EXE's resources post-link is cleaner than the append solution)
...
My outline above does not freeze anything. Everything resides in the filesystem. The C code merely needs a path-scanning loop and functions to import .py*, builtin, and frozen types of modules.
Good. Though I think there's also a need for freezing everything. And when we go the route of the zip archive, the zip archive handling code needs to be somewhere -- frozen seems to be a reasonable choice.
Sure.
If somebody nukes their imputil.py or site.py, then they return to Python 1.4 behavior where the core interpreter uses a path for importing (i.e. no packages). They lose dynamically-loaded module support.
But if the path guessing is also done by site.py (as I propose) the path will probably be wrong. A warning should be printed.
All right. Doesn't Python already print a warning if it can't find site.py?
Let's first complete the requirements gathering. Are these requirements reasonable? Will they make an implementation too complex? Am I missing anything?
I'm not a fan of the compositing due to it requiring a change to semantics that I believe are very useful and very clean. However, I outlined a possible, clean solution to do that (a secondary set of hooks for transforming get_code() return values).
As you may see from my responses, I'm a big fan of having several different sets of hooks.
Yes. However, I've only recognized one so far. Propose more... I'm confident we can update the PathImporter design to accommodate (and retain the underlying imputil paradigm).
I do withdraw the composition requirement though.
:-)
...
Once you hit site.py, you have a "full" environment and can easily detect and import a read-eval-print loop module (i.e. why return to Python? just start things up right there).
You mean "why return to C?" I agree. It would be cool if somehow
Heh. Yah, that's what I meant :-)
IDLE and Pythonwin would also be bootstrapped using the same mechanisms. (This would also solve the question "which interactive environment am I using?" that some modules and apps want to see answered, because they need to do things differently when run under IDLE, for example.)
Haven't thought on this. Should be doable, I'd think.
site.py can also install new optimizers as desired, a new Python-based parser or compiler, or whatever... If Python is built without a parser or compiler (I hope that's an option!), then the three startup modules would simply be frozen into the executable.
More power to hooks!
:-) You betcha!

I believe my next order of business:

* update PathImporter with the file-extension hook
* dynload C code reorg, per the other email
* create new-model site.py and trash import.c
* review freeze mechanisms and process
* design mechanism for frozen core functionality (eg. getpath*.c) (coding and building design)
* shift core functions to Python, using above design

I'll just plow ahead, but also recognize that any/all may change. ie. I'll build examples/finals/prototypes and Guido can pick/choose/reimplement/etc as needed. I'm out next week, but should start on the above items by the end of the month (will probably do another mod_dav release in there somewhere).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/