[Python-Dev] Import redesign [LONG]

Greg Stein gstein@lyra.org
Thu, 2 Dec 1999 19:19:40 -0800 (PST)


On Thu, 2 Dec 1999, Guido van Rossum wrote:
>...
> Sometime, Greg Stein wrote:
>...
> > On Thu, 18 Nov 1999, Guido van Rossum wrote:
>...
> > > Agreed.  I like some of imputil's features, but I think the API
> > > needs to be redesigned.
> > 
> > In what ways? It sounds like you've applied some thought. Do you have any
> > concrete ideas yet, or "just a feeling" :-)  I'm working through some
> > changes from JimA right now, and would welcome other suggestions. I think
> > there may be some outstanding stuff from MAL, but I'm not sure (Marc?)
> 
> I actually think that the way the PVM (Python VM) calls the importer
> ought to be changed.  Assigning to __builtin__.__import__ is a crock.
> The API for __import__ is a crock.

Something like sys.set_import_hook() ?

The other alternative that I see would be to have the C code scan
sys.importers, assuming each is a callable object, and call them with the
appropriate params (e.g. module name). Of course, to move this scanning
into Python would require something like sys.set_import_hook() unless
Python looks for a hard-coded module and entrypoint.
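
A minimal sketch of that scan, written in present-day Python rather than C.
The "importers" list stands in for the proposed sys.importers (which is not a
real attribute of sys), and find_module() and demo_importer() are
hypothetical names used only for illustration:

```python
import types

# Hypothetical stand-in for the proposed sys.importers list.
importers = []


def find_module(name):
    """Call each importer in order; the first non-None result wins."""
    for importer in importers:
        module = importer(name)
        if module is not None:
            return module
    raise ImportError("no importer could load %r" % name)


def demo_importer(name):
    # Toy importer: it only knows how to build a module called "spam".
    if name == "spam":
        module = types.ModuleType(name)
        module.answer = 42
        return module
    return None


importers.append(demo_importer)
print(find_module("spam").answer)
```

An importer signals "not mine" by returning None, so the scan simply moves on
to the next entry in the list.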

>...
> > Which APIs are you referring to? The "imp" module? The C functions? The
> > __import__ and reload builtins?
> 
> > I'm guessing some of imp, the two builtins, and only one or two C
> > functions.
> 
> All of those.

We can provide Python code to provide compatibility for "imp" and the two
hooks. Nothing we can do to the C code, though. I'm not sure what the
import API looks like from C, and whether they could all stay. A brief
glance looks like most could stay.
[ removing any would change Python's API version, which might be "okay" ]

>...
> > > - load .py/.pyc/.pyo files and shared libraries from files
> > 
> > No problem. Again, a function is needed for platform-specific loading of
> > shared libraries.
> 
> Is it useful to expose the platform differences?  The current
> imp.load_dynamic() should suffice.

This comes up several times throughout this message, and in some off-list
mail Guido and I have exchanged. Namely, "should dynamic loading be part
of the core, or performed via a module?"

I would rather see it become a module, rather than inside the core
(despite the fact that the module would have to be compiled into the
interpreter). I believe this provides more flexibility for people looking
to replace/augment/update/fix dynamic loading on various architectures.
Rather than changing the core, a person can just drop in another module.
The isolation between the core and modules is nicer, aesthetically, to me.

The modules would also be exposing Just Another Importer Function, rather
than a specialized API in the builtin imp module. Also note that it is
easier to keep a module *out* of a Python-based application, than it is to
yank functions out of the core of Python. Frozen apps, embedded apps, etc
could easily leave out dynamic loading.

Are there strict advantages? Not any that I can think of right now (beyond
a bit of ease-of-use mentioned above). It just feels better to me.

>...
> > > - sys.path and sys.modules should still exist; sys.path might
> > > have a slightly different meaning
> > 
> > I would suggest that both retain their *exact* meaning. We introduce
> > sys.importers -- a list of importers to check, in sequence. The first
> > importer on that list uses sys.path to look for and load modules. The
> > second importer loads builtins and frozen code (i.e. modules not on
> > sys.path).
> 
> This is looking like the redesign I was looking for.  (Note that
> imputil's current chaining is not good since it's impossible to remove
> or reorder importers, which I think is a required feature; an explicit
> list would solve this.)

The chaining is an aspect of the current, singular import hook that Python
uses. In the past, I've suggested the installation of a "manager" that
maintains a list. sys.importers is similar in practice.

Note that this Manager would be present with the sys.set_import_hook()
scheme, while the Manager is implied if the core scans sys.importers.
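
Roughly, such a Manager might look like the sketch below. The class name,
the import_hook() method and the two toy importers are assumptions made for
illustration; this is not imputil's actual API:

```python
class Manager:
    """Hypothetical manager: one hook entry point in front of an ordered list."""

    def __init__(self):
        self.importers = []

    def import_hook(self, name):
        # Ask each importer in turn; the first one that returns a value wins.
        for importer in self.importers:
            result = importer(name)
            if result is not None:
                return result
        raise ImportError("no importer handled %r" % name)


def builtin_importer(name):
    # Toy importer that only recognizes "sys".
    return "builtin:" + name if name == "sys" else None


def path_importer(name):
    # Toy importer that claims every name.
    return "path:" + name


manager = Manager()
manager.importers.extend([path_importer, builtin_importer])
print(manager.import_hook("sys"))   # the path importer answers first

# Reordering is an ordinary list operation: check builtins first.
manager.importers.reverse()
print(manager.import_hook("sys"))   # now the builtin importer answers

# So is removal.
manager.importers.remove(path_importer)
```

Because the importers live in a plain list, removing or reordering them needs
no support from the importers themselves.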

> Actually, the order is the other way around, but by now you should
> know that.  It makes sense to have separate ones for builtin and
> frozen modules -- these have nothing in common.

Yes, JimA pointed this out. The latest imputil has corrected this.

I combined the builtin and frozen Importers because they were just so
similar. I didn't want to iterate over two Importers when a single one
sufficed quite well.

*shrug* Could go either way, really.

> There's another issue, which isn't directly addressed by imputil,
> although with clever use of inheritance it might be doable.  I'd like
> more support for this however.  Quite orthogonally to the issue of
> having separate importers, I might want to recognize new extensions.

Correct: while imputil doesn't address this, the standard/default Importer
classes *definitely* can.

>...
> the directory/directories with .isl files are placed.)  This requires
> an ugly modification to the _fs_import() function.  (Which should have
> been a method, by the way, to make overriding it in a subclass of
> PathImporter easier!)

I yanked that code out of the DirectoryImporter so that the PathImporter
could use it. I could see a reorg that creates a FileSystemImporter that
defines the method, and the other two just subclass from that.

> I've been thinking here along the lines of a strategy where the
> standard importer (the one that walks sys.path) has a set of hooks
> that define various things it could look for, e.g. .py files, .pyc
> files, .so or .dll files.  This list of hooks could be changed to
> support looking for .isl files.

Agreed. It should be easy to have a mapping of extension to handler.

One issue: should there be an ordering to the extensions? Exercise for the
reader to alter the data structures...
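
One possible answer to the ordering question, as a sketch in present-day
Python: keep a list of (suffix, handler) pairs instead of a dictionary, so
the search order is explicit. All names here (suffix_hooks, find_code,
load_source, load_spam) and the ".spam" suffix are hypothetical:

```python
import os
import tempfile


def load_source(path):
    # Compile Python source text into a code object.
    with open(path) as f:
        return compile(f.read(), path, "exec")


def load_spam(path):
    # Toy translator: expose the file's text as a module-level string.
    with open(path) as f:
        return compile("text = %r" % f.read(), path, "exec")


# Hypothetical hook table. A list of pairs keeps the search order explicit.
suffix_hooks = [(".py", load_source)]


def find_code(directory, modname):
    """Return a code object from the first suffix that matches, else None."""
    for suffix, handler in suffix_hooks:
        path = os.path.join(directory, modname + suffix)
        if os.path.isfile(path):
            return handler(path)
    return None


# Register the new suffix ahead of ".py".
suffix_hooks.insert(0, (".spam", load_spam))

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "menu.spam"), "w") as f:
        f.write("eggs")
    namespace = {}
    exec(find_code(d, "menu"), namespace)
    print(namespace["text"])
```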

> There's an old, subtle issue that could be solved through this as
> well: whether or not a .pyc file without a .py file should be accepted
> or not.  Long ago (in Python 0.9.8) a .pyc file alone would never be
> loaded.  This was changed at the request of a small but vocal minority
> of Python developers who wanted to distribute .pyc files without .py
> files.  It has occasionally caused frustration because sometimes
> developers move .py files around but forget to remove the .pyc files,
> and then the .pyc file is silently picked up if it occurs on sys.path
> earlier than where the .py was moved to.

I think, "too bad for them."  :-)

Having just a .pyc is a very nice feature. But how can you tell whether it
was meant to be a plain .pyc or a mis-ordered one? To truly resolve that,
you would need to scan the whole path, looking for a .py. However, maybe
somebody put the .pyc there on purpose, to override the .py!

--- begin slightly-off-topic ---

Here is a neat little Bash script that allows you to use a .pyc as a CGI
(to avoid parse overhead). Normally, you can't just drop a .pyc into the
cgi-bin directory because the OS doesn't know how to execute it. Not a
problem, I say... just append your .pyc to the following Bash script and
execute! :-)

#!/bin/bash
exec - 3< $0 ; exec python -c 'import os,marshal ; f = os.fdopen(3, "rb") ; f.readline() ; f.readline() ; f.seek(8, 1) ; _c = marshal.load(f) ; del os, marshal, f ; exec _c' $@

(the script should be two lines; and no... you can't use readlines(2))

The above script will preserve stdin, stdout, and stderr. If the caller
also uses 3< ... well, that got overridden :-)

The script doesn't work on Windows for two reasons, though: 1) Bash, 2)
the "rb" mode followed by readline()

Detailed info at the bottom of http://www.lyra.org/greg/python/

--- end of off-topic ---

> Having a set of hooks for various extensions would make it possible to
> have a default where lone .pyc files are ignored, but where one can
> insert a .pyc importer in the list of hooks that does the right thing
> here.  (Of course, it may be possible that this whole feature of lone
> .pyc files should be replaced since the same need is easily taken care
> of by zip importers.)

Maybe. I'd still like to see plain .pyc files, but I know I can work
around any change you might make here :-)

(i.e. whatever you'd like to do... go for it)

> I also want to support (Jim A notwithstanding :-) a feature whereby
> different things besides directories can live on sys.path, as long as
> they are strings -- these could be added from the PYTHONPATH env
> variable.  Every piece of code that I've ever seen that uses sys.path
> doesn't care if a directory named in sys.path doesn't exist -- it may
> try to stat various files in it, which also don't exist, and as far as
> it is concerned that is just an indication that the requested module
> doesn't live there.

I'm not in favor of this, but it is more-than-doable. Again: your
discretion...

> Again, we would have to dissect imputil to support various hooks that
> deal with different kinds of entities in sys.path.  The default hook
> list would consist of a single item that interprets the name as a
> directory name; other hooks could support zip files or URLs.  Jack's
> "magic cookies" could also be supported nicely through such a
> mechanism.

Specifically, the PathImporter would get "dissected" :-). No problem.

> > Users can insert/append new importers or alter sys.path as before.
> > 
> > sys.modules continues to record name:module mappings.
> 
> Yes.
> 
> Note that the interpretation of __file__ could be problematic.  To
> what value do you set __file__ for a module loaded from a zip archive?

You don't (certainly not in a way that is nice/compatible for modules that
refer to it). This is why I don't like __file__ and __path__. They just
don't make sense in archives or frozen code. Python code that relies on
them will create problems when that code is placed into different
packaging mechanisms.

>...
> > > (I wouldn't mind a splitting up of importdl.c into several
> > > platform-specific files, one of which is chosen by the configure
> > > script; but that's a bit of a separate issue.)
> > 
> > Easy enough. The standard importer can select the appropriate
> > platform-specific module/function to perform the load. i.e. these can move
> > to Modules/ and be split into a module-per-platform.
> 
> Again: what's the advantage of exposing the platform specificity?

See above.

>...
> Probably more support is required from the other end: once it's common
> for modules to be imported from zip files, the distutil code needs to
> support the creation and installation of such zip files.  Also, there
> is a need for the install phase of distutil to communicate the
> location of the zip file to the Python installation.

I'm quite confident that something can be designed that would satisfy the
needs here. Something akin to .pth files that a zip importer could read.

>...
> > > - Standard import from zip or jar files, in two ways:
> > > 
> > >   (1) an entry on sys.path can be a zip/jar file instead of a directory;
> > >       its contents will be searched for modules or packages
> 
> Note that this is what I mention above for distutil support.
> 
> > While this could easily be done, I might argue against it. Old
> > apps/modules that process sys.path might get confused.
> 
> Above I argued that this shouldn't be a problem.

For most code, no, but as Fred mentioned (and I surmise), there are things
out there assuming that sys.path contains strings which specify
directories.

Sure, we can do this (your discretion), but my feeling is to avoid it.

> > If compatibility is not an issue, then "No problem."
> > 
> > An alternative would be an Importer instance added to sys.importers that
> > is configured for a specific archive (in other words, don't add the zip
> > file to sys.path, add ZipImporter(file) to sys.importers).
> 
> This would be harder for distutil: where does Python get the initial
> list of importers?

Default is just the two: BuiltinImporter and PathImporter. Adding
ZipImporters (or anything else) at startup is TBD, but shouldn't pose a
problem.

>...
> > >   (2) a file in a directory that's on sys.path can be a zip/jar file;
> > >       its contents will be considered as a package (note that this is
> > >       different from (1)!)
> > 
> > No problem. This will slow things down, as a stat() for *.zip and/or *.jar
> > must be done, in addition to *.py, *.pyc, and *.pyo.
> 
> Fine, this is where the caching comes in handy.

IFF caching is enabled for the particular platform and installation.

>...
> > The Importer class is already designed for subclassing (and its interface 
> > is very narrow, which means delegation is also *very* easy; see
> > imputil.FuncImporter).
> 
> But maybe it's *too* narrow; some of the hooks I suggest above seem to
> require extra interfaces -- at least in some of the subclasses of the
> Importer base class.

Correct -- the *subclasses*. I still maintain the imputil design of a
single hook (get_code) is Right.

I'll make a swipe at PathImporter in the next few weeks to add the
capability for new extensions.

> Note: I looked at the doc string for get_code() and I don't understand
> what the difference is between the modname and fqname arguments.  If I
> write "import foo.bar", what are modname and fqname?  Why are both
> present?  Also, while you claim that the API is narrow, the multiple
> return values (also the different types for the second item) make it
> complicated.

Gordon detailed this in another note...

Yes, the multiple return values make it a bit more complicated, but I
can't think of any reasonable alternatives.

A bit more doc should do the trick, I'd guess.

>...
> > >   - a hook to auto-generate .py files from other filename
> > >     extensions (as currently implemented by ILU)
> > 
> > No problem at all.
> 
> See above -- I think this should be more integrated with sys.path than
> you are thinking of.  The more I think about it, the more I see that
> the problem is that for you, the importer that uses sys.path is a
> final subclass of Importer (i.e. it is itself not further subclassed).
> Several of the hooks I want seem to require additional hooks in the
> PathImporter rather than new importers.

Correct -- I've currently designed/implemented PathImporter as "final".

I don't foresee a problem turning it into something that can be hooked at
run-time, or subclassed at code-time. A detailing of the features needed 
would be handy:

* allow alternative file suffixes, with functions or subclasses to map the
  file into a code/module object.

>...
> > > - Note that different kinds of hooks should (ideally, and within
> > >   reason) properly combine, as follows: if I write a hook to recognize
> > >   .spam files and automatically translate them into .py files, and you
> > >   write a hook to support a new archive format, then if both hooks are
> > >   installed together, it should be possible to find a .spam file in an
> > >   archive and do the right thing, without any extra action.  Right?
> > 
> > Ack. Very, very difficult.
> 
> Actually, I take most of this back.  Importers that deal with new
> extension types often have to go through a file system to transform
> their data to .py files, and this is just too complicated.  However it
> would be still nice if there was code sharing between the code that
> looks for .py and .pyc files in a zip archive and the code that does
> the same in a filesystem.  Hm, maybe even that shouldn't be necessary,
> the zip file probably should contain only .pyc files...

Gordon replies to this... All of the archives that Gordon, JimA, and I
have been using only store .pyc files. I don't see much code sharing
between the filesystem and archive import code.

>...
> > All is not lost, however. I can easily envision the get_code() hook as
> > allowing any kind of return type. If it isn't a code or module object,
> > then another hook is called to transform it.
> > [ actually, I'd design it similarly: a *series* of hooks would be called
> >   until somebody transforms the foo.spam into a code/module object. ]
> 
> OK.  This could be a feature of a subclass of Importer.

That would be my preference, rather than loading more into the Importer
base class itself.
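
A sketch of such a transformation series, in present-day Python. The names
"transformers", to_code() and compile_text() are hypothetical, and the
two-argument hook signature is an assumption, not part of imputil:

```python
import types

# Hypothetical series of hooks, tried in order.
transformers = []


def to_code(obj, name):
    """Return obj as a code or module object, using the hooks if needed."""
    wanted = (types.CodeType, types.ModuleType)
    if isinstance(obj, wanted):
        return obj
    for transform in transformers:
        result = transform(obj, name)
        if isinstance(result, wanted):
            return result
    raise ImportError("nothing could transform %r for %s" % (type(obj), name))


def compile_text(obj, name):
    # Toy hook: treat a string as Python source text; ignore everything else.
    if isinstance(obj, str):
        return compile(obj, name, "exec")
    return None


transformers.append(compile_text)

namespace = {}
exec(to_code("x = 1 + 1", "foo.spam"), namespace)
print(namespace["x"])
```

A hook that does not recognize the value returns None, and the next hook in
the series gets its turn.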

>...
> > > - It should be possible to write hooks in C/C++ as well as Python
> > 
> > Use FuncImporter to delegate to an extension module.
> 
> Maybe not so great, since it sounds like the C code can't benefit from
> any of the infrastructure that imputil offers.  I'm not sure about
> this one though.

There isn't any infrastructure that needs to be accessed. get_code() is
the call-point, and there is no mechanism provided to the callee to call
back into the imputil system.

> > This is one of the benefits of imputil's single/narrow interface.
> 
> Plus its vague specs? :-)

Ouch. I thought I was actually doing quite a bit better than normal with
that long doc-string on get_code :-(

>...
> > For a restricted execution app, it might install an Importer that loads
> > files from *one* directory only which is configured from a specific
> > Win32 Registry entry. That importer could also refuse to load shared
> > modules. The BuiltinImporter would still be present (although the app
> > would certainly omit all but the necessary builtins from the build).
> > Frozen modules could be excluded.
> 
> Actually there's little reason to exclude frozen modules or any
> .py/.pyc modules -- by definition, bytecode can't be dangerous.  It's
> the builtins and extensions that need to be censored.
> 
> We currently do this by subclassing ihooks, where we mask the test for
> builtins with a comparison to a predefined list of names.

True. My concern is an invader misusing one "type" of module for another.
For example, let's say you've provided a selection of modules each
exporting function FOO, and the user can configure which module to use.
Can they do damage if some unrelated, frozen module also exports FOO?

Minor issue, anyhow. All the functionality is there.
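
The name-list masking Guido describes could look roughly like this in
present-day Python. This is a sketch only: ALLOWED_BUILTINS and
is_allowed_builtin() are hypothetical names, and it is not the ihooks
subclass he mentions:

```python
import sys

# Hypothetical policy: the only builtin modules the restricted app may import.
ALLOWED_BUILTINS = frozenset(["marshal", "sys"])


def is_allowed_builtin(name):
    """True only for modules that are compiled in AND on the approved list."""
    return name in sys.builtin_module_names and name in ALLOWED_BUILTINS


for candidate in ("marshal", "gc", "os"):
    print(candidate, is_allowed_builtin(candidate))
```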

>...
> > I posited once before that the cost of import is mostly I/O rather than
> > CPU, so using Python should not be an issue. MAL demonstrated that a good
> > design for the Importer classes is also required. Based on this, I'm a
> > *strong* advocate of moving as much as possible into Python (to get
> > Python's ease-of-coding with little relative cost).
> 
> Agreed.  However, how do you explain the slowdown (from 9 to 13
> seconds I recall) though?  Are you a lousy coder? :-)

Heh :-)

I have not spent *any* time working on optimization. Currently, each
Importer in the chain redoes some work of the prior Importer. A bit of
restructuring would split the common work out to a Manager, which then
calls a method in the Importer (and passes all the computed work). Of
course, a bit of profiling wouldn't hurt either. Some of the "imp"
interfaces could possibly be refined to better support the BuiltinImporter
or the dynamic load features.

The question is still valid, though -- at the moment, I can't explain it
because I haven't looked into it.

> > The (core) C code should be able to search a path for a module and import
> > it. It does not require dynamic loading or packages. This will be used to
> > import exceptions.py, then imputil.py, then site.py.

Note: after writing this, I realized there is really no need for the core
to do the imputil import. site.py can easily do that.

> It does, however, need to import builtin modules.  imputil currently

Correct.

> imports imp, sys, strop and __builtin__, struct and marshal; note that
> struct can easily be a dynamic loadable module, and so could strop in
> theory.  (Note that strop will be unnecessary in 1.6 if you use string
> methods.)

I knew about strop, but imputil would be harder to use today if it relied
on the string methods. So... I've delayed that change.

The struct module is used in a couple teeny cases, dealing with
constructing a network-order, 4-byte, binary integer value. It would be
easy enough to just do that with a bit of Python code instead.
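
For example, something along these lines would do (written in present-day
Python, where the result is a bytes object; pack_uint32_be() and
unpack_uint32_be() are hypothetical helper names, not functions from
imputil):

```python
def pack_uint32_be(value):
    """Pack an integer into 4 bytes, most significant byte first."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    return bytes((value >> shift) & 0xFF for shift in (24, 16, 8, 0))


def unpack_uint32_be(data):
    """Inverse of pack_uint32_be()."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    result = 0
    for byte in data:
        result = (result << 8) | byte
    return result


packed = pack_uint32_be(0x01020304)
print(packed, unpack_uint32_be(packed) == 0x01020304)
```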

> I don't think that this chicken-or-egg problem is particularly
> problematic though.

Right.

In my ideal world, the core couldn't do a dynamic load, so that would need
to be considered within the bootstrap process.

>...
> > site.py can complete the bootstrap by setting up sys.importers with the
> > appropriate Importer instances (this is where an application can define
> > its own policy). sys.path was initially set by the import.c bootstrap code
> > (from the compiled-in path and environment variables).
> 
> I think that algorithm (currently in getpath.c / getpathp.c) might
> also be moved to Python code -- imported frozen.  Sadly, rebuilding
> with a new version of a frozen module might be more complicated than
> rebuilding with a new version of a C module, but writing and
> maintaining this code in Python would be *sooooooo* much easier that I
> think it's worth it.

I think we can find a better way to freeze modules and to use them.
Especially for the cases where we have specific "core" functions
implemented in Python. (e.g. freezing parsers, compilers, and/or the
read-eval loop)

I don't foresee an issue that the build process becomes more complicated.
If we nuke "makesetup" in favor of a Python script, then we could create a
stub Python executable which runs the build script which writes the Setup
file and the getpath*.c file(s).

> > Note that imputil.py would not install any hooks when it is loaded. That
> > is up to site.py. This implies the core C code will import a total of
> > three modules using its builtin system. After that, the imputil mechanism
> > would be importing everything (site.py would .install() an Importer which
> > then takes over the __import__ hook).
> 
> (Three not counting the builtin modules.)

Correct, although I'll modify my statement to "two plus the builtins".

> > Further note that the "import" Python statement could be simplified to use
> > only the hook. However, this would require the core importer to inject
> > some module names into the imputil module's namespace (since it couldn't
> > use an import statement until a hook was installed). While this
> > simplification is "neat", it complicates the run-time system (the import
> > statement is broken until a hook is installed).
> 
> Same chicken-or-egg.  We can be pragmatic.
> 
> For a developer, I'd like a bit of robustness (all this makes it
> rather hard to debug a broken imputil, and that's a fair amount of
> code!).

True. I threw that out as an alternative, and then presented the counter
argument :-)

>...
> > Therefore, the core C code must also support importing builtins. "sys" and
> > "imp" are needed by imputil to bootstrap.
> > 
> > The core importer should not need to deal with dynamic-load modules.
> 
> Same question.  Since that all has to be coded in C anyway, why not?

It simplifies the core's import code to not deal with that stuff at all.

> > To support frozen apps, the core importer would need to support loading
> > the three modules as frozen modules.
> 
> I'd like to see a description of how someone like Jim A would build a
> single-file application using the new mechanism.  This could
> completely replace freeze.  (Freeze currently requires a C compiler;
> that's bad.)

The portable mechanism for freezing will always need a compiler. Platform
specific mechanisms (e.g. append to the .EXE, or use the linker to create
a new ELF section) can optimize the freeze process in different ways.

I don't have a design in my head for the freeze issues -- I've been
considering that the mechanism would remain about the same. However, I can
easily see that different platforms may want to use different freeze
processes... hmm...

>...
> > Yes. I don't see this as a requirement, though. We wouldn't start to use
> > these by default, would we? Or insist on zlib being present? I see this as
> > more along the lines of "we have provided a standardized Importer to do
> > this, *provided* you have zlib support."
> 
> Agreed.  Zlib support is easy to get, but there are probably platforms
> where it's not.  (E.g. maybe the Mac?  I suppose that on the Mac,
> there would be some importer classes to import from a resource fork.)

Exactly. And importer classes to load from Win32 resources (modifying a
.EXE's resources post-link is cleaner than the append solution)

>...
> > My outline above does not freeze anything. Everything resides in the
> > filesystem. The C code merely needs a path-scanning loop and functions to
> > import .py*, builtin, and frozen types of modules.
> 
> Good.  Though I think there's also a need for freezing everything.
> And when we go the route of the zip archive, the zip archive handling
> code needs to be somewhere -- frozen seems to be a reasonable choice.

Sure.

> > If somebody nukes their imputil.py or site.py, then they return to Python
> > 1.4 behavior where the core interpreter uses a path for importing (i.e. no
> > packages). They lose dynamically-loaded module support.
> 
> But if the path guessing is also done by site.py (as I propose) the
> path will probably be wrong.  A warning should be printed.

All right. Doesn't Python already print a warning if it can't find
site.py?

> > > Let's first complete the requirements gathering.  Are these
> > > requirements reasonable?  Will they make an implementation too
> > > complex?  Am I missing anything?
> > 
> > I'm not a fan of the compositing due to it requiring a change to semantics
> > that I believe are very useful and very clean. However, I outlined a
> > possible, clean solution to do that (a secondary set of hooks for
> > transforming get_code() return values).
> 
> As you may see from my responses, I'm a big fan of having several
> different sets of hooks.

Yes. However, I've only recognized one so far. Propose more... I'm
confident we can update the PathImporter design to accommodate (and retain
the underlying imputil paradigm).

> I do withdraw the composition requirement
> though.

:-)

>...
> > Once you hit site.py, you have a "full" environment and can easily detect
> > and import a read-eval-print loop module (i.e. why return to Python? just 
> > start things up right there).
> 
> You mean "why return to C?"  I agree.  It would be cool if somehow

Heh. Yah, that's what I meant :-)

> IDLE and Pythonwin would also be bootstrapped using the same
> mechanisms.  (This would also solve the question "which interactive
> environment am I using?" that some modules and apps want to see
> answered because they need to do things differently when run under
> IDLE, for example.)

Haven't thought on this. Should be doable, I'd think.

> > site.py can also install new optimizers as desired, a new Python-based
> > parser or compiler, or whatever...  If Python is built without a parser or
> > compiler (I hope that's an option!), then the three startup modules would
> > simply be frozen into the executable.
> 
> More power to hooks!

:-) You betcha!

I believe my next order of business:

* update PathImporter with the file-extension hook
* dynload C code reorg, per the other email
* create new-model site.py and trash import.c
* review freeze mechanisms and process
* design mechanism for frozen core functionality (e.g. getpath*.c)
  (coding and building design)
* shift core functions to Python, using above design

I'll just plow ahead, but also recognize that any/all may change. i.e. I'll
build examples/finals/prototypes and Guido can pick/choose/reimplement/etc
as needed. I'm out next week, but should start on the above items by the
end of the month (will probably do another mod_dav release in there
somewhere).

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/