Freeze and new import architecture
[Apologies in advance for the length of this - but the background is probably necessary]

I will make no attempt to outline what the freeze tool should ultimately be! I'm not even going to outline what freeze is now! I'm just going to focus on one single requirement.

We want freeze to be capable of working with any number of "code containers" - i.e., to be able to locate code in "frozen C modules" (as it works now), or potentially in a .zip file etc.

The justification is fairly easy to explain: consider that you may want to use "freeze" to distribute an application that consists of a number of Python programs. Using freeze as it stands now, the entire Python library is frozen into each application. The result is that each program is frozen to an executable that may be megabytes in size, even though the bulk is likely to be the identical, standard Python library.

Attempting to cut a long story short, Guido, Jack Jansen, myself and Just came up with another idea for an extensible import mechanism: sys.path should not be restricted to path names. sys.path has "strings", and an associated map of "module finders". Thus, a sys.path entry could be a directory name (like now) or a .zip file, URL, etc.

I have included below a mail from Jack Jansen on this topic. To paraphrase Guido's response, it was: "looks good - but let's define a Python interface so it may work with JPython"

Accordingly, I have mocked up a few .py files which start to implement the ideas in Jack's post. However, I'm now floundering a little as to the best way to take this any further. I'm really looking for further support or criticism of the idea, and some interest from people in helping take this further...

If there is general interest, I will post my mock-up, which allows directories and URLs to exist on sys.path...

Thanks,

Mark.
-----Original Message-----
From: Jack Jansen [mailto:Jack.Jansen@cwi.nl]
Sent: Wednesday, 12 August 1998 1:11 AM
To: guido@python.org; just@letterror.com; mhammond@skippinet.com.au
Subject: An import architecture

I thought I'd just mail you the results of a discussion Mark and myself had about the import architecture. I'm just sending it out to us four, but I guess we should get more people into the discussion if Guido thinks that this is something worth "solving".

The problem we're looking at is that import.c and importdl.c have lots of special case code, and every time someone comes up with a new neat way to import modules more special case code has to be added. This happened with Mac PYC resource modules and PYD modules, and there's also importdl.c which knows about DLL modules for umpteen different systems. It would be nice if there was an architecture in import.c that the machine-specific code could hook into (and, ultimately, Python code as well).

WHAT WE HAVE
============

We noticed that there are really two issues involved in importing modules:

1. Finding the module in a specific namespace.
2. Importing a module of a specific type, once it has been found.

For 1. we currently have 5 namespaces: builtin modules, frozen modules, the filesystem namespace, the PYC resource namespace in a file (mac-only) and the PYD resource namespace in a file (again mac-only). Other namespaces can be envisioned, for instance the namespace inside a squeeze .pyz archive, a web-based namespace, whatever.

The builtin and frozen namespace are currently special, in that they don't occur in sys.path and are always virtually at the very front of sys.path. On the mac sys.path entries can be either filenames (in which case the latter two modulefinders are invoked) or directories (in which case the filesystem finder is invoked), on other platforms there are only directories in sys.path right now.
Regarding 2: the finder currently returns a structure that enables the correct importer to be called later on. Importers that we have are for builtin, frozen, .py/.pyc/.pyo modules, various dll-importers all hiding behind the same interface, PYC-resource importers (mac-only) and PYD-resource importers (mac-only).

WHAT WE WANT
============

What we'd like I'll try to describe top-down (hopefully better to understand than bottom-up).

Importing a module becomes something like

    for pathentry in sys.path:
        finder = getfinder(pathentry)
        loader = finder.find(module)
        module = loader()

getfinder() is something like

    if not path_to_finder.has_key(pathentry):
        for f in all_finder_types:
            finder = f.create_finder(pathentry)
            if finder:
                path_to_finder[pathentry] = finder
                break
    return path_to_finder[pathentry]

And there would be a call whereby a finder type registers itself (adds itself to all_finder_types). finder.create_finder() would examine the current path component, see if it could handle it, and, if so, return a finder object that will search this path component every time it is invoked.

The usual case is that getfinder doesn't do much: only the first time you see a new entry in sys.path you'll have to ask all the finders in turn whether they support it, and you just remember this.

The loaders register themselves with the finders, passing finder-specific arguments. For instance, the .py loader registers itself with the filesystem finder, telling it that the ".py" extension should result in a .py loader being created. The unix-specific dll-loader does the same for .so extensions, the windows-dll-loader for .dll, the mac-dll-loader for .slb etc. The PYC-resource loader tells the mac-resource-finder to look for 'PYC ' resources, the PYD-resource loader tells it to look for 'PYD ' resources, etc.

The information that is passed from the finder to the loader when it has found a module is again finder-specific: the filesystem finder will probably pass an open file and a filename, etc.
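The scheme described above can be sketched roughly as follows. This is only an illustration in present-day Python 3 syntax, not the mock-up mentioned in this thread: the names DirectoryFinder, register_loader, py_loader and import_module are hypothetical, only a filesystem finder and a .py loader are shown, and caching a None result for unsupported entries is an assumption of the sketch.

```python
import os
import types


class DirectoryFinder:
    """Hypothetical finder type for path entries that are directories."""

    loaders = {}  # extension -> loader factory, filled in by register_loader()

    @classmethod
    def register_loader(cls, extension, loader_factory):
        # A loader registers itself, passing finder-specific arguments.
        cls.loaders[extension] = loader_factory

    @classmethod
    def create_finder(cls, pathentry):
        # Return a finder if this type can handle the entry, else None.
        return cls(pathentry) if os.path.isdir(pathentry) else None

    def __init__(self, directory):
        self.directory = directory

    def find(self, modname):
        # Return a zero-argument loader callable, or None if nothing matches.
        for extension, loader_factory in self.loaders.items():
            filename = os.path.join(self.directory, modname + extension)
            if os.path.isfile(filename):
                return loader_factory(modname, filename)
        return None


def py_loader(modname, filename):
    """Hypothetical .py loader factory; the finder passes it a filename."""
    def load():
        module = types.ModuleType(modname)
        module.__file__ = filename
        with open(filename) as source:
            exec(compile(source.read(), filename, "exec"), module.__dict__)
        return module
    return load


all_finder_types = [DirectoryFinder]  # finder types register themselves here
path_to_finder = {}                   # cache: path entry -> finder (or None)


def getfinder(pathentry):
    # Only the first lookup of an entry asks every finder type in turn.
    if pathentry not in path_to_finder:
        found = None
        for finder_type in all_finder_types:
            found = finder_type.create_finder(pathentry)
            if found:
                break
        path_to_finder[pathentry] = found
    return path_to_finder[pathentry]


def import_module(modname, path):
    for pathentry in path:
        finder = getfinder(pathentry)
        loader = finder.find(modname) if finder else None
        if loader:
            return loader()
    raise ImportError(modname + " not found")


DirectoryFinder.register_loader(".py", py_loader)
```

A url-based finder could be added by appending another class with the same create_finder()/find() methods to all_finder_types, without touching import_module().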
A loader can register itself with multiple finders, assuming their interfaces are similar. So, the .py loader could register itself not only with the filesystem finder but also with a url-based finder or something, as long as that url-based finder uses the same calling convention for creating the loader.

WHAT DOES IT BUY US
===================

A greatly simplified import.c, importdl.c split out over the various platforms (and with the possibility to pass machine-specific info from the find phase to the load phase, something that can't be done now and leads to double work on various platforms) and easy extensibility.

There is the issue of performance. The description above is all Pythonish, but going through the Python calling sequence for all these things is probably not a good idea from a performance standpoint. This is however fixable. The objects involved are

- The finder "class", the thing you use to check whether a certain sys.path component can be handled by this code
- The finder instance returned by this class
- The importer class (called by the finder objects)
- The importer instance (returned by the finder instance through invoking the importer class).

These 4 things could well be 4 specific PyObject types, with the needed C-routines in the object struct. Calling these for the normal case (i.e. when they're implemented in C) would be at most a single (C-) indirection more expensive than the current scheme.

Moreover, it would be easy to create a module that would implement these 4 types as wrappers around Python code. The C-routines would then do the usual PyEval_CallObject stuff to call the Python implementation. So, you'd have the generality but you'd only pay for it when you put Python-handled entries in sys.path and actually hit those entries.

ODDS AND ENDS
=============

A side issue: this stuff would also allow us to put the builtin and frozen namespace into sys.path explicitly, for instance as "__builtins__" and "__frozen__", something I would like.
The disadvantage would be that you can't be sure that everything on sys.path is a pathname, but the advantage would be that you could, for instance, create a frozen program that you could patch: set sys.path to ["/usr/local/FooPatches", "__frozen__", "__builtins__"], and whenever you have a patch to a single module in a frozen executable you just send your clients the single .pyc file and tell them to put it in the FooPatches directory.

There'd probably have to be a bit of code that explicitly prepends "__frozen__" and "__builtins__" to sys.path if they aren't there already or something.

--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@cwi.nl      | ++++ if you agree copy these lines to your sig ++++
http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
I've messed around quite a bit with funky import mechanisms also and have found the current setup a bit tough to work within. I'll insert some comments below, but never fear... I've got more, too :-)

Mark Hammond wrote:
> ... sys.path not be restricted to path names. sys.path has "strings", and an associated map of "module finders". Thus, a sys.path entry could have a directory name (like now) or .zip file, URL, etc.
I would much prefer to see the module finder instances in the sys.path. Sometimes, it is *very* difficult to map strings to module finders. For example, if you have a .dll with frozen code in it, and the code has been frozen in one of N formats, then how can you determine which module finder to use for the format? IMO, it is better to insert the finder itself:

    sys.path.append(GZippedDLLResource("modulename", "mycode.dll"))
    sys.path.append(DLLResourceGroup("mymodules.dll"))
> ...

Jack Jansen wrote:
> ...
> WHAT WE HAVE
> ============
>
> We noticed that there are really two issues involved in importing modules:
> 1. Finding the module in a specific namespace.
> 2. Importing a module of a specific type, once it has been found.
I think the separation is bogus. Trying to fit into the Finder/Loader paradigm of the ihooks has always been a total, non-intuitive pain-in-the-ass for me (to be blunt :-). Instead, I just go straight for the import hook and ignore the whole ihooks thing.

I would put forward that we ignore the find/load paradigm and simply go to:

1. Import the given module if you can

If an element on sys.path can't do it (returning None), then you move to the next one.
> ...
> The builtin and frozen namespace are currently special, in that they don't occur in sys.path and are always virtually at the very front of sys.path. On the mac sys.path entries can be either filenames (in which case the latter two modulefinders are invoked) or directories (in which case the filesystem finder is invoked), on other platforms there are only directories in sys.path right now.
Simple to do:

    sys.path.insert(0, BuiltinImporter())

The BuiltinImporter handles compiled-in and frozen modules.
> Regarding 2: the finder currently returns a structure that enables the correct importer to be called later on. Importers that we have are for builtin, frozen, .py/.pyc/.pyo modules, various dll-importers all hiding behind the same interface, PYC-resource importers (mac-only) and PYD-resource importers (mac-only).
Punt this. Just import the dumb thing in one shot. Take the example of an HTTP-based import. Separating that into *two* transactions would be painful. It should be imported in one fell swoop. And no, you can't just keep the socket open and pass that to the loader -- that implies that you can defer the passing for a while, but the web server will time out your connection and close it. Conversely, if the intent is *not* to hold the "structure" for a while, then why the heck have two pieces?
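To make the one-shot idea concrete, here is a minimal sketch in present-day Python 3. The class name OneShotImporter and its fetch argument are hypothetical, and a plain dict stands in for the web server so that the example runs without a network; a real version would pass a function that performs a single HTTP request.

```python
import types


class OneShotImporter:
    """Hypothetical importer: locating and loading happen in one call."""

    def __init__(self, base_url, fetch):
        # fetch(url) is assumed to return source text, or None if absent.
        self.base_url = base_url
        self.fetch = fetch

    def do_import(self, modname):
        url = self.base_url + modname + ".py"
        source = self.fetch(url)  # one transaction; no connection held open
        if source is None:
            return None           # let the next sys.path element try
        module = types.ModuleType(modname)
        module.__file__ = url
        exec(compile(source, url, "exec"), module.__dict__)
        return module


# Stand-in for a web server: maps URLs to module source.
FAKE_SERVER = {
    "http://host.domain.name/pymodules/greeting.py":
        "def hello():\n    return 'hi'\n",
}

importer = OneShotImporter("http://host.domain.name/pymodules/", FAKE_SERVER.get)
greeting = importer.do_import("greeting")
```

Because nothing is returned between "finding" and "loading", there is no intermediate structure whose lifetime has to be managed.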
> WHAT WE WANT
> ============
>
> What we'd like I'll try to describe top-down (hopefully better to understand than bottom-up).
>
> importing a module becomes something like
>
>     for pathentry in sys.path:
>         finder = getfinder(pathentry)
>         loader = finder.find(module)
>         module = loader()
I'll amend this to:

    for pathentry in sys.path:
        if type(pathentry) == StringType:
            module = old_import(pathentry, modname)
        else:
            module = pathentry.do_import(modname)
        if module:
            return module
    else:
        raise ImportError, modname + " not found."
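A runnable approximation of this loop in present-day Python 3 might look like the following. It is a sketch under several assumptions: it walks its own search_path list instead of the real sys.path, old_import is approximated with importlib's PathFinder (and does not register the module in sys.modules), and DictImporter is a hypothetical importer object that serves modules from in-memory source.

```python
import importlib.machinery
import importlib.util
import types


def old_import(pathentry, modname):
    """Stand-in for the classic behaviour: look for modname in one directory."""
    spec = importlib.machinery.PathFinder.find_spec(modname, [pathentry])
    if spec is None or spec.loader is None:
        return None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


class DictImporter:
    """Hypothetical importer object holding module source in a dict."""

    def __init__(self, sources):
        self.sources = sources

    def do_import(self, modname):
        source = self.sources.get(modname)
        if source is None:
            return None  # cannot import it; the next entry gets a turn
        module = types.ModuleType(modname)
        exec(source, module.__dict__)
        return module


def import_from(search_path, modname):
    for pathentry in search_path:
        if isinstance(pathentry, str):
            module = old_import(pathentry, modname)
        else:
            module = pathentry.do_import(modname)
        if module is not None:
            return module
    raise ImportError(modname + " not found.")
```

A search path such as ["/some/dir", DictImporter({"inmem": "Y = 2"})] then mixes plain directory strings and importer instances; entries are tried in order and the first one that returns a module wins.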
> getfinder() is something like
>
>     if not path_to_finder.has_key(pathentry):
>         for f in all_finder_types:
>             finder = f.create_finder(pathentry)
>             if finder:
>                 path_to_finder[pathentry] = finder
>                 break
>     return path_to_finder[pathentry]
The above code is basically keeping a mirror of the sys.path list, but with importer instances in it. Just put those into sys.path itself.
> And there would be a call whereby a finder type registers itself (adds itself to all_finder_types).
In my proposal, this wouldn't be necessary. You insert finders right into sys.path.
> ...
> A loader can register itself with multiple finders, assuming their interfaces are similar. So, the .py loader could register itself not only with the filesystem finder but also with a url-based finder or something, as long as that url-based finder uses the same calling convention for creating the loader.
I'd rephrase this as you have multiple importer instances, each configured for a different "path" to its module namespace.
> WHAT DOES IT BUY US
> ===================
>
> A greatly simplified import.c, importdl.c split out over the various platforms (and with the possibility to pass machine-specific info from the find phase to the load phase, something that can't be done now and leads to double work on various platforms) and easy extensibility.
Agreed. Quite necessary. I would think that we could have different little code bits for each platform, much like we have thread_*.h in the Python/ subdirectory.
> There is the issue of performance. The description above is all Pythonish, but going through the Python calling sequence for all these things is probably not a good idea from a performance standpoint. This is however fixable.
I disagree. In almost all cases, you are talking about bringing a module into the interpreter. Your performance is going to be characterized by I/O, memory allocations for all the structures that get built when the module executes, and the actual execution time of that module. The time spent using Python to perform the import sequence is a non-issue.
> ...
> ODDS AND ENDS
> =============
>
> A side issue: this stuff would also allow us to put the builtin and frozen namespace into sys.path explicitly, for instance as "__builtins__" and "__frozen__", something I would like. The disadvantage would be that you can't be sure that everything on sys.path is a pathname, but the advantage would be that you could, for instance create a frozen program that you could patch:
Both of our proposals guarantee that not everything in sys.path is a pathname. If I insert a "foo.zip" or a "http://host.domain.name/pymodules/", then you certainly don't have pathnames.

I believe the biggest issue with my proposal is the fact that the values are no longer strings. However, the out-of-the-box version of Python can easily contain *just* strings. If people bring custom importers into their application, AND they have code that depends on the "string-ness" of sys.path, then it is their problem. By definition, they've altered the behavior of their app and they need to compensate; my proposal is backwards compatible for existing apps.

[ the tweak would be to avoid inserting BuiltinImporter() -- the instance/type could still exist, but merely be *implied* rather than explicitly within sys.path ]
> set sys.path to ["/usr/local/FooPatches", "__frozen__", "__builtins__"], and whenever you have a patch to a single module in a frozen executable you just send your clients the single .pyc file and tell them to put it in the FooPatches directory. There'd probably have to be a bit of code that explicitly prepends "__frozen__" and "__builtins__" to sys.path if they aren't there already or something.
This is quite humorous... Small world: I've used this approach before. In the Microsoft Merchant Server 1.0, we had a bunch of frozen Python code. However, we also looked in a specific directory for patches. We never patched it :-), but it was possible.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/