M.-A. Lemburg writes:
Wow, what an analysis.
And such fun, as well! ;-)
It's the .py[co] files that are expensive to load! Once you've created the package, sub-modules are very cheap: you will typically have no more than two path entries to check even once all this is in place.
I'm not sure I follow you here: do you mean with a package dir cache in place or using the system implemented in the current
Anything contained within a package is relatively cheap to load because the search path is shorter. Currently, if the __init__.py* does nothing to the __path__, there's only one entry! In the current scheme, the .py[co] files are the last thing checked within a directory during the search. Loading one of these costs more in searching than any other type of module. Of course, parsing Python isn't free either, so loading a .py file for which no .py[co] exists is really more expensive; it's just found a little sooner.
I said:
caching; loading Grail is still dog slow, and I've no doubt that the 600+ stat() calls contribute to that! 1-)
And then I corrected myself:
Oops, after following through with the math, I'd have to adjust this to 6000 stat()/open() calls for Grail. Sorry!
Ok, I loaded Grail and looked more carefully. I was thinking it was loading about 100 modules. Well, that's at the point that it loads the user's .grail/user/grailrc.py (if it exists). By the time my home page was loaded, there were 145 distinct module objects loaded into sys.modules, and 17 entries on sys.path. Lots of Grail modules are in packages these days, but there are also a lot loaded from the standard library. So let's say there are probably around 5000 stat()/open() calls (reduce the number due to package use, then increase it again because (a) there are more modules being loaded than I'd estimated, and (b) the standard library is quite a ways down sys.path).
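To make the arithmetic behind that estimate concrete, here is a back-of-the-envelope sketch. Only the 145 modules and 17 path entries come from what I measured; the "found halfway down the path" average and the four probes per directory are assumptions picked for illustration, and estimate_probes is a made-up helper, not anything in the interpreter.

```python
def estimate_probes(modules, avg_dirs_searched, probes_per_dir):
    """Return a rough count of stat()/open() calls made while importing.

    modules            -- number of modules imported
    avg_dirs_searched  -- average number of path entries examined per module
    probes_per_dir     -- filesystem probes per directory (one per suffix tried)
    """
    return modules * avg_dirs_searched * probes_per_dir


# Measured: 145 modules in sys.modules, 17 entries on sys.path.
# Assumed: a module turns up about halfway down the path on average,
# and each directory costs about 4 probes.
print(estimate_probes(145, 17 / 2, 4))  # prints 4930.0
```

Different assumptions move the number around a fair bit, but it stays in the thousands rather than the hundreds.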
This seems like something to worry about and probably also enough to try really hard to find a good solution, IMHO.
This is where a good caching system makes a lot of sense.
True, that's why the hook allows you to code the strategy in Python. Note that my current version uses the sys.path as the key into a table of name:file mappings, so even when using different setups (which will certainly have some differences in sys.path), the cache should work. Maybe one should add some more information to the key... like the platform specifics or even the mtimes of the directories on the path.
I'm not sure that keying on sys.path is sufficient. Around here, a Solaris/SPARC and Solaris/x86 box are likely to share the same sys.path. That doesn't mean the directories are the same; the differences are taken care of via NFS. Using the mtimes as part of the key means you don't have any way to clear the cache: an older mtime may just mean the version of the path for a different platform, which still wants to use the cache! Perhaps it could be keyed on (platform, dir), and the mtimes could be used to determine the need to refresh that directory. Doing this right is hard, and can be substantially affected by a site's filesystem layout. Avoiding problems due to issues like these is a good reason to use a runtime-only cache. A site for which this isn't sufficient can then use the "hook" mechanism to install something that can do better within the context of specific filesystem management policies.
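A minimal sketch of the (platform, dir) keying, with the mtime used only to decide when to refresh an entry. This is the runtime-only flavor: nothing is written to disk. The DirCache class and its method names are hypothetical, and using sys.platform as the platform tag is an assumption; a real site might need something finer-grained.

```python
import os
import sys


class DirCache:
    """Hypothetical runtime-only cache of directory listings."""

    def __init__(self, platform=sys.platform):
        self.platform = platform
        # Maps (platform, directory) -> (mtime, frozenset of file names).
        self._entries = {}

    def listing(self, directory):
        """Return the names in directory, rescanning if its mtime changed."""
        key = (self.platform, directory)
        mtime = os.stat(directory).st_mtime
        cached = self._entries.get(key)
        if cached is None or cached[0] != mtime:
            # The mtime is stored as a value, not as part of the key, so a
            # stale entry is overwritten instead of lingering in the table.
            cached = (mtime, frozenset(os.listdir(directory)))
            self._entries[key] = cached
        return cached[1]
```

This still costs one stat() per directory per lookup, but that replaces the several probes per directory the search makes today.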
Yep, remember that too. The problem with these scans is that directories may contain huge amounts of files and you would need to check all of them against the module extensions Python
They probably won't contain much other than Python modules in a reasonable installation. There's no need to filter the list; just include every file, and then test for the appropriate entries when attempting a specific import. This limits the up-front cost substantially. If we don't assume a reasonable installation (non-module files in the module dirs), it just gets slower and people have an incentive to clean up their installation. This is acceptable.
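Roughly what I have in mind, as a sketch: list each directory once without filtering, and only test for candidate names when a specific import is attempted. The SUFFIXES tuple is an assumed search order for illustration (the interpreter's real list varies by platform), find_module and find_in_listing are hypothetical helpers, and package directories are ignored to keep it short.

```python
import os

# Assumed suffix order, for illustration only.
SUFFIXES = (".so", "module.so", ".py", ".pyc")


def find_in_listing(name, directory, listing):
    """Return the path of the first candidate for name in listing, or None."""
    for suffix in SUFFIXES:
        candidate = name + suffix
        if candidate in listing:  # set lookup; no stat() or open() needed
            return os.path.join(directory, candidate)
    return None


def find_module(name, path, listings):
    """Search path for name, filling listings (dir -> names) lazily."""
    for directory in path:
        if directory not in listings:
            try:
                # Unfiltered: every file name goes in, module or not.
                listings[directory] = frozenset(os.listdir(directory))
            except OSError:
                # Missing or unreadable path entries are treated as empty.
                listings[directory] = frozenset()
        found = find_in_listing(name, directory, listings[directory])
        if found is not None:
            return found
    return None
```

Non-module files only make the sets bigger; they don't add any per-import filesystem calls.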
Anyway, the dynamic and static versions are both implementable using the hook, so I'd opt for going in that direction rather than hard-wiring some logic into the interpreter's core.
I have no problems with using a "hook" to implement a more efficient
mechanism. I just want the "standard" mechanism to be efficient,
because that's the one I'll use.
-Fred
--
Fred L. Drake, Jr.