[Distutils] extensions in packages

Thu, 27 May 1999 10:48:48 -0400 (EDT)

M.-A. Lemburg writes:
 > Wow, what an analysis. 

  And such fun, as well!  ;-)

 > > It's the .py[co] files that
 > > are expensive to load!  Once you've created the package, sub-modules
 > > are very cheap: you will typically have no more than two path entries
 > > to check even once all this is in place.
 > 
 > I'm not sure I follow you here: do you mean with a package dir
 > cache in place or using the system implemented in the current

  Anything contained within a package is relatively cheap to load
because the search path is shorter.  Currently, if the __init__.py*
does nothing to the __path__, there's only one entry!
  In the current scheme, the .py[co] files are the last thing checked
within a directory during the search.  Loading one of these costs more 
in searching than any other type of module.  Of course, parsing Python 
isn't free either, so loading a .py file for which no .py[co] exists
is really more expensive; it's just found a little sooner.

I said:
 > caching; loading Grail is still dog slow, and I've no doubt that the
 > 600+ stat() calls contribute to that!  1-)

And then I corrected myself:
 >   Oops, after following through with the math, I'd have to adjust this
 > to 6000 stat()/open() calls for Grail.  Sorry!

  Ok, I loaded Grail and looked more carefully.  I was thinking it was 
loading about 100 modules.  Well, that's at the point that it loads
the user's .grail/user/grailrc.py (if it exists).  By the time my home
page was loaded, there were 145 distinct module objects loaded into
sys.modules, and 17 entries on sys.path.  Lots of Grail modules are in 
packages these days, but there are also a lot loaded from the standard
library.  So let's say there are probably around 5000 stat()/open()
calls (reduce the number due to package use, then increase it again
because (a) there are more modules being loaded than I'd estimated,
and (b) the standard library is quite a ways down sys.path).
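
  For a rough sense of where a number of that size comes from, here is
a back-of-the-envelope sketch.  The 145 modules and 17 path entries
are the figures above; the 4 filename probes per directory and the
"found halfway down sys.path on average" midpoint are assumptions made
up for illustration, not measurements.

```python
def estimate_probes(modules, avg_dirs_searched, probes_per_dir):
    """Rough count of stat()/open() attempts needed to import `modules`
    modules when each one is found after searching `avg_dirs_searched`
    directories, trying `probes_per_dir` filenames in each."""
    return modules * avg_dirs_searched * probes_per_dir

modules = 145        # distinct entries in sys.modules (from the message)
path_entries = 17    # entries on sys.path (from the message)
probes_per_dir = 4   # ASSUMPTION: candidate filenames tried per directory

# Upper bound: every module is found in the very last directory.
worst_case = estimate_probes(modules, path_entries, probes_per_dir)

# ASSUMPTION: on average a module turns up halfway down sys.path.
midpoint = estimate_probes(modules, path_entries / 2, probes_per_dir)

print(worst_case, midpoint)
```

  Packages pull the real figure down and a standard library sitting
late on sys.path pushes it up, so this only shows the order of
magnitude.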

 > This seems like something to worry about and probably also enough
 > to try really hard to find a good solution, IMHO.

  This is where a good caching system makes a lot of sense.

 > True, that's why the hook allows you to code the strategy in
 > Python. Note that my current version uses the sys.path as
 > key into a table of name:file mappings, so even when using
 > different setups (which will certainly have some differences in
 > sys.path), the cache should work. Maybe one should add some
 > more information to the key... like the platform specifics
 > or even the mtimes of the directories on the path.

  I'm not sure that keying on sys.path is sufficient.  Around here, a
Solaris/SPARC and Solaris/x86 box are likely to share the same
sys.path.  That doesn't mean the directories are the same; the
differences are taken care of via NFS.  Using the mtimes as part of
the key means you don't have any way to clear the cache: an older
mtime may just mean the version of the path for a different platform,
which still wants to use the cache!  Perhaps it could be keyed on
(platform, dir), and the mtimes could be used to determine the need to 
refresh that directory.
  Doing this right is hard, and can be substantially affected by a
site's filesystem layout.  Avoiding problems due to issues like these
is a good reason to use a runtime-only cache.  A site for which this
isn't sufficient can then use the "hook" mechanism to install something 
that can do better within the context of specific filesystem
management policies.
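
  A minimal sketch of the (platform, dir) idea, kept runtime-only, might
look like the following.  The class and method names are hypothetical,
and using sys.platform as the platform tag is an assumption; nothing
here is an existing interpreter or hook API.

```python
import os
import sys

class DirListingCache:
    """Runtime-only cache of directory listings keyed on (platform, dir).

    The mtime is stored with each entry and used only to decide when
    that entry needs refreshing; it is not part of the key."""

    def __init__(self, platform=sys.platform):
        # ASSUMPTION: sys.platform is a good enough platform tag.
        self.platform = platform
        self._entries = {}  # (platform, dir) -> (mtime, frozenset of names)

    def listing(self, directory):
        key = (self.platform, directory)
        mtime = os.stat(directory).st_mtime
        cached = self._entries.get(key)
        if cached is not None and cached[0] == mtime:
            return cached[1]        # directory unchanged: reuse listing
        names = frozenset(os.listdir(directory))
        self._entries[key] = (mtime, names)   # refresh this entry in place
        return names
```

  Since the mtime is not in the key, a changed directory replaces its
own entry instead of leaving a stale one behind.  This still costs one
stat() per directory per lookup, and it can miss a change that lands
within the filesystem's mtime granularity.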

 > Yep, remember that too. The problem with these scans is that
 > directories may contain huge amounts of files and you would
 > need to check all of them against the module extensions Python

  They probably won't contain much other than Python modules in a
reasonable installation.  There's no need to filter the list; just
include every file, and then test for the appropriate entries when
attempting a specific import.  This limits the up-front cost
substantially.
  If we don't assume a reasonable installation (non-module files in
the module dirs), it just gets slower and people have an incentive to
clean up their installation.  This is acceptable.
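
  To make the "don't filter, just test" approach concrete, here is a
small sketch.  The suffix list and its order are assumptions for
illustration (the real list is platform-dependent), and the function
name is hypothetical.

```python
# ASSUMPTION: illustrative suffixes, tried in this order.
ASSUMED_SUFFIXES = (".so", "module.so", ".py", ".pyc")

def find_module_file(name, listing, suffixes=ASSUMED_SUFFIXES):
    """Return the first candidate filename for `name` that appears in
    `listing`, an unfiltered set of every name in a directory, or None
    if there is no match."""
    for suffix in suffixes:
        candidate = name + suffix
        if candidate in listing:   # set lookup, no stat() or open()
            return candidate
    return None

# The listing is kept exactly as read, stray files and all.
listing = {"README", "core", "spam.py", "spam.pyc", "eggsmodule.so"}
print(find_module_file("spam", listing))
print(find_module_file("eggs", listing))
print(find_module_file("ham", listing))
```

  Stray files only make the set a little larger; each import still does
just a handful of membership tests.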

 > Anyway, the dynamic and static versions are both implementable
 > using the hook, so I'd opt for going into that direction
 > rather than hard-wiring some logic into the interpreter's core.

  I have no problems with using a "hook" to implement a more efficient 
mechanism.  I just want the "standard" mechanism to be efficient,
because that's the one I'll use.

  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives