[Python-3000] Pre-PEP on fast imports

Phillip J. Eby pje at telecommunity.com
Tue Jun 12 01:18:40 CEST 2007


At 12:46 AM 6/12/2007 +0200, Giovanni Bajo wrote:
>Hi Phillip,
>
>I'm going to submit a PEP for Python 3000 (possibly to be backported 
>as an off-by-default option in Python 2). It's related to imports 
>and how to make them faster. Given your expertise on the subject, 
>I'd appreciate it if you could review my ideas. I briefly spoke about it 
>with Alex Martelli a few days ago at PyCon Italia and he was not 
>negative about it.
>
>Problems:
>
>- A single import causes many syscalls (.pyo, .pyc, .py, in both 
>directory and .zip file).
>- The situation is getting worse with the advent of 
>easy_install, which produces many .pth files (and so a longer sys.path).
>- Python startup time is slow, and a noticeable fraction of it is 
>spent on site.py-related work (a simple hello world takes 
>0.012s if run without -S, and 0.008s if run with -S).
>- Many people might not be interested in this, but others are really 
>concerned. E.g., again at PyCon Italia, I spoke with one of the 
>leading Sugar programmers (OLPC), who told me that one of the biggest 
>blockers right now is Python startup time (applications on the latest 
>OLPC prototype take 3-4 seconds to start up). He suggested that this 
>was related to the large number of syscalls made for imports.
>
>
>Proposed solution:
>
>- A site cache is introduced. It's a dictionary mapping module names 
>to absolute file paths.
>- When an import occurs, for each directory/zipfile we walk in 
>sys.path, we read all of its entries and update the site cache 
>with every Python module found there.
>- If the filepath for a certain module is found in the site cache, 
>the module is directly accessed. Otherwise, sys.path is walked.
>- The site cache can be cleared with sys.clear_site_cache(). This 
>must be used after manual editing of sys.path (or could be done 
>automatically by making sys.path a list subclass which notices each 
>modification).
>- The site cache must be manually cleared if a Python file is added 
>to a directory in sys.path after the application has started. This 
>is a rare-enough scenario to require an additional explicit call.
>- If for whatever reason a filepath found in the site cache cannot 
>be accessed (unmounted device, whatever) ImportError is raised. 
>Again, this is something which is very rare and does not require 
>much attention.
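
To make the quoted proposal concrete, here is a minimal sketch of such a site cache. It is only an illustration, not code from the proposal: the names `MODULE_SUFFIXES`, `scan_directory`, `build_site_cache` and `locate` are made up, the suffix list is an assumption, zip files are skipped, and the choice of which file wins when both `foo.py` and `foo.pyc` exist is arbitrary.

```python
import os

# Assumed suffix list, for illustration only; a real implementation would
# ask the interpreter which suffixes it recognizes.
MODULE_SUFFIXES = (".py", ".pyc", ".pyo")


def scan_directory(directory):
    """Map module names to absolute file paths for a single directory."""
    found = {}
    for fname in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(fname)
        if ext in MODULE_SUFFIXES and "." not in base:
            # The first file seen for a name is kept (arbitrary in this sketch).
            found.setdefault(base, os.path.abspath(os.path.join(directory, fname)))
    return found


def build_site_cache(search_path):
    """Read every directory on the search path once; earlier entries win."""
    cache = {}
    for entry in search_path:
        if not os.path.isdir(entry):
            continue  # zip files and missing entries are ignored here
        for name, filepath in scan_directory(entry).items():
            cache.setdefault(name, filepath)
    return cache


def locate(name, cache):
    """Return the cached path, or None to signal 'walk sys.path instead'."""
    return cache.get(name)
```

Something like `locate("os", build_site_cache(sys.path))` would then answer a lookup from the dictionary alone, without further directory probing.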

Here's a simpler solution, one that's easily testable using existing 
Python versions.  Create a subclass of pkgutil.ImpImporter 
(Python >=2.5) that caches a listdir of its contents, and uses it to 
immediately reject any find_module() requests for which matching data 
is not in its cached listdir.  Add this class to sys.path_hooks, and 
see if it speeds things up.
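
One rough way to check for a speedup is to time interpreter startup before and after installing the hook. The helper below is a sketch under stated assumptions: `average_startup` is a made-up name, it measures wall-clock time including process-spawn overhead, and it times whatever interpreter is currently running.

```python
import subprocess
import sys
import time


def average_startup(extra_args=(), runs=20):
    """Average wall-clock seconds to start the interpreter and run 'pass'."""
    command = [sys.executable] + list(extra_args) + ["-c", "pass"]
    start = time.time()
    for _ in range(runs):
        subprocess.call(command)
    return (time.time() - start) / runs
```

Comparing `average_startup()` with `average_startup(["-S"])` gives the site.py share of startup; repeating the first measurement with and without the hook in site.py shows whether the cache helps.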

If it doesn't produce an improvement, your more-ambitious version of 
the idea won't work.  If it does produce an improvement, it's likely 
to be much simpler to implement at the C level than your idea 
is.  Meanwhile, it doesn't tear up the import machinery with a new 
special-purpose mechanism; it simply leverages the existing hooks.

The subclass might look something like this:

     import imp, os, sys
     from pkgutil import ImpImporter

     suffixes = set(ext for ext,mode,typ in imp.get_suffixes())

     class CachedImporter(ImpImporter):
         def __init__(self, path):
             if not os.path.isdir(path):
                 raise ImportError("Not an existing directory")
             # Call the base class explicitly; this works whether or not
             # ImpImporter is a new-style class, which super() requires.
             ImpImporter.__init__(self, path)
             self.refresh()

         def refresh(self):
             # Re-read the directory and remember every importable name.
             self.cache = set()
             for fname in os.listdir(self.path):
                 base, ext = os.path.splitext(fname)
                 if ext in suffixes and '.' not in base:
                     self.cache.add(base)
                 elif not ext:
                     # Possibly a package directory; let the base class decide.
                     self.cache.add(base)

         def find_module(self, fullname, path=None):
             if fullname.split(".")[-1] not in self.cache:
                 return None  # no need to check further
             return ImpImporter.find_module(self, fullname, path)

     sys.path_hooks.append(CachedImporter)

Stick this at the top of your site.py and see what happens.  I'll be 
interested to hear the results.  (Notice, by the way, that with this 
implementation one can easily clear the entire cache by clearing 
sys.path_importer_cache, or deleting the entry for a specific path, 
as well as by taking the entry for that path and calling its refresh() method.)
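
The three invalidation routes just mentioned might be wrapped up roughly like this. This is only a sketch: the helper names `forget_all`, `forget_path` and `refresh_path` are made up for illustration, and `refresh_path` assumes the object cached for a path exposes a `refresh()` method like the one on the class above.

```python
import sys


def forget_all():
    """Drop every cached importer; each path entry is re-probed on next import."""
    sys.path_importer_cache.clear()


def forget_path(path):
    """Drop the cached importer for a single path entry, if there is one."""
    sys.path_importer_cache.pop(path, None)


def refresh_path(path):
    """Re-read the listing held by an already cached importer.

    Returns True if a refresh happened, and False if the path has no cached
    importer or the cached object has no refresh() method.
    """
    importer = sys.path_importer_cache.get(path)
    refresh = getattr(importer, "refresh", None)
    if refresh is None:
        return False
    refresh()
    return True
```

After dropping a new module into a directory that is already on sys.path, `refresh_path(that_directory)` is the cheapest of the three, since it leaves the importers for all other entries untouched.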


