[Python-3000] Pre-PEP on fast imports
Phillip J. Eby
pje at telecommunity.com
Tue Jun 12 01:18:40 CEST 2007
At 12:46 AM 6/12/2007 +0200, Giovanni Bajo wrote:
>Hi Philip,
>
>I'm going to submit a PEP for Python 3000 (and possibly backported
>as an option off by default in Python 2). It's related to imports
>and how to make them faster. Given your expertise on the subject,
>I'd appreciate it if you could review my ideas. I briefly spoke of it
>with Alex Martelli a few days ago at PyCon Italia and he was not
>negative about it.
>
>Problems:
>
>- A single import causes many syscalls (.pyo, .pyc, .py, in both
>directory and .zip file).
>- Situation is getting worse and worse with the advent of
>easy_install which produces many .pth files (longer sys.path).
>- Python startup time is slow, and a noticeable fraction of it is
>dominated by site.py-related stuff (a simple hello world takes
>0.012s if run without -S, and 0.008s if run with -S).
>- Many people might not be interested in this, but others are really
>concerned. E.g., again at PyCon Italia, I spoke with one of the
>leading Sugar programmers (OLPC) who told me that one of the biggest
>blockers right now is the Python startup time (applications on the latest
>OLPC prototype take 3-4 seconds to start up). He suggested that this
>was related to the large number of syscalls made for imports.
>
>
>Proposed solution:
>
>- A site cache is introduced. It's a dictionary mapping module names
>to absolute file paths.
>- When an import occurs, for each directory/zipfile we walk in
>sys.path, we read all directory entries and update the site cache
>with all the Python modules found in that directory/zipfile.
>- If the filepath for a certain module is found in the site cache,
>the module is directly accessed. Otherwise, sys.path is walked.
>- The site cache can be cleared with sys.clear_site_cache(). This
>must be used after manual editing of sys.path (or could be done
>automatically by making sys.path a list subclass which notices each
>modification).
>- The site cache must be manually cleared if a Python file is added
>to a directory in sys.path after the application has started. This
>is a rare-enough scenario to require an additional explicit call.
>- If for whatever reason a filepath found in the site cache cannot
>be accessed (unmounted device, whatever) ImportError is raised.
>Again, this is something which is very rare and does not require
>much attention.
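
For reference, the site cache described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: build_site_cache is a hypothetical name, zip files and packages are ignored, and the suffix list comes from importlib.machinery.all_suffixes() in current Python rather than anything in the proposal.

```python
import os
import sys
from importlib.machinery import all_suffixes


def build_site_cache(search_path):
    """Map top-level module names to absolute file paths.

    Hypothetical helper: one listdir per directory, no per-module stat calls.
    """
    suffixes = tuple(all_suffixes())
    cache = {}
    for directory in search_path:
        try:
            entries = os.listdir(directory)
        except OSError:
            # Zip files and missing directories are skipped in this sketch.
            continue
        for entry in entries:
            for suffix in suffixes:
                if entry.endswith(suffix):
                    name = entry[:-len(suffix)]
                    if name and "." not in name:
                        # Earlier sys.path entries win, as in a normal path walk.
                        cache.setdefault(
                            name,
                            os.path.join(os.path.abspath(directory), entry),
                        )
                    break
    return cache


site_cache = build_site_cache(sys.path)
print(len(site_cache), "top-level modules found")
```

A lookup is then a single dictionary access, and a miss falls back to walking sys.path as the proposal describes.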

Here's a simpler solution, one that's easily testable using existing
Python versions. Create a subclass of pkgutil.ImpImporter
(Python >=2.5) that caches a listdir of its contents, and uses it to
immediately reject any find_module() requests for which matching data
is not in its cached listdir. Add this class to sys.path_hooks, and
see if it speeds things up.

If it doesn't produce an improvement, your more-ambitious version of
the idea won't work. If it does produce an improvement, it's likely
to be much simpler to implement at the C level than your idea is.
Meanwhile, it doesn't tear up the import machinery with a new
special-purpose mechanism; it simply leverages the existing hooks.
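
One rough way to check for an improvement is to time interpreter startup before and after the change, with and without -S. The sketch below is an assumption about how one might measure it, not a benchmark from this thread; startup_seconds is a hypothetical helper and the numbers will vary by machine.

```python
import subprocess
import sys
import time


def startup_seconds(extra_args=(), runs=5):
    """Best-of-N wall-clock time to start the interpreter and run `pass`."""
    cmd = [sys.executable, *extra_args, "-c", "pass"]
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best


with_site = startup_seconds()          # normal startup, site.py is processed
without_site = startup_seconds(["-S"])  # -S skips the site module
print("with site:    %.4fs" % with_site)
print("without site: %.4fs" % without_site)
```

Running it once against an unmodified site.py and once with the hook installed gives two comparable pairs of numbers.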

The subclass might look something like this:

    import imp, os, sys
    from pkgutil import ImpImporter

    # File extensions the interpreter recognizes as modules
    suffixes = set(ext for ext, mode, typ in imp.get_suffixes())

    class CachedImporter(ImpImporter):

        def __init__(self, path):
            if not os.path.isdir(path):
                raise ImportError("Not an existing directory")
            # Call the base class directly; this also works if ImpImporter
            # is an old-style class, where super() would fail.
            ImpImporter.__init__(self, path)
            self.refresh()

        def refresh(self):
            # Cache the names this directory could supply as modules.
            self.cache = set()
            for fname in os.listdir(self.path):
                base, ext = os.path.splitext(fname)
                # Entries without an extension may be package directories.
                if '.' not in base and (ext in suffixes or not ext):
                    self.cache.add(base)

        def find_module(self, fullname, path=None):
            if fullname.split(".")[-1] not in self.cache:
                return None  # no need to check further
            return ImpImporter.find_module(self, fullname, path)

    sys.path_hooks.append(CachedImporter)

Stick this at the top of your site.py and see what happens. I'll be
interested to hear the results. (Notice, by the way, that with this
implementation one can easily clear the entire cache by clearing
sys.path_importer_cache, or clear the cache for a specific path by
deleting its entry, or by taking the entry for that path and calling
its refresh() method.)
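
For anyone trying this on a current Python, where imp and pkgutil.ImpImporter no longer exist (both were removed in 3.12), here is a minimal sketch of the same idea on top of importlib, followed by the cache operations just mentioned. CachedFinder and the demo directory are hypothetical names for illustration. Note that importlib's FileFinder already caches directory contents on its own, so this is meant to show the mechanism, not to promise a speedup.

```python
import os
import sys
import tempfile
from importlib.machinery import (
    BYTECODE_SUFFIXES, EXTENSION_SUFFIXES, SOURCE_SUFFIXES,
    ExtensionFileLoader, FileFinder, SourceFileLoader, SourcelessFileLoader,
)

# Loader/suffix pairs handed to the FileFinder that does the real work.
_LOADERS = [
    (ExtensionFileLoader, EXTENSION_SUFFIXES),
    (SourceFileLoader, SOURCE_SUFFIXES),
    (SourcelessFileLoader, BYTECODE_SUFFIXES),
]


class CachedFinder:
    """Hypothetical path-entry finder that rejects names missing from a cached listdir."""

    def __init__(self, path):
        if not os.path.isdir(path):
            raise ImportError("Not an existing directory")
        self.path = path
        self._delegate = FileFinder(path, *_LOADERS)
        self.refresh()

    def refresh(self):
        # Keep the part of each entry before the first dot. This is a superset
        # of importable names; the delegate sorts out false positives.
        self.cache = {entry.split(".", 1)[0] for entry in os.listdir(self.path)}
        self._delegate.invalidate_caches()

    # importlib.invalidate_caches() calls this name on cached path-entry finders.
    invalidate_caches = refresh

    def find_spec(self, fullname, target=None):
        if fullname.rpartition(".")[2] not in self.cache:
            return None  # no need to check further
        return self._delegate.find_spec(fullname, target)


# To install it for real, one would put it ahead of the default hooks:
# sys.path_hooks.insert(0, CachedFinder)

demo_dir = tempfile.mkdtemp()
finder = CachedFinder(demo_dir)
sys.path_importer_cache[demo_dir] = finder  # where the import system keeps finders

# A module added after the listdir was cached is not seen...
with open(os.path.join(demo_dir, "late_module.py"), "w") as f:
    f.write("VALUE = 1\n")
stale = finder.find_spec("late_module")

# ...until the entry for that path is refreshed.
sys.path_importer_cache[demo_dir].refresh()
fresh = finder.find_spec("late_module")
print("before refresh:", stale)
print("after refresh: ", fresh.origin)

# Deleting the entry for one path makes the import system rebuild its finder.
sys.path_importer_cache.pop(demo_dir, None)
# Clearing everything would be: sys.path_importer_cache.clear()
```

The stale lookup in the middle is the cost of the cache: files added after startup stay invisible until one of the three invalidation routes is taken.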