[Python-3000] Pre-PEP on fast imports

Thu Jun 14 06:01:43 CEST 2007

On 6/12/07, Giovanni Bajo <rasky at develer.com> wrote:
>
> On 6/12/2007 6:30 PM, Phillip J. Eby wrote:
>
> >>      import imp, os, sys
> >>      from pkgutil import ImpImporter
> >>
> >>      suffixes = set(ext for ext,mode,typ in imp.get_suffixes())
> >>
> >>      class CachedImporter(ImpImporter):
> >>          def __init__(self, path):
> >>              if not os.path.isdir(path):
> >>                  raise ImportError("Not an existing directory")
> >>              super(CachedImporter, self).__init__(path)
> >>              self.refresh()
> >>
> >>          def refresh(self):
> >>              self.cache = set()
> >>              for fname in os.listdir(self.path):
> >>                  base, ext = os.path.splitext(fname)
> >>                  if ext in suffixes and '.' not in base:
> >>                      self.cache.add(base)
> >>
> >>          def find_module(self, fullname, path=None):
> >>              if fullname.split(".")[-1] not in self.cache:
> >>                  return None  # no need to check further
> >>              return super(CachedImporter, self).find_module(
> >>                  fullname, path)
> >>
> >>      sys.path_hooks.append(CachedImporter)
> >
> > After a bit of reflection, it seems the refresh() method needs to be a
> > bit different:
> >
> >           def refresh(self):
> >               cache = set()
> >               for fname in os.listdir(self.path):
> >                   base, ext = os.path.splitext(fname)
> >                   if not ext or (ext in suffixes and '.' not in base):
> >                       cache.add(base)
> >               self.cache = cache
> >
> > This version fixes two problems: first, a race condition could occur if
> > you called refresh() while an import was taking place in another
> > thread.  This version fixes that by only updating self.cache after the
> > new cache is completely built.
> >
> > Second, the old version didn't handle packages at all.  This version
> > handles them by treating extension-less filenames as possible package
> > directories.  I originally thought this should check for a subdirectory
> > and __init__, but this could get very expensive if a sys.path directory
> > has a lot of subdirectories (whether or not they're packages).  Having
> > false positives in the cache (i.e. names that can't actually be
> > imported) could slow things down a bit, but *only* if those names match
> > something you're trying to import.  Thus, it seems like a reasonable
> > trade-off versus needing to scan every subdirectory at startup or even
> > to check whether all those names *are* subdirectories.
>
> There are a couple of other things I'll fix as soon as I try it. First is
> that I'd call refresh() lazily on the first find_module because I don't
> want to listdir() directories on sys.path that will never be accessed.
>
> The idea of using sys.path_hooks is very clever (I hadn't thought of
> it... because I didn't know of path_hooks in the first place! It appears
> to be undocumented and sparsely indexed by google as well), and it will
> probably help me a lot in my task of fixing this problem in the 2.x series.

PEP 302 documents all of this, but unfortunately that material never made it
into the official docs.
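
For anyone who has not met the mechanism before, here is a rough sketch of how a
path hook gets consulted. PEP 302 says each entry in sys.path_hooks is a
callable that takes a sys.path item and either returns an importer or raises
ImportError to decline, and that the result is remembered in
sys.path_importer_cache. The names MARKER and DemoImporter are made up for the
illustration, and pkgutil.get_importer() is used only as a convenient way to
trigger the lookup without importing anything.

```python
import sys
import pkgutil

# Hypothetical path entry, invented for this illustration only.
MARKER = "hypothetical-cache-entry"


class DemoImporter(object):
    """Made-up importer that claims exactly one path entry."""

    def __init__(self, path):
        # A path hook declines an entry by raising ImportError.
        if path != MARKER:
            raise ImportError("not handled by DemoImporter")
        self.path = path

    def find_module(self, fullname, path=None):
        # PEP 302 finder method; this demo never finds anything.
        return None


# Register the hook ahead of the existing ones, then look up the entry.
sys.path_hooks.insert(0, DemoImporter)
try:
    importer = pkgutil.get_importer(MARKER)
    cached = sys.path_importer_cache.get(MARKER)
finally:
    # Undo the changes so the demo leaves the import system untouched.
    sys.path_hooks.remove(DemoImporter)
    sys.path_importer_cache.pop(MARKER, None)
```

The hook is called once for the entry and the importer it returns is the same
object that ends up in sys.path_importer_cache, which is why a
per-directory cache can simply live on the importer instance.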

I also have some pseudocode of how import (roughly) works at
sandbox/trunk/import_in_py/pseudocode.py .
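
The lazy refresh() that Giovanni describes above could look roughly like the
sketch below. It is only an illustration under some assumptions: LazyDirCache
and may_contain() are hypothetical names, the class is standalone rather than a
subclass of ImpImporter, and importlib.machinery.all_suffixes() stands in for
the imp.get_suffixes() call in the quoted code. The cache is still built in a
local variable and assigned at the end, as in Phillip's revised refresh().

```python
import os
import importlib.machinery

# Assumption: all_suffixes() replaces imp.get_suffixes() from the quoted code.
SUFFIXES = set(importlib.machinery.all_suffixes())


class LazyDirCache(object):
    """Hypothetical helper caching candidate module names for one directory."""

    def __init__(self, path):
        self.path = path
        self.cache = None  # nothing is listed until the first lookup

    def refresh(self):
        cache = set()
        for fname in os.listdir(self.path):
            base, ext = os.path.splitext(fname)
            # Extension-less names are kept as possible package directories.
            if not ext or (ext in SUFFIXES and '.' not in base):
                cache.add(base)
        # Assign only once the new set is complete.
        self.cache = cache

    def may_contain(self, fullname):
        # Build the cache on first use, so unused directories are never listed.
        if self.cache is None:
            self.refresh()
        return fullname.split(".")[-1] in self.cache
```

A find_module() built on this would return None straight away whenever
may_contain() is false and fall through to the normal lookup otherwise.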

-Brett