[Python-3000] Pre-PEP on fast imports

Giovanni Bajo rasky at develer.com
Tue Jun 12 18:40:01 CEST 2007


On 6/12/2007 6:30 PM, Phillip J. Eby wrote:

>>      import imp, os, sys
>>      from pkgutil import ImpImporter
>>
>>      suffixes = set(ext for ext,mode,typ in imp.get_suffixes())
>>
>>      class CachedImporter(ImpImporter):
>>          def __init__(self, path):
>>              if not os.path.isdir(path):
>>                  raise ImportError("Not an existing directory")
>>              super(CachedImporter, self).__init__(path)
>>              self.refresh()
>>
>>          def refresh(self):
>>              self.cache = set()
>>              for fname in os.listdir(self.path):
>>                  base, ext = os.path.splitext(fname)
>>                  if ext in suffixes and '.' not in base:
>>                      self.cache.add(base)
>>
>>          def find_module(self, fullname, path=None):
>>              if fullname.split(".")[-1] not in self.cache:
>>                  return None  # no need to check further
>>              return super(CachedImporter, self).find_module(
>>                  fullname, path)
>>
>>      sys.path_hooks.append(CachedImporter)
> 
> After a bit of reflection, it seems the refresh() method needs to be a 
> bit different:
> 
>           def refresh(self):
>               cache = set()
>               for fname in os.listdir(self.path):
>                   base, ext = os.path.splitext(fname)
>                   if not ext or (ext in suffixes and '.' not in base):
>                       cache.add(base)
>               self.cache = cache
> 
> This version fixes two problems: first, a race condition could occur if 
> you called refresh() while an import was taking place in another 
> thread.  This version fixes that by only updating self.cache after the 
> new cache is completely built.
> 
> Second, the old version didn't handle packages at all.  This version 
> handles them by treating extension-less filenames as possible package 
> directories.  I originally thought this should check for a subdirectory 
> and __init__, but this could get very expensive if a sys.path directory 
> has a lot of subdirectories (whether or not they're packages).  Having 
> false positives in the cache (i.e. names that can't actually be 
> imported) could slow things down a bit, but *only* if those names match 
> something you're trying to import.  Thus, it seems like a reasonable 
> trade-off versus needing to scan every subdirectory at startup or even 
> to check whether all those names *are* subdirectories.

There are a couple of other things I'd change as soon as I try it. The
first is that I'd call refresh() lazily, on the first find_module()
call, because I don't want to listdir() directories on sys.path that
will never be accessed.
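
A minimal sketch of that lazy variant might look like the following. It
only shows the caching part, not a full importer: LazyDirCache,
may_contain() and the listdir_calls counter are hypothetical names made
up for illustration, and importlib.machinery.all_suffixes() is used as
an assumed stand-in for imp.get_suffixes() so that the snippet also runs
on interpreters that no longer ship the imp module.

```python
import os
from importlib.machinery import all_suffixes

# Assumption: all_suffixes() plays the role of imp.get_suffixes() here.
SUFFIXES = set(all_suffixes())


class LazyDirCache(object):
    """Hypothetical helper: candidate module names found in one directory."""

    def __init__(self, path):
        self.path = path
        self.cache = None        # None means "directory not listed yet"
        self.listdir_calls = 0   # demonstration only: counts directory scans

    def refresh(self):
        # Build the new set completely before publishing it, so a lookup
        # running in another thread never sees a half-filled cache.
        cache = set()
        for fname in os.listdir(self.path):
            base, ext = os.path.splitext(fname)
            # Extension-less names are kept as possible package directories.
            if not ext or (ext in SUFFIXES and '.' not in base):
                cache.add(base)
        self.listdir_calls += 1
        self.cache = cache

    def may_contain(self, fullname):
        # Lazy part: the directory is listed on the first lookup only.
        if self.cache is None:
            self.refresh()
        return fullname.split(".")[-1] in self.cache
```

With this shape, find_module() would return None whenever
may_contain(fullname) is false, and a sys.path directory that is never
searched is never listed at all.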

The idea of using sys.path_hooks is very clever (I hadn't thought of
it... because I didn't know about path_hooks in the first place! It
appears to be undocumented and sparsely indexed by Google as well), and
it will probably help me a lot in my task of fixing this problem in the
2.x series.
-- 
Giovanni Bajo


