[Distutils] PEP 376 - site-directories and site.addsitedir

P.J. Eby pje at telecommunity.com
Thu May 14 18:33:10 CEST 2009


At 12:03 PM 5/14/2009 +0200, Tarek Ziadé wrote:
>Hello
>
>for PEP 376, I have one last fuzzy point.
>
>http://svn.python.org/view/peps/trunk/pep-0376.txt?view=markup
>
>The "get_egg_info" API is currently based on scanning the whole
>sys.path. Since people can add entries to sys.path, the scan is
>linear in the number of paths and can slow down when there are many.
>
>I have a proposal: let's restrict the search for this API to
>site-package directories only. (directories added with
>site.addsitedir)
>
>People will be able to add any directory this way (like the per-user
>site-package directory - http://www.python.org/dev/peps/pep-0370)
>
>This requires adding a registry in site.py to keep track of all
>directories added through site.addsitedir
>
>Any thoughts ?

What tradeoffs are you optimizing for?  Note that a single scan of 
every directory on sys.path is exactly what happens when an import 
doesn't find its target until the *last* directory on sys.path.  So 
this is not really a big deal if you're only doing it *once*.

If you want to optimize for repeated searches, the best way to do 
this is with a structure like pkg_resources' WorkingSet object - it 
simply reads the directories once and makes an object for each 
installed package.  These objects don't do any further I/O, so really 
we're just talking about caching a list of .egg-info filenames.

Each object in the set can be queried for its metadata -- in which 
case it reads it exactly once, and caches it.
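
As a rough illustration of the scan-once, read-lazily idea (this is not 
pkg_resources' actual code), here is a minimal sketch.  The names DistInfo 
and scan_path are hypothetical, and it assumes the metadata is either an 
.egg-info file or a PKG-INFO file inside an .egg-info directory, with the 
project name being everything before the first "-" in the filename:

```python
import os


class DistInfo:
    """Hypothetical per-package object; not the pkg_resources API."""

    def __init__(self, location):
        self.location = location   # path to the .egg-info file or directory
        self._metadata = None      # filled in on first access
        self.reads = 0             # counts real file reads, for illustration

    @property
    def name(self):
        # Derived from the filename alone, so no further I/O is needed.
        base = os.path.basename(self.location)
        return base[:-len(".egg-info")].split("-")[0]

    @property
    def metadata(self):
        # Lazy and cached: the file is read at most once per object.
        if self._metadata is None:
            path = self.location
            if os.path.isdir(path):
                path = os.path.join(path, "PKG-INFO")
            with open(path, encoding="utf-8") as f:
                self._metadata = f.read()
            self.reads += 1
        return self._metadata


def scan_path(paths):
    """List each directory once; build one DistInfo per .egg-info entry."""
    found = {}
    for entry in paths:
        if not os.path.isdir(entry):
            continue
        for fn in sorted(os.listdir(entry)):
            if fn.endswith(".egg-info"):
                dist = DistInfo(os.path.join(entry, fn))
                # First match wins, mirroring sys.path precedence.
                found.setdefault(dist.name, dist)
    return found
```

Calling scan_path(sys.path) once gives a dict of objects; after that, 
asking any of them for .metadata touches the disk only the first time.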

With this setup, the full directory scan is only ever done once -- 
and it's basically equivalent to adding an extra import at the time 
you first import the metadata management module.

Yes, it does mean a global cache, unless you want to hand off cache 
management to the application.  But the way pkg_resources does it, 
with WorkingSet and Distribution objects, allows an app with special 
needs to do its own path management and search operations.
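
To make the global-versus-application tradeoff concrete, here is a small 
sketch under stated assumptions: DistCache is a hypothetical class (not 
WorkingSet itself), and the get_egg_info signature shown is only a guess 
at what the PEP's API might look like.  The module keeps one shared cache 
for the simple case, while an application can build its own instance over 
whatever directories it likes:

```python
import os
import sys


class DistCache:
    """Hypothetical cache of .egg-info filenames for a list of paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._entries = None   # built on first lookup
        self.scans = 0         # number of full scans, for illustration

    def _scan(self):
        entries = {}
        for directory in self.paths:
            try:
                filenames = os.listdir(directory)
            except OSError:
                continue       # not a directory, missing, or unreadable
            for fn in filenames:
                if fn.endswith(".egg-info"):
                    name = fn[:-len(".egg-info")].split("-")[0]
                    # First match wins, mirroring sys.path precedence.
                    entries.setdefault(name, os.path.join(directory, fn))
        self.scans += 1
        return entries

    def find(self, name):
        if self._entries is None:
            self._entries = self._scan()
        return self._entries.get(name)


_default_cache = None   # the module-level global discussed above


def get_egg_info(name):
    """Simple case: look up `name` in a shared cache built from sys.path."""
    global _default_cache
    if _default_cache is None:
        _default_cache = DistCache(sys.path)
    return _default_cache.find(name)
```

An app with its own plugin directories would skip the global entirely and 
call DistCache(my_dirs).find(name), managing the cache's lifetime itself.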

IOW, this approach keeps simple things simple, and leaves complex 
things possible.  It also does less I/O than what you're proposing, 
since in the normal case the directories are only ever searched once, 
and the actual metadata reads are both lazy and cached.

Note, too, that site-packages dirs are likely to contain more packages 
than other directories, which means you're not necessarily 
saving much I/O to start with, and even that small savings evaporates 
as soon as you do more than one lookup for plugins.


