Re: [Distutils] PEP 376 - site-directories and site.addsitedir
At 12:03 PM 5/14/2009 +0200, Tarek Ziadé wrote:
Hello
for PEP 376, I have one last fuzzy point.
http://svn.python.org/view/peps/trunk/pep-0376.txt?view=markup
The "get_egg_info" API is currently based on scanning the whole of sys.path. Since sys.path can be modified by anyone, the algorithm is linear in the number of entries and can slow down when there are a lot of paths.
I have a proposal: let's restrict the search for this API to site-package directories only. (directories added with site.addsitedir)
People will be able to add any directory (like the per-user site-packages directory - http://www.python.org/dev/peps/pep-0370)
This requires adding a registry in site.py to keep track of all directories added through site.addsitedir
Any thoughts ?
What tradeoffs are you optimizing for? Note that a single scan of every directory on sys.path is exactly what happens when an import doesn't find its target until the *last* directory on sys.path. So this is not really a big deal if you're only doing it *once*.

If you want to optimize for repeated searches, the best way to do this is with a structure like pkg_resources' WorkingSet object - it simply reads the directories once and makes an object for each installed package. These objects don't do any further I/O, so really we're just talking about caching a list of .egg-info filenames. Each object in the set can be queried for its metadata -- in which case it reads it exactly once, and caches it.

With this setup, the full directory scan is only ever done once -- and it's basically equivalent to adding an extra import at the time you first import the metadata management module. Yes, it does mean a global, unless you want to hand off cache management to the application. But the way pkg_resources does it, with WorkingSet and Distribution objects, allows an app with special needs to do its own path management and search operations.

IOW, this approach keeps simple things simple, and leaves complex things possible. It also does less I/O than what you're proposing, since in the normal case the directories are only ever searched once, and the actual metadata reads are both lazy and cached.

Note, too, that site-packages dirs are likely to have more packages on them than other directories, which means you're not necessarily saving much I/O to start with, and even that small savings evaporates as soon as you do more than one lookup for plugins.
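The scan-once, read-lazily scheme described above can be sketched roughly as follows. This is a simplified illustration, not the pkg_resources implementation: the class names are borrowed from it, but the attributes, the get_working_set helper and the way the project name is derived from the .egg-info filename are assumptions made for the example.

```python
import os
import sys


class Distribution:
    """One installed project, identified by its .egg-info path (simplified)."""

    def __init__(self, path):
        self.path = path
        stem = os.path.basename(path)[:-len(".egg-info")]
        # Assumption: the project name is everything before the first "-".
        self.name = stem.split("-", 1)[0]
        self._metadata = None  # filled in on first access

    @property
    def metadata(self):
        # Lazy and cached: the file is read at most once per object.
        if self._metadata is None:
            pkg_info = self.path
            if os.path.isdir(pkg_info):
                pkg_info = os.path.join(pkg_info, "PKG-INFO")
            with open(pkg_info, encoding="utf-8", errors="replace") as f:
                self._metadata = f.read()
        return self._metadata


class WorkingSet:
    """Scan each directory once and keep one Distribution per project."""

    def __init__(self, paths=None):
        self.by_name = {}
        for directory in (sys.path if paths is None else paths):
            try:
                entries = os.listdir(directory or ".")
            except OSError:
                continue  # missing directories, zip files, and so on
            for entry in sorted(entries):
                if entry.endswith(".egg-info"):
                    dist = Distribution(os.path.join(directory, entry))
                    # First match wins, mirroring sys.path precedence.
                    self.by_name.setdefault(dist.name, dist)

    def get(self, name):
        return self.by_name.get(name)


_working_set = None  # the global cache mentioned above


def get_working_set():
    """Build the default working set on first use, then reuse it."""
    global _working_set
    if _working_set is None:
        _working_set = WorkingSet()
    return _working_set
```

An application with special needs can skip the global and build its own WorkingSet(paths) over whatever directories it manages.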
2009/5/14 P.J. Eby <pje@telecommunity.com>:
IOW, this approach keeps simple things simple, and leaves complex things possible. It also does less I/O than what you're proposing, since in the normal case the directories are only ever searched once, and the actual metadata reads are both lazy and cached.
Makes a lot of sense yes. I think I'll just start a prototype for that code, propose 376 on python-dev and see.
participants (2): P.J. Eby, Tarek Ziadé