Initial auto-installation support

I've added some very initial support for automatic installation of packages in Paste, using easy_install (http://peak.telecommunity.com/DevCenter/EasyInstall). In a configuration file you can put:

    use_package(package_name, pkg_download_url=None)

(There are a couple of other options, but we'll ignore those.) It will look for the named package, and if it isn't found it will install it (generally into app-packages).

This still doesn't work for all the packages that build-pkg.py fetches; specifically, flup and Component don't have setup.py files, and PySourceColor.py is just a bare Python module. I'd like this system to work in spite of that, but it might also make sense to just fix all of those. (And actually I'm not super-enthusiastic about PySourceColor, so if anyone has opinions on a better source colorizer I'm open.)

Right now there are URLs for some common packages in paste/default_config.conf (package_urls) -- wsgiutils, ZPTKit, ZopePageTemplates, scgi, and SQLObject. Ultimately this data really belongs in PyPI, so that dictionary of URLs is just transitional.

There's another aspect to Paste installation: some packages (plugins) need to write things into Paste. I'm not sure quite how that will work -- maybe use_package() will see if there's a paste_install module in the package somewhere, and call that somehow.

But besides that, this should work now for any package with a distutils install, so long as the package is reasonably well behaved. Hrm... except setuptools 0.3a2 doesn't have SourceForge download support, but 0.3a3 does, and I think PJE will release that soon. Eh, I could come up with a bunch of other caveats too... this stuff is still rough, but feedback is important.

--
Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
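As a concrete illustration, a configuration entry might look like the following. This is only a sketch based on the signature above; SQLObject is one of the package_urls examples, and the URL here is a placeholder, not a real download location:

    # A minimal sketch: install SQLObject from the given URL if it
    # isn't already available. The URL is a placeholder.
    use_package('SQLObject',
                pkg_download_url='http://example.com/SQLObject.tar.gz')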

At 04:10 PM 5/30/2005 -0500, Ian Bicking wrote:
FYI, just so you know, your implementation won't handle nested dependencies, nor allow specifying optional features of requested packages. I'd suggest that in a later version, you take a look at subclassing pkg_resources.AvailableDistributions, and overriding the 'obtain()' method to do the search and installation. In this way, your 'obtain()' method will get called for any dependencies that the originally-requested package requires.

To see what I mean, take a look at the source of pkg_resources.require(), which looks like this:

    requirements = parse_requirements(requirements)
    for dist in AvailableDistributions().resolve(requirements):
        dist.install_on(sys.path)

So, if you subclass AvailableDistributions to define an 'obtain()' method, and then write a similar loop using that subclass, you'll be able to cleanly integrate the auto-download in a forward-compatible way for packages that declare dependencies in their PackageName.egg-info.
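To make that concrete, here is a minimal sketch of the subclass-and-loop idea, built from the require() source quoted above. The obtain() signature and the download_and_install() stub are assumptions for illustration, not part of the pkg_resources API:

    import sys
    from pkg_resources import AvailableDistributions, parse_requirements

    def download_and_install(requirement):
        # Placeholder: real code would search PyPI or known URLs,
        # install the package, and return a Distribution (or None).
        raise NotImplementedError

    class AutoInstallDistributions(AvailableDistributions):
        def obtain(self, requirement):
            # Called by resolve() for any requirement it cannot
            # satisfy, including nested dependencies.
            return download_and_install(requirement)

    def use_packages(requirement_strings):
        requirements = parse_requirements(requirement_strings)
        for dist in AutoInstallDistributions().resolve(requirements):
            dist.install_on(sys.path)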
That might be a good candidate for egg metadata; if a package is a Paste plugin, you could require a 'paste-install' file in the package's EGG-INFO. Note that the AvailableDistributions().resolve() method returns a list of Distribution objects, and distribution objects have a 'metadata' attribute that implements 'IMetadataProvider'. So, 'aDistribution.metadata.has_metadata("paste-install")' will tell you if that distribution has a "paste-install" file in its EGG-INFO, and the get_metadata() method will give it to you as a string. Thus, you can do something like:

    for dist in MyDownloadDistributions().resolve(requirements):
        dist.install_on(sys.path)
        if dist.metadata.has_metadata('paste-install'):
            doSomething(dist.metadata.get_metadata('paste-install'))

If you use this algorithm, it will work for all egg varieties: compressed, uncompressed, and "development" eggs. You will, however, need to keep track of which paste-install scripts you've already processed, because 'resolve()' can yield distributions that are already present on sys.path. Also, you may want to create and cache a single instance of your MyDownloadDistributions class, because creating one does a lot of filesystem stats and listdirs and such.

Of course, if you have people doing TheirPackage.egg-info/paste-install, they can also create TheirPackage.egg-info/depends.txt, and list all their dependencies there, with no need to use 'use_package'. It won't help with download URLs, though. But we could perhaps define an EGG-INFO/download_urls.txt as a stopgap, that lists known download urls... Anyway, as you can see, eggs were definitely designed with plugin systems like this in mind. :)
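A small extension of that loop shows the bookkeeping just mentioned: resolve() can yield distributions already on sys.path, so remember which paste-install scripts have been processed. MyDownloadDistributions is the class named above; the identity key and the run_paste_install argument are illustrative assumptions:

    import sys

    def install_and_register(requirements, run_paste_install):
        finder = MyDownloadDistributions()  # create once and cache it
        processed = {}
        for dist in finder.resolve(requirements):
            dist.install_on(sys.path)
            if dist.metadata.has_metadata('paste-install'):
                key = (dist.name, dist.version)  # assumed identity key
                if key not in processed:
                    processed[key] = True
                    run_paste_install(
                        dist.metadata.get_metadata('paste-install'))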
Hopefully within the next 24 hours or so. It will also include sandboxing support (automatically aborts the install if the package tries to write to the filesystem outside the build directory), and lots of workarounds to support various packages out there that have quirky install_data subclasses. Those items are already done, but items still on my to-do list include:

* help message to explain how to use require() for multi-version/instdir installs

* a --build-dir/-b option to set the build directory, that will leave the downloaded package and its extracted contents in place after the installation is complete. (So you can read docs, install scripts, or debug a failed installation.)

And I'd like to do something about scripts, but I think that's going to get left to an 0.4a1 release, assuming there are no further bug fix releases needed in the 0.3 line.

0.3a3 is now released, with a new --build-dir option, sandboxing, more package workarounds, SourceForge mirror selection, and "installation reports". See: http://peak.telecommunity.com/DevCenter/EasyInstall#release-notes-change-his... for more details.

I'm thinking that adding automatic package location via PyPI is probably pretty doable now, by the way. My plan is to create a PackageFinder class (subclassing AvailableDistributions) whose obtain() method searches for the desired package on PyPI, keeping a cache of URLs it has already seen. (It would also accept a callback argument that it would use to create Installer objects when it needs to install packages.) The command-line tool (easy_install.main) would create a PackageFinder with an interactive installation callback, and in the main loop it would pass it to each new Installer instance. The Installer would then use it whenever it gets a non-file, non-URL command line option, and use it to resolve() such requests.

The PackageFinder.obtain() method would go to the PyPI base URL followed by the desired distribution name, e.g. 'http://www.python.org/pypi/SQLObject', and then scrape the page to see if it is a multi-version page or a single-version page. If it's multi-version, it would scrape the version links and select the highest-numbered version that meets all of your criteria. Once it has a single-version page, it would look for a download URL, and see if its filename is that of an archive (.egg, .tar, .tgz, etc.) or if the URL is for Subversion. If so, we assume it's the right thing and invoke the callback to do the install. If not, then we follow the link anyway, and scrape for links to archives, checking versions when we get there if possible. If there's still nothing suitable (or there was no download URL), we apply the same procedure to the homepage URL.

This should suffice to make a significant number of packages available from PyPI with autodownload, and packages with dependencies would also be downloaded, built, and installed. The hardest parts of this aren't in the screen-scraping per se; they're in the heuristics for evaluating whether a specific URL is suitable for download. Many PyPI download URLs are of the form "foopackage-latest.tgz", so it's not possible to determine a usable version number from them, unless I special-case "latest" in the version parser -- which I guess I could do. We also probably need some kind of heuristic to determine which URLs are "better" to try, as we don't want to just run through the links in order.

Hm. You know, what if as an interim step we had the command-line tool just launch a web browser pointing you to PyPI? Getting to a page for a suitable version is easy, so we could then let the user find the right download URL and go back to paste it on the command line. That could be a nice interim addition, although it isn't much of a solution for packages with a lot of uninstalled dependencies. You'd keep getting kicked back to the web browser a lot, and more to the point you'd have to keep restarting the tool. So, ultimately we really need a way to actually find the URLs.

There are going to have to be new options for the tool, too, like a way to set the PyPI URL to use, and a way to specify what sort of package revisions are acceptable (e.g. no alphas, no betas, no snapshots).
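Here is a rough sketch of the obtain() flow described above, written against the Python of the time. The Requirement attribute, the archive-extension test, and the installer callback signature are all assumptions for illustration, not the real EasyInstall implementation:

    import re
    import urllib2
    from pkg_resources import AvailableDistributions

    PYPI_BASE = 'http://www.python.org/pypi/'
    ARCHIVE_LINK = re.compile(
        r'href="([^"]+\.(?:egg|tar\.gz|tgz|tar|zip))"', re.IGNORECASE)

    class PackageFinder(AvailableDistributions):
        def __init__(self, install_callback):
            AvailableDistributions.__init__(self)
            self.install_callback = install_callback
            self.seen_urls = {}  # cache of URLs already examined

        def obtain(self, requirement):
            # Scrape the package's PyPI page for archive-looking links;
            # 'requirement.name' is an assumed attribute here.
            page = urllib2.urlopen(PYPI_BASE + requirement.name).read()
            for url in ARCHIVE_LINK.findall(page):
                if url not in self.seen_urls:
                    self.seen_urls[url] = True
                    # Hand off to the installer; assumed to return a
                    # Distribution on success, or None.
                    return self.install_callback(url)
            return None  # homepage fallback and version checks omitted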

Phillip J. Eby wrote:
Really this would be very easy to add to PyPI as a set of xml-rpc calls, once the server move gets resolved. I could develop it now, except that the last PyPI database dump I have is from before the move to Postgres, and there have been a number of changes since then, and I'd like some test data. I also have a checkout from CVS, but the SF project no longer lists CVS and svn.python.org doesn't have public access yet, so I can't point you to the code. But anyway, if someone on the new or old server can just run pg_dump on that database and email me the results (or a URL to those results) that would be very helpful.

Getting the data without screen-scraping won't instantly give us all the necessary information. But it does contain good information about available versions, what the active version is, and per-version download URLs (which, if nothing else, could be compared against each other to detect non-version-specific URLs).
That's not a very satisfying experience -- the person might as well just download the file at that point. Even with accurate data from PyPI, it's still likely there will be multiple possible URLs. At that point, at least if you are going through the command line, displaying all the URLs (numbered) and asking the user would probably give the user enough information to choose.

--
Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
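A tiny sketch of that numbered prompt, in the Python of the day; choose_url() is illustrative, not part of Paste or EasyInstall:

    def choose_url(urls):
        # Show each candidate URL with a number and let the user pick.
        for i, url in enumerate(urls):
            print '%d) %s' % (i + 1, url)
        choice = int(raw_input('Which URL should be installed? '))
        return urls[choice - 1]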

At 08:47 PM 5/30/2005 -0500, Ian Bicking wrote:
Right; but none of that helps with the real problem (from EasyInstall's perspective), which is that the current incarnation of PyPI doesn't list multiple download URLs for a single release of a specific package. For example, when I release PEAK or PyProtocols, I've been releasing sdists (in two formats) plus a bdist_wininst -- and in the future I'll probably drop the bdist_wininst in favor of eggs. But I can't put any of that info on PyPI, so I just link to my downloads directory - as do 25% of the packages I surveyed in a random sampling last week. In order to get at packages like those, a flexible screen scraper is a must. I agree that PyPI should have better handling of download URLs, but I'm in a lot better position to improve EasyInstall than PyPI.
Tastes differ, I suppose. I'd just right-click the link to copy it, and then alt-tab, ^R, space, ^K, space, shift-insert, ENTER. But then, I've been downloading a lot of packages this weekend, so that sequence is already in my muscle memory. :) Hm. Maybe somebody could create a Firefox extension that runs EasyInstall on a selected link. :)
In which case you might as well be back in the web browser. I'm fine with options being available to fine-tune the selection process, but the criteria can and should be mechanically processed. After all unusable versions, platforms, and archive types are eliminated, the prioritization should be in descending version order, with same-version archives sorted by archive type -- eggs first, everything else second -- since the eggs don't need to be built.

By the way, in all of this there's been no discussion about MD5 signatures or code signing. That's probably because I don't know a whole lot about that subject. :) But I'm certainly interested in hearing from those who do.
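That ordering rule can be expressed as a simple sort key. A minimal sketch, assuming each candidate is an (url, version, is_egg) tuple and using parse_version from pkg_resources for version comparison; the URLs are placeholders:

    from pkg_resources import parse_version

    def candidate_key(candidate):
        url, version, is_egg = candidate
        # Sorting these keys in reverse puts the highest version first,
        # and within a version puts eggs (is_egg == True) first.
        return (parse_version(version), is_egg)

    candidates = [
        ('http://example.com/Foo-1.0.tar.gz', '1.0', False),
        ('http://example.com/Foo-1.1.tar.gz', '1.1', False),
        ('http://example.com/Foo-1.1-py2.4.egg', '1.1', True),
    ]
    candidates.sort(key=candidate_key, reverse=True)
    # Result order: the 1.1 egg, then the 1.1 sdist, then the 1.0 sdist.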
