distlib locator compares reasonably well with pip's PackageFinder
As an extended test of the locator functionality in distlib, I compared the results from the locate() API with the results from pip's PackageFinder. The results seem encouraging: out of 24938 packages tested, only 510 (2%) gave different results from pip. Out of the 510, many are due to:

1. distlib (currently) looks for archives which are machine-independent, and skips archives which contain machine-dependent strings in their names.
2. distlib (currently) skips archives automatically generated by sites like GitHub and BitBucket, since the download archive doesn't always contain version information in the name of the download. Examples:

   http://github.com/user/project/tarball/master
   http://bitbucket.org/user/project/get/tip.zip

In a (much) smaller number of cases, the differences are because locate() found an archive where pip didn't. But for 98% of the projects registered on PyPI, locate() returns identical results to pip's PackageFinder.find_requirement, and does it slightly faster in most cases :-)

The comparison script, along with the list of differences, is at https://gist.github.com/3951634

Regards,
Vinay Sajip
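To make the two skip rules above concrete, here is a minimal sketch of a filename-based filter. It is not distlib's actual code: the marker list, the regular expression and the `classify` helper are all assumptions made for illustration, and `example.com` is a placeholder host.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Hypothetical markers for machine-dependent archives; the list distlib
# really uses is not given in this thread.
MACHINE_DEPENDENT = ("win32", "win-amd64", "linux-x86_64", "linux-i686", "macosx")

# Conventional archive name: name-version.ext, with the version starting
# with a digit. This pattern is an assumption, not distlib's own.
ARCHIVE_RE = re.compile(
    r"^(?P<name>.+?)-(?P<version>\d[\w.]*)\.(?:tar\.gz|tar\.bz2|tgz|zip)$"
)


def classify(url):
    """Return (name, version) for an acceptable archive URL, else None."""
    filename = posixpath.basename(urlsplit(url).path)
    # Rule 1: skip archives whose names contain machine-dependent strings.
    if any(marker in filename.lower() for marker in MACHINE_DEPENDENT):
        return None
    # Rule 2: skip archives whose names carry no version information.
    match = ARCHIVE_RE.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")
```

With this sketch, both example URLs above are rejected because "master" and "tip.zip" contain no version, while a conventionally named archive such as foo-1.0.tar.gz is accepted.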
On Thu, Oct 25, 2012 at 5:38 AM, Vinay Sajip wrote:
2. distlib (currently) skips archives automatically generated by sites like GitHub and BitBucket, since the download archive doesn't always contain version information in the name of the download. Examples:
http://github.com/user/project/tarball/master
http://bitbucket.org/user/project/get/tip.zip
Just as an FYI, are you aware of the #egg=projname-version tagging convention currently in use for such links?
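For readers unfamiliar with the convention, here is a minimal sketch of reading such a tag from a link. The helper name `parse_egg_fragment` is hypothetical, and splitting on the last hyphen is a simplifying assumption (it misreads a hyphenated project name that has no version); this is not how setuptools itself parses the fragment.

```python
from urllib.parse import urlsplit


def parse_egg_fragment(url):
    """Return (project, version) from a '#egg=project-version' fragment.

    Returns None if the URL carries no egg tag. The version is None when
    the tag contains no hyphen at all.
    """
    fragment = urlsplit(url).fragment
    if not fragment.startswith("egg="):
        return None
    label = fragment[len("egg="):]
    # Assumption: the text after the last hyphen is the version.
    name, sep, version = label.rpartition("-")
    if not sep:
        return label, None
    return name, version
```

For example, a link ending in #egg=projname-dev would yield the project "projname" and the version "dev".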
PJ Eby writes:
Just as an FYI, are you aware of the #egg=projname-version tagging convention currently in use for such links?
I wasn't - thanks - still new to this game :-) In the cases I was referring to, the fragment looks like #egg=projname-dev. I still have to look through the 2% where distlib's behaviour is different and check what the reason is.

I'm not sure if we should, as default behaviour, identify such archives as potential downloads, especially as a goal is to do dependency resolution without having to actually download archives and inspect PKG-INFO, run egg-info on them, etc. Of course, this verification could be done during actual installation.

Regards,
Vinay Sajip
On Sat, Oct 27, 2012 at 5:57 AM, Vinay Sajip wrote:
PJ Eby writes: Just as an FYI, are you aware of the #egg=projname-version tagging convention currently in use for such links?
I wasn't - thanks - still new to this game :-) In the cases I was referring to, the fragment looks like #egg=projname-dev.
"dev" is the version, actually. It's a perfectly valid version to setuptools, and parses as a version that's below any commonly-used version. This lets people specify "==dev" to target an in-development version for installation -- usually manually, but sometimes automatically. One might specify, for example "foobar>2.0,==dev" to tell setuptools that if you can't find a released version >2.0, then an in-development version is acceptable.
I'm not sure if we should, as default behaviour, identify such archives as potential downloads,
If they're using a "dev" version, then such a link is automatically lower-precedence than anything else already, due to it being the lowest available version. (Newer tools could treat it as 0dev or whatever the official translation/suggestion is.) In addition, it denotes a "non-stable" version, so if the tool allows one to prioritize stable versions, it'll be eliminated as a candidate anyway in that case. If the version tag is precise, OTOH, (i.e., something other than 'dev'), then presumably the provider of the link can be trusted to have identified what version it is. IIUC, those source control sites let you download tarballs of arbitrary versions, so one could in principle issue download releases of exact source snapshots. (Indeed, it's not a bad way to go about it.)
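The precedence argument can be sketched as follows. This is a deliberately simplified model, not setuptools' version parsing: it only understands dotted numeric versions and the literal string "dev", and the names `sort_key` and `pick` are hypothetical.

```python
def sort_key(version):
    """Order versions so that 'dev' sorts below every numeric release."""
    if version == "dev":
        return (0, ())
    # Assumption: every other version is purely dotted-numeric, e.g. "2.1".
    return (1, tuple(int(part) for part in version.split(".")))


def pick(candidates, stable_only=False):
    """Return the highest-precedence version string, or None if none qualify."""
    if stable_only:
        # A tool that prioritizes stable versions drops "dev" candidates.
        candidates = [v for v in candidates if v != "dev"]
    return max(candidates, key=sort_key, default=None)
```

Under this model a "dev" link only wins when it is the sole candidate, and it is never chosen when stable versions are required.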
PJ Eby writes:
"dev" is the version, actually. It's a perfectly valid version to setuptools, and parses as a version that's below any commonly-used version. This lets people specify "==dev" to target an in-development version for installation -- usually manually, but sometimes automatically. One might specify, for example "foobar>2.0,==dev" to tell setuptools that if you can't find a released version >2.0, then an in-development version is acceptable.
I get that, but the question is, should such a version (which could be completely different tomorrow) appear by default e.g. when doing dependency resolution, where no ==dev is likely to be specified?
If they're using a "dev" version, then such a link is automatically lower-precedence than anything else already, due to it being the lowest available version. (Newer tools could treat it as 0dev or whatever the official translation/suggestion is.) In addition, it denotes a "non-stable" version, so if the tool allows one to prioritize stable versions, it'll be eliminated as a candidate anyway in that case.
In the failure cases I was noting earlier, there were no other versions. I'm looking at the locate() mechanism not just as invoked directly from a user request to install (where ==dev might well be specified), but also via attempting to resolve dependencies - when it's not clear whether a version such as "dev" should really be included.
If the version tag is precise, OTOH, (i.e., something other than 'dev'), then presumably the provider of the link can be trusted to have identified what version it is. IIUC, those source control sites let you download tarballs of arbitrary versions, so one could in principle issue download releases of exact source snapshots. (Indeed, it's not a bad way to go about it.)
In cases where the URL specifies an archive named in the conventional way (name-version.ext), then these are currently picked up, even from DVCS hosting sites. What are not picked up are links which may contain the version number, but not necessarily using strictly defined conventions. In such cases, I would prefer not to add too many regex-style checks for various schemes which might be in use, but just provide a minimal way for users who need it to slot in these schemes via subclassing. After all, with the current setup, distlib is picking the same download URL as pip for 98% of cases.

N.B. with the #egg=name-version scheme, as the fragment portion is not sent to the server, there's no way to expect that the server will always return a specific version (as no version is specified in the URL which goes to the server). That problem doesn't arise when you ask for a resource which has a specific project name and version in the name of the resource itself.

Regards,
Vinay Sajip
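The point about the fragment can be illustrated with the standard library. The helper name `request_url_and_label` is hypothetical, and the tagged URL simply combines an example link and fragment mentioned earlier in this thread.

```python
from urllib.parse import urldefrag


def request_url_and_label(url):
    """Split a link into the URL sent to the server and the client-side fragment."""
    # urldefrag returns the URL without its fragment, plus the fragment itself.
    request_url, fragment = urldefrag(url)
    return request_url, fragment
```

For http://github.com/user/project/tarball/master#egg=projname-dev, the server only ever sees http://github.com/user/project/tarball/master; the egg=projname-dev label stays on the client.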
On Sat, Oct 27, 2012 at 7:08 PM, Vinay Sajip wrote:
N.B. with the #egg=name-version scheme, as the fragment portion is not sent to the server, there's no way to expect that the server will always return a specific version (as no version is specified in the URL which goes to the server).
That's just my point: version control sites (even ViewSVN) let you specify the version in the URL, just not in the filename portion. So it's quite valid to create a link to download a specifically designated version by specifying a revision control version, and then indicate the project name and version in the fragment identifier. IOW, the whole point of the fragment tag is to let you label a link that you know is to a specific version of a package, despite the filename portion not being explicit or unambiguous as to project name and version.
What are not picked up are links which may contain the version number, but not necessarily using strictly defined conventions. In such cases, I would prefer not to add too many regex-style checks for various schemes which might be in use, but just provide a minimal way for users who need it to slot in these schemes via subclassing.
What I'm pointing out is that the existing #egg system provides a way to make links' project/version info unambiguous, so that you don't *need* lots of regex-based schemes, and people can just explicitly label their packages, even if due to hosting requirements they can't actually name their distributions according to the official scheme. (And this means that people won't need to be constantly writing plugins and subclasses to handle ever-more obscure hosting conventions.) In the future, of course, it might make sense to replace #egg with something like #pydist as the blessed syntax, while still processing #egg for backward-compatibility.