[Python-Dev] setuptools: past, present, future
Phillip J. Eby
pje at telecommunity.com
Sun Apr 23 03:04:57 CEST 2006
At 08:12 PM 4/22/2006 -0400, Terry Reedy wrote:
>If my premises above are mistaken, then the suggestions should be modified
>or discarded. However, I don't see how they conflict at all with a
>consumer rating system.
My point was simply that providing rapid, visible feedback to authors
results in a shorter feedback loop with less infrastructure.
Also, after thinking it over, it's clear that the spidering is never going
to be removed entirely, because there are lots of valid use cases for
people effectively setting up their own mini-indexes. All that will
happen is that at some point I'll be able to stop adding
heuristics. (Hopefully that point is already past, in fact.)
For anyone who wants to know how the current heuristics work, EasyInstall
actually has only a few main categories of heuristics for finding packages:
* Ones that apply to PyPI
* Ones that apply to SourceForge
* Ones that interpret distutils-generated filenames
* The one that detects when a page is really a Subversion directory, and
thus should be checked out instead of downloaded
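The third category, interpreting distutils-generated filenames, can be sketched roughly as follows. This is a simplified illustration of the idea, not setuptools' actual parser; the function name, extension list, and regex are mine:

```python
import re

# Distutils sdists are conventionally named "<name>-<version><ext>".
# A simplified sketch of the filename heuristic -- NOT setuptools' code.
ARCHIVE_EXTS = ('.tar.gz', '.tar.bz2', '.tgz', '.zip')

def parse_dist_filename(filename):
    """Guess (project, version) from a distutils-style filename, else None."""
    for ext in ARCHIVE_EXTS:
        if filename.endswith(ext):
            base = filename[:-len(ext)]
            break
    else:
        return None  # not an archive type we recognize
    # Split at the last hyphen that is followed by a digit, since version
    # numbers conventionally start with one; the greedy (.+) pushes the
    # split as far right as possible ("Foo-Bar-1.0" -> ("Foo-Bar", "1.0")).
    m = re.match(r'^(.+)-(\d.*)$', base)
    if m:
        return m.group(1), m.group(2)
    return None  # looks like an archive, but not "<name>-<version>"
```

The "last hyphen followed by a digit" rule is exactly the kind of guess that works for well-behaved names and breaks on the corner cases mentioned below.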
Most of the SourceForge heuristics have been eliminated already, except for
the translation of prdownloads.sf.net URLs to dl.sourceforge.net URLs, and
automatic retries in the event of a broken mirror.
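The URL translation amounts to a simple host rewrite, since prdownloads URLs point at SourceForge's mirror-picker page rather than at the file itself. A minimal sketch (the function name is mine, and mirror selection and retry logic are simplified away):

```python
# Simplified sketch of the prdownloads.sf.net -> dl.sourceforge.net
# rewrite -- not the exact setuptools code.
def rewrite_sf_url(url):
    """Turn a SourceForge mirror-picker URL into a direct-download URL."""
    return url.replace('//prdownloads.sf.net/', '//dl.sourceforge.net/')
```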
I'm about to begin modifying the PyPI heuristics to use the new XML-RPC
interface instead, for the most part. (Although finding links in a
package's long description will still be easier via the web
interface.) The distutils, meanwhile, haven't generated any new kinds of
filenames lately, although I occasionally run into situations where
non-distutils links or obscure corner cases of distutils filenames cause
problems, or where somebody's filenames look like they came from the
distutils but the contents aren't a valid distutils source distribution.
Anyway, these are the only things that are truly heuristic in the sense
that they are attempts to guess well, and there is always the potential for
failure or obsolescence if PyPI or SourceForge or Subversion changes, or
people do strange things with their links.
I should probably also point out that calling this "spidering" may give the
impression it's more sophisticated than it is. EasyInstall only retrieves
pages that it is explicitly given, or which appear in one of two specific
parts of a PyPI listing. But it *scans* links on any page that it
retrieves, and if a link looks like a downloadable package, it will parse
as much info as practical from the filename in order to catalog it as a
possible download source. So, it will read HTML from PyPI pages, pages
directly linked from PyPI as either "Home" or "Download" URLs, and page
URLs you give to --find-links. But it doesn't "spider" anywhere besides
those pages, unless you count downloading an actual package link. The
whole process resembles a few quick redirects in a browser more than it
does any sort of traditional web spider.
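The scanning step described above could be sketched roughly like this, using the standard library's HTML parser (written in modern Python for brevity; the class name and extension list are illustrative, not EasyInstall's actual code):

```python
from html.parser import HTMLParser

# Extensions that make a link "look like" a downloadable package --
# an illustrative list, not EasyInstall's real one.
PACKAGE_EXTS = ('.tar.gz', '.tar.bz2', '.tgz', '.zip', '.egg')

class PackageLinkScanner(HTMLParser):
    """Collect hrefs on a retrieved page that look like package downloads."""
    def __init__(self):
        super().__init__()
        self.package_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        # Keep only links whose filename part has an archive extension;
        # a real scanner would then parse name/version info out of it.
        if href.lower().endswith(PACKAGE_EXTS):
            self.package_links.append(href)

page = '''<html><body>
<a href="Foo-1.0.tar.gz">source</a>
<a href="docs.html">documentation</a>
<a href="http://example.com/Bar-2.0.zip">Bar</a>
</body></html>'''

scanner = PackageLinkScanner()
scanner.feed(page)
```

Here only the two archive links are collected; the documentation link is ignored rather than followed, which is the sense in which this is scanning rather than spidering.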