[Python-Dev] setuptools: past, present, future
Phillip J. Eby
pje at telecommunity.com
Sun Apr 23 03:04:57 CEST 2006
At 08:12 PM 4/22/2006 -0400, Terry Reedy wrote:
>If my premises above are mistaken, then the suggestions should be modified
>or discarded. However, I don't see how they conflict at all with a
>consumer rating system.
My point was simply that providing rapid, visible feedback to authors
results in a shorter feedback loop with less infrastructure.
Also, after thinking it over, it's clear that the spidering is never going
to be removed entirely, because there are lots of valid use cases for
people effectively setting up their own mini-indexes. All that will
happen is that at some point I'll be able to stop adding
heuristics. (Hopefully that point is already past, in fact.)
For anyone who wants to know how the current heuristics work, EasyInstall
actually has only a few main categories of heuristics for finding packages:
* Ones that apply to PyPI
* Ones that apply to SourceForge
* Ones that interpret distutils-generated filenames
* The one that detects when a page is really a Subversion directory, and
thus should be checked out instead of downloaded
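The third category, interpreting distutils-generated filenames, can be sketched roughly as follows. This is a simplified illustration of the idea, not setuptools' actual parser; the function name, extension list, and regex are mine:

```python
import re

# Distutils sdists are conventionally named "<name>-<version><ext>".
# A simplified sketch of the filename heuristic -- NOT setuptools' code.
ARCHIVE_EXTS = ('.tar.gz', '.tar.bz2', '.tgz', '.zip')

def parse_dist_filename(filename):
    """Guess (project, version) from a distutils-style filename, else None."""
    for ext in ARCHIVE_EXTS:
        if filename.endswith(ext):
            base = filename[:-len(ext)]
            break
    else:
        return None  # not an archive type we recognize
    # Split at the last hyphen that is followed by a digit, since version
    # numbers conventionally start with one; the greedy (.+) pushes the
    # split as far right as possible ("Foo-Bar-1.0" -> ("Foo-Bar", "1.0")).
    m = re.match(r'^(.+)-(\d.*)$', base)
    if m:
        return m.group(1), m.group(2)
    return None  # looks like an archive, but not "<name>-<version>"
```

The "last hyphen followed by a digit" rule is exactly the kind of guess that works for well-behaved names and breaks on the corner cases mentioned below.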
Most of the SourceForge heuristics have been eliminated already, except for
the translation of prdownloads.sf.net URLs to dl.sourceforge.net URLs, and
automatic retries in the event of a broken mirror.
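The URL translation amounts to a simple host rewrite, since prdownloads URLs point at SourceForge's mirror-picker page rather than at the file itself. A minimal sketch (the function name is mine, and mirror selection and retry logic are simplified away):

```python
# Simplified sketch of the prdownloads.sf.net -> dl.sourceforge.net
# rewrite -- not the exact setuptools code.
def rewrite_sf_url(url):
    """Turn a SourceForge mirror-picker URL into a direct-download URL."""
    return url.replace('//prdownloads.sf.net/', '//dl.sourceforge.net/')
```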
I'm about to begin modifying the PyPI heuristics to use the new XML-RPC
interface instead, for the most part. (Although finding links in a
package's long description will still be easier via the web
interface.) The distutils, meanwhile, haven't generated any new kinds of
filenames lately, although I occasionally run into situations where
non-distutils links or obscure corner cases of distutils filenames cause
problems, or where somebody's filenames look like they came from the
distutils but the contents aren't a valid distutils source distribution.
Anyway, these are the only things that are truly heuristic in the sense
that they are attempts to guess well, and there is always the potential for
failure or obsolescence if PyPI or SourceForge or Subversion changes, or
people do strange things with their links.
I should probably also point out that calling this "spidering" may give the
impression it's more sophisticated than it is. EasyInstall only retrieves
pages that it is explicitly given, or which appear in one of two specific
parts of a PyPI listing. But it *scans* links on any page that it
retrieves, and if a link looks like a downloadable package, it will parse
as much info as practical from the filename in order to catalog it as a
possible download source. So, it will read HTML from PyPI pages, pages
directly linked from PyPI as either "Home" or "Download" URLs, and page
URLs you give to --find-links. But it doesn't "spider" anywhere besides
those pages, unless you count downloading an actual package link. The
whole process resembles a few quick redirects in a browser more than it
does any sort of traditional web spider.
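The scanning step described above could be sketched roughly like this, using the standard library's HTML parser (written in modern Python for brevity; the class name and extension list are illustrative, not EasyInstall's actual code):

```python
from html.parser import HTMLParser

# Extensions that make a link "look like" a downloadable package --
# an illustrative list, not EasyInstall's real one.
PACKAGE_EXTS = ('.tar.gz', '.tar.bz2', '.tgz', '.zip', '.egg')

class PackageLinkScanner(HTMLParser):
    """Collect hrefs on a retrieved page that look like package downloads."""
    def __init__(self):
        super().__init__()
        self.package_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        # Keep only links whose filename part has an archive extension;
        # a real scanner would then parse name/version info out of it.
        if href.lower().endswith(PACKAGE_EXTS):
            self.package_links.append(href)

page = '''<html><body>
<a href="Foo-1.0.tar.gz">source</a>
<a href="docs.html">documentation</a>
<a href="http://example.com/Bar-2.0.zip">Bar</a>
</body></html>'''

scanner = PackageLinkScanner()
scanner.feed(page)
```

Here only the two archive links are collected; the documentation link is ignored rather than followed, which is the sense in which this is scanning rather than spidering.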