[Catalog-sig] [Distutils] Specification for package indexes?

Phillip J. Eby pje at telecommunity.com
Fri Jul 7 18:18:32 CEST 2006


At 06:55 AM 7/7/2006 -0400, Jim Fulton wrote:
> From a design perspective:
>
>a. screen scraping is bad

As long as you define "screen scraping" as "dependency on visible 
characteristics of HTML", then I agree.  easy_install shouldn't be relying 
on the visible bits of HTML that it currently uses to scope out PyPI.

Relying on a particular URL layout is not screen-scraping, though, and 
using the URL layout as part of the API has some good properties for ease 
of implementation in static form.  So does using hrefs to obtain link 
information.

What we should be doing is adding non-visible markup (e.g. class="" or 
rel="" attributes) to the links, so that index creators can direct 
easy_install without affecting visible page characteristics.
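
As a rough illustration of the idea (a sketch only, not what easy_install 
currently does), a client could pull the href and rel attributes out of 
each link with regular expressions alone.  The rel value "download" and 
the example.com URLs below are made up for the example; no such values 
have been agreed on.

```python
import re

# Match the attribute text of each <a ...> tag; no XML parser required.
A_TAG = re.compile(r'<a\s+([^>]*)>', re.I)
# Match name="value" or name='value' pairs inside a tag.
ATTR = re.compile(r'''([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')

def tagged_links(html):
    """Yield (href, rel_values) for each <a> tag that has an href."""
    for match in A_TAG.finditer(html):
        attrs = {}
        for name, dq, sq in ATTR.findall(match.group(1)):
            attrs[name.lower()] = dq or sq
        if "href" in attrs:
            # rel may hold several space-separated values, or be absent.
            yield attrs["href"], attrs.get("rel", "").lower().split()

# Hypothetical page: the second link carries an assumed rel="download".
page = '''
<a href="http://example.com/foo/">Home page</a>
<a rel="download" href="http://example.com/foo/dist/">Downloads</a>
'''
links = list(tagged_links(page))
```

The visible link text ("Home page", "Downloads") plays no part in the 
result, so an index author can word the page however they like.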


>b. the web API should be simple and well defined.
>
>I suggest, as others have suggested, that we create an *alternate*
>web API for reading an index focussed on cleanliness and on making
>the API as easy as possible to implement for both index and client
>developers.  If we agree with all of the goals stated above, I think
>this should be a static HTTP interface using XHTML or some other XML
>dialect.   Perhaps we could even use specific HTML class attrs to
>make it possible to combine the pypi and user interfaces if an index
>implementor desires.
>
>Thoughts?

+1 on static pages.  I don't, however, see a reason to require valid 
XML.  Or rather, I don't expect to implement XML parsing in easy_install; 
if the spec is too complex to implement with regular expression matching, 
it's probably too complex for people to throw together an index with what's 
at hand.  In particular, I'd like it to be practical to put together a 
simple index just using Apache's built-in directory indexes, as long as 
they use the right URL hierarchy.  That means that class or rel attributes 
should only be required for links that are requesting non-index pages to be 
spidered.
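
To make that rule concrete, here is a minimal sketch under some assumed 
conventions: links that resolve under the index's own base URL are taken 
as part of the index with no markup needed (as in a plain Apache 
directory listing), while links pointing elsewhere are only queued for 
spidering if they carry a rel attribute.  The function name, the URLs and 
the rel value are all hypothetical.

```python
import re
from urllib.parse import urljoin

A_TAG = re.compile(r'<a\s[^>]*>', re.I)
HREF = re.compile(r'\bhref\s*=\s*"([^"]*)"', re.I)
REL = re.compile(r'\brel\s*=\s*"([^"]*)"', re.I)

def classify(index_url, html):
    """Split a page's links into in-index URLs and external pages to spider."""
    in_index, to_spider = [], []
    for tag in A_TAG.findall(html):
        href = HREF.search(tag)
        if not href:
            continue
        if href.group(1).startswith("?"):
            continue  # skip Apache's column-sort links such as ?C=N;O=D
        url = urljoin(index_url, href.group(1))
        if url.startswith(index_url):
            in_index.append(url)    # part of the index URL hierarchy
        elif REL.search(tag):
            to_spider.append(url)   # external page, explicitly marked
        # unmarked external links (and the parent directory) are ignored
    return in_index, to_spider

# Hypothetical listing, roughly what an Apache directory index might emit,
# plus two hand-added external links.
index_url = "http://example.com/index/"
listing = '''
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="foo/">foo/</a>
<a href="foo-1.0.tar.gz">foo-1.0.tar.gz</a>
<a rel="download" href="http://other.example.org/foo/">more downloads</a>
<a href="http://other.example.org/about.html">about</a>
'''
in_index, to_spider = classify(index_url, listing)
```

Under these assumptions an unmodified directory listing is already a 
usable index, and an author only has to touch the markup when they want 
an off-site page visited.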


