[Distutils] [Catalog-sig] Specification for package indexes?

Jim Fulton jim at zope.com
Fri Jul 7 19:32:42 CEST 2006


On Jul 7, 2006, at 12:18 PM, Phillip J. Eby wrote:

> At 06:55 AM 7/7/2006 -0400, Jim Fulton wrote:
>> From a design perspective:
>>
>> a. screen scraping is bad
>
> As long as you define "screen scraping" as "dependency on visible  
> characteristics of HTML", then I agree.  easy_install shouldn't be  
> relying on the visible bits of HTML that it currently uses to scope  
> out PyPI.

Yup.

> Relying on a particular URL layout is not screen-scraping, though,
> and using the URL layout as part of the API has some good  
> properties for ease of implementation in static form.  So does  
> using href's to obtain link information.

Yes.

> What we should be doing is adding non-visible markup (e.g. class=""  
> or rel="") information to the links to allow index creators to  
> direct easy_install without affecting visible page characteristics.

Yes
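
To make the idea concrete, here is a minimal sketch of a client that reads only href and rel from the links on an index page and ignores everything visible. The rel values "download" and "homepage", the helper names, and the sample page are assumptions made up for illustration; nothing in this thread fixes them.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collect (href, rel tokens) for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return  # named anchors without href are not links
        # rel is a space-separated token list; a missing rel gives ()
        rels = tuple((attr.get("rel") or "").split())
        self.links.append((href, rels))


def collect_links(html_text):
    parser = RelLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def links_to_follow(links, follow_rels=("download", "homepage")):
    """Return hrefs whose rel asks the client to spider a non-index page.

    The follow_rels defaults are hypothetical values, not part of any spec.
    """
    return [href for href, rels in links if any(r in follow_rels for r in rels)]


# Hypothetical index page: the plain link needs no markup at all.
PAGE = """
<html><body>
<a href="/simple/example/">example</a>
<a href="http://example.org/dist/" rel="download">downloads</a>
<a href="http://example.org/" rel="homepage external">home</a>
<a name="top">no href</a>
</body></html>
"""

links = collect_links(PAGE)
to_spider = links_to_follow(links)
```

Links without rel stay ordinary index links, so a bare directory listing would still work; only pages that want off-index spidering need the extra attribute.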

>> b. the web API should be simple and well defined.
>>
>> I suggest, as others have suggested, that we create an *alternate*
>> web API for reading an index focussed on cleanliness and on making
>> the API as easy as possible to implement for both index and client
>> developers.  If we agree with all of the goals stated above, I think
>> this should be static HTTP interface using XHTML or some other XML
>> dialect.   Perhaps we could even use specific HTML class attrs to
>> make it possible to combine the pypi and user interfaces if an index
>> implementor desires.
>>
>> Thoughts?
>
> +1 on static pages.  I don't, however, see a reason to require  
> valid XML.  Or rather, I don't expect to implement XML parsing in  
> easy_install; if the spec is too complex to implement with regular  
> expression matching, it's probably too complex for people to throw  
> together an index with what's at hand.  In particular, I'd like it  
> to be practical to put together a simple index just using Apache's  
> built-in directory indexes, as long as they use the right URL  
> hierarchy.  That means that class or rel attributes should only be  
> required for links that are requesting non-index pages to be spidered.

I would find parsing much easier with an XML parser than with
regular expressions.  I think it would be much more robust too.
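
As a rough illustration of the trade-off, here is a sketch that reads links from a well-formed XHTML index page with a stock XML parser. The page contents and function names are invented for the example, and it assumes the index serves XHTML in the standard XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Standard XHTML namespace, in ElementTree's {uri}tag notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def links_from_xhtml(text):
    """Return (href, rel) pairs for every <a href=...> in an XHTML page."""
    root = ET.fromstring(text)
    return [
        (a.get("href"), a.get("rel"))
        for a in root.iter(XHTML_NS + "a")
        if a.get("href")
    ]


def links_or_none(text):
    """Same as links_from_xhtml, but return None if the page is not XML."""
    try:
        return links_from_xhtml(text)
    except ET.ParseError:
        return None


# Hypothetical well-formed index page.
XHTML_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="example-1.0.tar.gz">example-1.0.tar.gz</a>
    <a href="http://example.org/" rel="homepage">home</a>
  </body>
</html>"""

# Hypothetical tag soup: the unclosed <br> makes this invalid XML.
TAG_SOUP = '<html><body><a href="example-1.0.tar.gz">example<br></a></body></html>'

good = links_from_xhtml(XHTML_PAGE)
bad = links_or_none(TAG_SOUP)
```

The parsing code is short and has no patterns to maintain, but it rejects any page that is not well-formed, so requiring XML puts that burden on whoever generates the index.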

I do want to see something that is well documented and pretty easy to  
implement.

Jim

--
Jim Fulton			mailto:jim at zope.com		Python Powered!
CTO 				(540) 361-1714			http://www.python.org
Zope Corporation	http://www.zope.com		http://www.zope.org




