[Distutils] [Catalog-sig] Specification for package indexes?
Jim Fulton
jim at zope.com
Fri Jul 7 19:32:42 CEST 2006
On Jul 7, 2006, at 12:18 PM, Phillip J. Eby wrote:
> At 06:55 AM 7/7/2006 -0400, Jim Fulton wrote:
>> From a design perspective:
>>
>> a. screen scraping is bad
>
> As long as you define "screen scraping" as "dependency on visible
> characteristics of HTML", then I agree. easy_install shouldn't be
> relying on the visible bits of HTML that it currently uses to scope
> out PyPI.
Yup
> Relying on a particular URL layout is not screen-scraping, though,
> and using the URL layout as part of the API has some good
> properties for ease of implementation in static form. So does
> using hrefs to obtain link information.
Yes.
> What we should be doing is adding non-visible markup (e.g. class=""
> or rel="") information to the links to allow index creators to
> direct easy_install without affecting visible page characteristics.
Yes
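
To illustrate the non-visible markup idea, here is a rough sketch of how a
client might pick out links by their rel attribute while ignoring the visible
link text. The rel values ("homepage", "download"), the sample page, and the
names RelLinkCollector and links_to_follow are assumptions made up for this
example; nothing in this thread fixes them.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collect (href, rel tokens) for every <a> tag, ignoring link text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may carry several space-separated tokens; absent rel means none.
        rels = (attr_map.get("rel") or "").split()
        self.links.append((href, rels))


def links_to_follow(html, wanted=("homepage", "download")):
    """Return hrefs whose rel marks them as pages the client should spider.

    The default rel values are hypothetical, chosen only for illustration.
    """
    collector = RelLinkCollector()
    collector.feed(html)
    return [href for href, rels in collector.links
            if any(r in wanted for r in rels)]


# Hypothetical index page: only the second link asks to be spidered.
SAMPLE = """
<html><body>
<a href="Foo-1.0.tar.gz">Foo 1.0 source</a>
<a href="http://example.com/foo/" rel="homepage">Project home</a>
<a href="http://example.com/about">About this index</a>
</body></html>
"""

if __name__ == "__main__":
    print(links_to_follow(SAMPLE))
```

Because the decision depends only on the rel attribute, an index maintainer
could reword or restyle the visible link text without changing what the
client follows.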
>> b. the web API should be simple and well defined.
>>
>> I suggest, as others have suggested, that we create an *alternate*
>> web API for reading an index focussed on cleanliness and on making
>> the API as easy as possible to implement for both index and client
>> developers. If we agree with all of the goals stated above, I think
>> this should be a static HTTP interface using XHTML or some other XML
>> dialect. Perhaps we could even use specific HTML class attrs to
>> make it possible to combine the pypi and user interfaces if an index
>> implementor desires.
>>
>> Thoughts?
>
> +1 on static pages. I don't, however, see a reason to require
> valid XML. Or rather, I don't expect to implement XML parsing in
> easy_install; if the spec is too complex to implement with regular
> expression matching, it's probably too complex for people to throw
> together an index with what's at hand. In particular, I'd like it
> to be practical to put together a simple index just using Apache's
> built-in directory indexes, as long as they use the right URL
> hierarchy. That means that class or rel attributes should only be
> required for links that are requesting non-index pages to be spidered.
I would find parsing much easier with an XML parser than with
regular expressions. I think it would be much more robust too.

I do want to see something that is well documented and pretty easy to
implement.
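
As a small sketch of the robustness point, the following compares a
deliberately naive regular expression with an XML parser on a well-formed
XHTML page. The page contents and the function names are made up for the
example, and a more careful regular expression could handle these particular
cases. The trade-off also runs the other way: the XML parser will refuse any
page that is not well-formed.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical index page: one plain link, one link whose attributes are
# laid out differently, and one link that has been commented out.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<a href="Foo-1.0.tar.gz">Foo 1.0</a>
<a
   class="x" href='Foo-1.1.tar.gz'>Foo 1.1</a>
<!-- <a href="Foo-0.1-broken.tar.gz">withdrawn</a> -->
</body>
</html>"""


def hrefs_with_regex(text):
    """Naive pattern: assumes href comes first and is double-quoted."""
    return re.findall(r'<a href="([^"]*)"', text)


def hrefs_with_xml(text):
    """Parse the page as XML and read href from every XHTML <a> element."""
    root = ET.fromstring(text)
    return [a.get("href") for a in root.iter(XHTML_NS + "a")
            if a.get("href")]


if __name__ == "__main__":
    # The regex misses the reformatted link and picks up the commented-out one.
    print(hrefs_with_regex(PAGE))
    # The parser ignores the comment and is indifferent to attribute layout.
    print(hrefs_with_xml(PAGE))
```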
Jim
--
Jim Fulton mailto:jim at zope.com Python Powered!
CTO (540) 361-1714 http://www.python.org
Zope Corporation http://www.zope.com http://www.zope.org