Re: [Distutils] [Catalog-sig] Specification for package indexes?
Phillip J. Eby <pje@telecommunity.com> wrote: Why not? ;)
That was actually what I was afraid the reasoning was ;) I guess I just go all wobbly in the knees at the thought of having to maintain a "screen scraping" interface.

Funnily enough, Johannes Gijsbers, Andrew Dalke and I were talking about this very issue last night. I proposed that we detect the user-agent of the setuptools client and, in response, send back really minimalist HTML (no surrounding page template). Probably overkill, but this may have been after we'd had beer :)

Could you provide a clear list of all the specific changes you wish us to make at the Sprint?
Nonetheless, there are various aspects of easy_install's behavior and performance that could be significantly improved by using XML-RPC, so I definitely want it to do that in 0.7. I'm just wary of removing the existing behavior until it's clear that it's no longer necessary.
Oh - another thing that occurred to me -- does setuptools auto-update itself?

Richard
At 01:03 AM 7/7/2006 +1000, richardjones@optusnet.com.au wrote:
Phillip J. Eby <pje@telecommunity.com> wrote: Why not? ;)
That was actually what I was afraid the reasoning was ;)
I guess I just go all wobbly in the knees at the thought of having to maintain a "screen scraping" interface.
You don't need to -- at least not in the long term. Once setuptools 0.7 supports the XML-RPC interface, it won't need the other scraping tricks to read PyPI. Those would be left in for people who are creating their own package indexes, not constraining further development of PyPI itself.

Please keep in mind that easy_install makes *extremely* minimal assumptions about PyPI's interface:

1. It assumes that baseURL/projectname will get to the current version of projectname, or a page with a list of projectname's active versions.

2. It assumes that links within PyPI of the form baseURL/something1/something2 are links to version 'something2' of a project named 'something1'.

3. It assumes that going to baseURL directly will result in a page with links to all available packages in the form described in #2.

4. It assumes that if baseURL/projectname returns a page containing the text "Index of Packages</title>", it is a list of links of the form described in #2.

5. It looks for and follows the first links following the strings "<th>Home Page" and "<th>Download URL" in a project page.

6. It makes assumptions about how to find MD5 data on a PyPI page, but if it fails to do so, it simply won't check the MD5 of downloads.

Also note that even with an XML-RPC interface, easy_install will *still* need to read an HTML page to gather links, because it's valid for people to provide links in their long_description using reStructuredText. It's just that assumptions 1, 3, and 4 (and maybe 5) would not be necessary. Also note that in a pinch, you can put the strings easy_install is looking for inside HTML comments. Easy_install really isn't that bright. ;)

However, if you can provide *all* of this data via the API (including an HTML-formatted long description), then the screen scraping can go away as far as PyPI is concerned.
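Assumptions 2 and 5 above amount to a very small amount of pattern matching. A rough sketch of assumption 2, for illustration only -- the function name, regex, and example URLs are invented here and are not setuptools' actual code:

```python
import re

# Hypothetical sketch of assumption 2: any href under the index's base
# URL with exactly two trailing path segments is treated as a link to
# version 'something2' of project 'something1'.
HREF = re.compile(r'href=["\']?([^"\'> ]+)')

def scrape_versions(html, base_url):
    """Return (project, version) pairs for links under base_url."""
    pairs = []
    for url in HREF.findall(html):
        if url.startswith(base_url):
            parts = url[len(base_url):].strip('/').split('/')
            if len(parts) == 2:
                pairs.append(tuple(parts))
    return pairs

page = ('<a href="http://example.org/pypi/PEAK/0.5a3">PEAK 0.5a3</a> '
        '<a href="http://example.org/pypi/PEAK/0.5a4">PEAK 0.5a4</a>')
print(scrape_versions(page, 'http://example.org/pypi/'))
# -> [('PEAK', '0.5a3'), ('PEAK', '0.5a4')]
```

Note that nothing here depends on the visible page content, only on the URL layout, which is why a static index can satisfy these assumptions for free.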
Funnily enough, Johannes Gijsbers, Andrew Dalke and I were talking about this very issue last night. I proposed that we detect the user-agent of the setuptools client and, in response, send back really minimalist HTML (no surrounding page template). Probably overkill, but this may have been after we'd had beer :)
There's a simpler solution that could be implemented: adding a rel="easy-install" attribute to links that easy_install should follow. Currently, those links are the project's home page URL, download URL, and the links to specific versions that show up when you go to a project that has multiple active versions. Adding it to these, and *only* these links, would give easy_install enough information to do the right thing. However, support would have to wait for setuptools 0.7 anyhow, so there's little reason to do this.

Hm. I just tried to make multiple versions of PEAK active, and it seems like you can't get the page that lists multiple versions any more. No wonder some people have been having problems downloading older versions of certain packages. :( How are people supposed to get to older package versions now? That is, what's the point of being able to have multiple active versions if you can't find them? Is this an intended change, or a bug?
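A rough sketch of how a client could honor such an attribute -- the function name, regex, and sample markup are all hypothetical, since PyPI does not emit rel="easy-install" today:

```python
import re

# Invented illustration: follow only links the index has explicitly
# tagged with rel="easy-install", ignoring everything else on the page.
TAGGED = re.compile(
    r'<a\s+[^>]*rel=["\']easy-install["\'][^>]*href=["\']([^"\']+)["\']',
    re.IGNORECASE)

def links_to_follow(html):
    return TAGGED.findall(html)

page = ('<a rel="easy-install" href="http://peak.telecommunity.com/">home</a>'
        '<a href="/pypi?:action=search">search (ignored)</a>')
print(links_to_follow(page))
# -> ['http://peak.telecommunity.com/']
```

Since the attribute is invisible in a rendered page, the index's human-facing design stays completely unconstrained.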
Could you provide a clear list of all the specific changes you wish for us to make at the Sprint?
I've provided a list above of what changes I want you *not* to make. How's that? ;)
Nonetheless, there are various aspects of easy_install's behavior and performance that could be significantly improved by using XML-RPC, so I definitely want it to do that in 0.7. I'm just wary of removing the existing behavior until it's clear that it's no longer necessary.
Oh - another thing that occurred to me -- does setuptools auto update itself?
What do you mean? You can run "easy_install -U setuptools" to upgrade to the latest release at any time. But it doesn't go out looking for updates on its own.
I'd like to suggest that we take a step back. It feels as though we are reacting rather than designing. I think we have the following goals:

1. setuptools should be able to read indexes robustly and efficiently.

2. It should be straightforward, and preferably *easy*, for people to implement their own indexes. This is very important to me. :)

Perhaps:

3. It should be easy to mirror an index.

4. It should be possible to create a read-only index as a static HTTP server.

And I suggest:

5. It should be possible to provide an end-user experience for an index without affecting the setuptools interface.

6. It should be possible to write other setuptools-like applications for accessing indexes. This means that the web-service (small w-s) should be well defined and/or that setuptools should expose a Python API for accessing indexes.

From a design perspective:

a. screen scraping is bad

b. the web API should be simple and well defined.

I suggest, as others have suggested, that we create an *alternate* web API for reading an index, focused on cleanliness and on making the API as easy as possible to implement for both index and client developers. If we agree with all of the goals stated above, I think this should be a static HTTP interface using XHTML or some other XML dialect. Perhaps we could even use specific HTML class attrs to make it possible to combine the PyPI and user interfaces if an index implementor desires.

Thoughts?

Jim

--
Jim Fulton           mailto:jim@zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org
At 06:55 AM 7/7/2006 -0400, Jim Fulton wrote:
From a design perspective:
a. screen scraping is bad
As long as you define "screen scraping" as "dependency on visible characteristics of HTML", then I agree. easy_install shouldn't be relying on the visible bits of HTML that it currently uses to scope out PyPI. Relying on a particular URL layout is not screen-scraping, though, and using the URL layout as part of the API has some good properties for ease of implementation in static form. So does using href's to obtain link information. What we should be doing is adding non-visible markup (e.g. class="" or rel="") information to the links to allow index creators to direct easy_install without affecting visible page characteristics.
b. the web API should be simple and well defined.
I suggest, as others have suggested, that we create an *alternate* web API for reading an index focussed on cleanliness and on making the API as easy as possible to implement for both index and client developers. If we agree with all of the goals stated above, I think this should be static HTTP interface using XHTML or some other XML dialect. Perhaps we could even use specific HTML class attrs to make it possible to combine the pypi and user interfaces if an index implementor desires.
Thoughts?
+1 on static pages. I don't, however, see a reason to require valid XML. Or rather, I don't expect to implement XML parsing in easy_install; if the spec is too complex to implement with regular expression matching, it's probably too complex for people to throw together an index with what's at hand. In particular, I'd like it to be practical to put together a simple index just using Apache's built-in directory indexes, as long as they use the right URL hierarchy. That means that class or rel attributes should only be required for links that are requesting non-index pages to be spidered.
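The static layout described here might look roughly like the following. This is my own illustration, not a published spec: `build_index` and the filenames are invented, and the assumption is that an auto-indexing server (e.g. Apache's mod_autoindex) fills in the per-directory listings:

```python
import os
import tempfile

def build_index(root, releases):
    """Sketch: one directory per project, distribution files inside.

    releases maps a project name to its distribution filenames.
    """
    for project, files in releases.items():
        proj_dir = os.path.join(root, project)
        os.makedirs(proj_dir, exist_ok=True)
        for name in files:
            # Empty placeholder files standing in for real sdists/eggs.
            open(os.path.join(proj_dir, name), 'w').close()
    # A minimal hand-written top page; a server's built-in directory
    # index would produce an equivalent listing automatically.
    with open(os.path.join(root, 'index.html'), 'w') as f:
        f.write('<html><body>\n')
        for project in sorted(releases):
            f.write('<a href="%s/">%s</a>\n' % (project, project))
        f.write('</body></html>\n')

root = tempfile.mkdtemp()
build_index(root, {'PEAK': ['PEAK-0.5a4.tar.gz']})
print(sorted(os.listdir(root)))
# -> ['PEAK', 'index.html']
```

The point is that the whole "API" reduces to a URL hierarchy plus hrefs, which any static file server can provide with no code at all.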
On Jul 7, 2006, at 12:18 PM, Phillip J. Eby wrote:
At 06:55 AM 7/7/2006 -0400, Jim Fulton wrote:
From a design perspective:
a. screen scraping is bad
As long as you define "screen scraping" as "dependency on visible characteristics of HTML", then I agree. easy_install shouldn't be relying on the visible bits of HTML that it currently uses to scope out PyPI.
Yup
Relying on a particular URL layout is not screen-scraping, though, and using the URL layout as part of the API has some good properties for ease of implementation in static form. So does using href's to obtain link information.
Yes.
What we should be doing is adding non-visible markup (e.g. class="" or rel="") information to the links to allow index creators to direct easy_install without affecting visible page characteristics.
Yes
b. the web API should be simple and well defined.
I suggest, as others have suggested, that we create an *alternate* web API for reading an index focussed on cleanliness and on making the API as easy as possible to implement for both index and client developers. If we agree with all of the goals stated above, I think this should be static HTTP interface using XHTML or some other XML dialect. Perhaps we could even use specific HTML class attrs to make it possible to combine the pypi and user interfaces if an index implementor desires.
Thoughts?
+1 on static pages. I don't, however, see a reason to require valid XML. Or rather, I don't expect to implement XML parsing in easy_install; if the spec is too complex to implement with regular expression matching, it's probably too complex for people to throw together an index with what's at hand. In particular, I'd like it to be practical to put together a simple index just using Apache's built-in directory indexes, as long as they use the right URL hierarchy. That means that class or rel attributes should only be required for links that are requesting non-index pages to be spidered.
I would find parsing much easier with an XML parser than with regular expressions. I think it would be much more robust too. I do want to see something that is well documented and pretty easy to implement.

Jim
On 7/7/06, Jim Fulton <jim@zope.com> wrote:
+1 on static pages. I don't, however, see a reason to require valid XML. Or rather, I don't expect to implement XML parsing in easy_install; if the spec is too complex to implement with regular expression matching, it's probably too complex for people to throw together an index with what's at hand. In particular, I'd like it to be practical to put together a simple index just using Apache's built-in directory indexes, as long as they use the right URL hierarchy. That means that class or rel attributes should only be required for links that are requesting non-index pages to be spidered.
I would find parsing much easier with an XML parser than with regular expressions. I think it would be much more robust too.
XHTML would be best, though I agree we shouldn't care about validity so much as just well-formedness (which is required). I think it should be possible to do it with valid XHTML, though, since whether that's desired or not is a python.org policy concern. (Not that I suspect we'll ever really care about that.) Of course, it should be possible to parse with htmllib and HTMLParser as well.

-Fred

--
Fred L. Drake, Jr.   <fdrake at gmail.com>
"Every sin is the result of a collaboration."  --Lucius Annaeus Seneca
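The parser-based approach Fred mentions can be sketched with the modern stdlib descendant of the HTMLParser he names; `LinkCollector` is an invented example, not code from any of these tools. Unlike regexes, a real parser copes with attribute order, quoting style, and case for free:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (rel, href) pairs from every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser normalizes tag and attribute names to lowercase.
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.links.append((attrs.get('rel'), attrs['href']))

# Messy but legal markup: uppercase names, unquoted and
# single-quoted attribute values.
page = "<A REL=easy-install HREF='http://example.org/dl'>get it</A>"
parser = LinkCollector()
parser.feed(page)
print(parser.links)
# -> [('easy-install', 'http://example.org/dl')]
```

The regex approach would need separate cases for each of those quoting and ordering variations, which is the robustness gap Jim is pointing at.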
participants (4)

- Fred Drake
- Jim Fulton
- Phillip J. Eby
- richardjones@optusnet.com.au