Re: [Distutils] [Catalog-sig] Specification for package indexes?

At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
On 7/7/06, Jim Fulton <jim@zope.com> wrote:
+1 on static pages. I don't, however, see a reason to require valid XML. Or rather, I don't expect to implement XML parsing in easy_install; if the spec is too complex to implement with regular expression matching, it's probably too complex for people to throw together an index with what's at hand. In particular, I'd like it to be practical to put together a simple index just using Apache's built-in directory indexes, as long as they use the right URL hierarchy. That means that class or rel attributes should only be required for links that are requesting non-index pages to be spidered.
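The "Apache directory index" idea can be made concrete; here is a hedged sketch of what such a static layout might look like on disk, with illustrative project and file names, assuming mod_autoindex is enabled on the server:

```python
import os

# Illustrative layout for a static index served by Apache with directory
# indexes (mod_autoindex) enabled; the project and file names are made up.
os.makedirs("index/ZODB3/3.6.0", exist_ok=True)
open("index/ZODB3/3.6.0/ZODB3-3.6.0.tar.gz", "wb").close()

# Apache would then serve /index/, /index/ZODB3/ and /index/ZODB3/3.6.0/
# as auto-generated pages of href links matching the URL hierarchy.
print(sorted(os.listdir("index/ZODB3/3.6.0")))  # ['ZODB3-3.6.0.tar.gz']
```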
I would find parsing much easier with an XML parser than with regular expressions. I think it would be much more robust too.
XHTML would be best, though I agree we shouldn't care about validity so much as just well-formedness (which is required). I think it should be possible to do it with valid XHTML, though, since whether that's desired or not is a python.org policy concern. (Not that I suspect we'll ever really care about that.)
Of course, it should be possible to parse with htmllib and HTMLParser as well.
I still think requiring even HTML validity or well-formedness is YAGNI; one could indeed just pull all well-formed URLs from the page. EasyInstall uses this case-insensitive regular expression to find only href'd urls:

href\s*=\s*['"]?([^'"> ]+)

In the absence of a requirement for more information than this (perhaps coupled with a "rel" attribute in the same element), I'm wary of starting out by requiring even well-formedness, because it's way overkill for the requirements as I understand them.

One of the advantages of defining the URL layout as part of the API is that it gives you enough contextual information to decide which links should be followed, and which ones are purely informational. Indeed, the only reason to look at anything *but* hrefs is to indicate that an *external* (i.e. non-index) link should be followed, to spider for other download links. So if following external links is out of scope for the API we want to define, then *any* information other than the URLs in an API page is YAGNI.

Now, all of this is based on my assumption that the use case here is that somebody wants to throw together a rough-and-ready package index that tools should be able to use to find *downloadable distributions*. If you and Jim have much more elaborate use cases in mind, then of course some well-formedness might be useful. On the other hand, if such rigor is required, then it seems like we should just be using machine-readable data in the first place, rather than a dual-purpose format like HTML or XHTML. Just go with a specialized XML dialect or some kind of text format (ZConfig? ;) ) and be done with it.
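That regular expression can be exercised directly with the stdlib `re` module; the sample page below is illustrative, not taken from a real index:

```python
import re

# easy_install's case-insensitive pattern for href'd URLs, as quoted above
HREF = re.compile(r"""href\s*=\s*['"]?([^'"> ]+)""", re.IGNORECASE)

page = '<a HREF="ZODB3-3.6.0.tar.gz">ZODB3 3.6.0</a> <a href=\'../simple/\'>up</a>'
print(HREF.findall(page))  # ['ZODB3-3.6.0.tar.gz', '../simple/']
```

Note that it tolerates unquoted and single-quoted attributes and never needs the page to be well-formed, which is the point being argued.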

On Jul 7, 2006, at 2:31 PM, Phillip J. Eby wrote:
At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
On 7/7/06, Jim Fulton <jim@zope.com> wrote:
+1 on static pages. I don't, however, see a reason to require valid XML. Or rather, I don't expect to implement XML parsing in easy_install; if the spec is too complex to implement with regular expression matching, it's probably too complex for people to throw together an index with what's at hand. In particular, I'd like it to be practical to put together a simple index just using Apache's built-in directory indexes, as long as they use the right URL hierarchy. That means that class or rel attributes should only be required for links that are requesting non-index pages to be spidered.
I would find parsing much easier with an XML parser than with regular expressions. I think it would be much more robust too.
XHTML would be best, though I agree we shouldn't care about validity so much as just well-formedness (which is required). I think it should be possible to do it with valid XHTML, though, since whether that's desired or not is a python.org policy concern. (Not that I suspect we'll ever really care about that.)
Of course, it should be possible to parse with htmllib and HTMLParser as well.
I still think requiring even HTML validity or well-formedness is YAGNI; one could indeed just pull all well-formed URLs from the page. EasyInstall uses this case-insensitive regular expression to find only href'd urls:
href\s*=\s*['"]?([^'"> ]+)
In the absence of a requirement for more information than this (perhaps coupled with a "rel" attribute in the same element), I'm wary of starting out by requiring even well-formedness, because it's way overkill for the requirements as I understand them.
But I thought we *were* talking about adding rel or class tags so that we could determine information about the intended use of a URL.
One of the advantages of defining the URL layout as part of the API is that it gives you enough contextual information to decide which links should be followed, and which ones are purely informational.
Perhaps someone should propose an API and we'll see. :)
Indeed, the only reason to look at anything *but* hrefs is to indicate that an *external* (i.e. non-index) link should be followed, to spider for other download links. So if following external links is out of scope for the API we want to define, then *any* information other than the URLs in an API page is YAGNI.
Who said following external links is out of scope?
Now, all of this is based on my assumption that the use case here is somebody wants to throw together a rough-and-ready package index that tools should be able to use to find *downloadable distributions*. If you and Jim have much more elaborate use cases in mind, then of course some well-formedness might be useful.
setuptools has a notion of an index. That notion is not at all well defined. Currently, the index has links that are followed to find package links elsewhere. This seems reasonably useful. I dunno. I'm not sure I care.

What I do care about is that the index API should be well defined so that we can implement alternate indexes and alternate tools to read indexes. I'm not looking to satisfy use cases beyond what we have now. All I want is an API. :) I'm not bent on XML.

Jim

--
Jim Fulton           mailto:jim@zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org

At 02:52 PM 7/7/2006 -0400, Jim Fulton wrote:
On Jul 7, 2006, at 2:31 PM, Phillip J. Eby wrote:
At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
On 7/7/06, Jim Fulton <jim@zope.com> wrote:
+1 on static pages. I don't, however, see a reason to require valid XML. Or rather, I don't expect to implement XML parsing in easy_install; if the spec is too complex to implement with regular expression matching, it's probably too complex for people to throw together an index with what's at hand. In particular, I'd like it to be practical to put together a simple index just using Apache's built-in directory indexes, as long as they use the right URL hierarchy. That means that class or rel attributes should only be required for links that are requesting non-index pages to be spidered.
I would find parsing much easier with an XML parser than with regular expressions. I think it would be much more robust too.
XHTML would be best, though I agree we shouldn't care about validity so much as just well-formedness (which is required). I think it should be possible to do it with valid XHTML, though, since whether that's desired or not is a python.org policy concern. (Not that I suspect we'll ever really care about that.)
Of course, it should be possible to parse with htmllib and HTMLParser as well.
I still think requiring even HTML validity or well-formedness is YAGNI; one could indeed just pull all well-formed URLs from the page. EasyInstall uses this case-insensitive regular expression to find only href'd urls:
href\s*=\s*['"]?([^'"> ]+)
In the absence of a requirement for more information than this (perhaps coupled with a "rel" attribute in the same element), I'm wary of starting out by requiring even well-formedness, because it's way overkill for the requirements as I understand them.
But I thought we *were* talking about adding rel or class tags so that we could determine information about the intended use of a URL.
Yes -- but they're only needed to support following second-order external links: i.e., links to non-index HTML pages.
One of the advantages of defining the URL layout as part of the API is that it gives you enough contextual information to decide which links should be followed, and which ones are purely informational.
Perhaps someone should propose an API and we'll see. :)
I thought I already did. :) Here it is again:

- baseURL/ should return a page containing href links to projects
- baseURL/projectname should return a page containing href links to version pages
- baseURL/projectname/version should return a page with download links (ideally with MD5 info)
- Links are found via href="" attributes
- URLs' trailing path components are used to identify distributions

This is a sufficient API to allow querying packages for downloading purposes, as long as all download links are found in the index's pages. Additional information is only needed to allow following external links to *other index pages*.

Coincidentally, easy_install is already mostly compatible with such an API; it would mostly be a matter of *removing* things from easy_install, rather than adding them.
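A client for that hierarchy is little more than URL construction; here is a minimal sketch, where the base URL and the helper names are hypothetical, not part of any released tool:

```python
from urllib.parse import urljoin

# Client-side sketch of the proposed hierarchy; base_url and these helper
# names are hypothetical, not part of any released tool.
def project_url(base_url, project):
    # baseURL/projectname -> page of href links to version pages
    return urljoin(base_url, project + "/")

def version_url(base_url, project, version):
    # baseURL/projectname/version -> page of download links
    return urljoin(base_url, "%s/%s" % (project, version))

base = "http://example.org/index/"
print(version_url(base, "ZODB3", "3.6.0"))  # http://example.org/index/ZODB3/3.6.0
```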
Indeed, the only reason to look at anything *but* hrefs is to indicate that an *external* (i.e. non-index) link should be followed, to spider for other download links. So if following external links is out of scope for the API we want to define, then *any* information other than the URLs in an API page is YAGNI.
Who said following external links is out of scope?
Nobody; I was just saying that *if* it were out of scope, the class/rel stuff would become unnecessary.
Now, all of this is based on my assumption that the use case here is somebody wants to throw together a rough-and-ready package index that tools should be able to use to find *downloadable distributions*. If you and Jim have much more elaborate use cases in mind, then of course some well-formedness might be useful.
setuptools has a notion of an index. That notion is not at all well defined.
It's mostly operationally defined in terms of what PyPI did when it was written.
Currently, the index has links that are followed to find package links elsewhere. This seems reasonably useful. I dunno. I'm not sure I care. What I do care about is that the index API should be well defined so that we can implement alternate indexes and alternate tools to read indexes. I'm not looking to satisfy use cases beyond what we have now.
Sure. I'm just saying we only need something beyond href="" links if they are intended to be followed by tools looking for package links. The reason this is necessary is that it's not sufficient to just follow links that point outside the package index; PyPI has links on its pages that go to other parts of python.org, so there needs to be something that distinguishes "links that might help find downloads". Links that *are* downloads are detected via URL content.
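The "detected via URL content" step can be sketched as a check on the URL's trailing path component; the extension list and helper name below are illustrative assumptions, not easy_install's actual code:

```python
import posixpath
from urllib.parse import urlparse

# Illustrative sketch: a link *is* a download if its trailing path
# component looks like a distribution filename. The extension list is an
# assumption, not easy_install's actual table.
DIST_EXTS = (".tar.gz", ".tgz", ".zip", ".egg")

def is_download_link(url):
    tail = posixpath.basename(urlparse(url).path)
    return tail.lower().endswith(DIST_EXTS)

print(is_download_link("http://www.zope.org/dist/ZODB3-3.6.0.tar.gz"))  # True
print(is_download_link("http://www.python.org/pypi/ZODB3/3.6.0"))       # False
```

This is why download links can live anywhere: no surrounding markup is needed, only the URL itself.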

On Jul 7, 2006, at 4:20 PM, Phillip J. Eby wrote:
At 02:52 PM 7/7/2006 -0400, Jim Fulton wrote: ...
Perhaps someone should propose an API and we'll see. :)
I thought I already did. :) Here it is again:
baseURL/ should return a page containing href links to projects
baseURL/projectname should return a page containing href links to version pages
baseURL/projectname/version should return a page with download links (ideally with MD5 info)
Links are found via href="" attributes
URLs' trailing path components are used to identify distributions
Hm. I hadn't seen this before. Perhaps I'm missing some messages from this thread.

By "download links", do you mean links to distributions? Or links to pages containing links to distributions?

Can the links to projects, links to version pages, or download links point off site? Can any of these pages contain other links?
This is a sufficient API to allow querying packages for downloading purposes, as long as all download links are found in the index's pages. Additional information is only needed to allow following external links to *other index pages*.
so, for example:

http://www.python.org/pypi/ZODB3/3.6.0

has a link to http://www.zope.org/Products/ZODB3.6. Is this a download link? Or an off-site index link? I'm having a little trouble following the jargon.
setuptools has a notion of an index. That notion is not at all well defined.
It's mostly operationally defined in terms of what PyPI did when it was written.
Right, not well defined. :) I'm not criticizing. What it does was great as a prototype, but it would be good to move beyond this.
Currently, the index has links that are followed to find package links elsewhere. This seems reasonably useful. I dunno. I'm not sure I care. What I do care about is that the index API should be well defined so that we can implement alternate indexes and alternate tools to read indexes. I'm not looking to satisfy use cases beyond what we have now.
Sure. I'm just saying we only need something beyond href="" links if they are intended to be followed by tools looking for package links.
The reason this is necessary is that it's not sufficient to just follow links that point outside the package index; PyPI has links on its pages that go to other parts of python.org, so there needs to be something that distinguishes "links that might help find downloads". Links that *are* downloads are detected via URL content.
Right. That's why I think the hrefs we care about should be marked with class attributes or some such.

Jim

At 04:45 PM 7/7/2006 -0400, Jim Fulton wrote:
On Jul 7, 2006, at 4:20 PM, Phillip J. Eby wrote:
At 02:52 PM 7/7/2006 -0400, Jim Fulton wrote: ...
Perhaps someone should propose an API and we'll see. :)
I thought I already did. :) Here it is again:
baseURL/ should return a page containing href links to projects
baseURL/projectname should return a page containing href links to version pages
baseURL/projectname/version should return a page with download links (ideally with MD5 info)
Links are found via href="" attributes
URLs' trailing path components are used to identify distributions
Hm. I hadn't seen this before. Perhaps I'm missing some messages from this thread.
By "download links", do you mean links to distributions?
Yes.
Or links to pages containing links to distributions?
No, these would be either "index pages" or "external links".
Can the links to projects, links to version pages, or download links point off site?
Download links can be anywhere, since they are identified from the tail of the URL. The links to project or version pages are defined by the URL hierarchy of the API.
Can any of these pages contain other links?
All of them can contain download links. Index pages can link to other index pages. Index page links to anything else are ignored, unless we allow "external links", in which case a method of identifying them is required.

Currently, easy_install uses only two kinds of external links: the home page and the "download URL". These are identified via HTML snippets that PyPI uses. This is one of only two pieces of "screen scraping" (as opposed to URL inspection and link detection) that easy_install does. (The other is used to distinguish a page that lists links to projects from an actual project page, as sometimes PyPI can display the former at a URL that is nominally for the latter.)
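If the snippet sniffing were replaced by the rel/class marking discussed in this thread, a spider could use the stdlib HTML parser instead of regular expressions. A minimal sketch follows; the rel values "homepage" and "download" are hypothetical markers, not PyPI's actual markup:

```python
from html.parser import HTMLParser

# Sketch of the rel-marking idea; the rel values "homepage" and
# "download" are hypothetical markers, not PyPI's actual markup.
class ExternalLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.external = []  # (rel, href) pairs worth spidering

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        if d.get("rel") in ("homepage", "download") and "href" in d:
            self.external.append((d["rel"], d["href"]))

parser = ExternalLinkFinder()
parser.feed('<a rel="homepage" href="http://www.zope.org/Products/ZODB3.6">home</a>'
            '<a href="/pypi">index</a>')
print(parser.external)  # [('homepage', 'http://www.zope.org/Products/ZODB3.6')]
```

Unmarked links, like the second one above, are simply ignored, which is exactly the filtering the thread is asking for.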
This is a sufficient API to allow querying packages for downloading purposes, as long as all download links are found in the index's pages. Additional information is only needed to allow following external links to *other index pages*.
so, for example:
http://www.python.org/pypi/ZODB3/3.6.0
Has a link to http://www.zope.org/Products/ZODB3.6. Is this a download link? Or an off-site index link? I'm having a little trouble following the jargon.
It's an "external link", and thus only followed if it's seen to be the "home page" or "download URL" on a package version page.
Sure. I'm just saying we only need something beyond href="" links if they are intended to be followed by tools looking for package links.
The reason this is necessary is that it's not sufficient to just follow links that point outside the package index; PyPI has links on its pages that go to other parts of python.org, so there needs to be something that distinguishes "links that might help find downloads". Links that *are* downloads are detected via URL content.
Right. That's why I think the hrefs we care about should be marked with class attributes or some such.
Yes, as long as we care about supporting the external links. I'm not certain we do, at least for the "third-party index" case.

On Jul 7, 2006, at 9:12 PM, Phillip J. Eby wrote:
At 04:45 PM 7/7/2006 -0400, Jim Fulton wrote: ...
By "download links", do you mean links to distributions?
Yes.
Or links to pages containing links to distributions?
No, these would be either "index pages" or "external links".
Which seems to be an important use case now.
Can the links to projects, links to version pages, or download links point off site?
Download links can be anywhere, since they are identified from the tail of the URL. The links to project or version pages are defined by the URL hierarchy of the API.
Hm. Why does it matter? I understand that you want to be able to go to index_url/project first, but I don't see that it matters where versions actually are.

For that matter, I could see value in a minimal index that just pointed to external project pages. In that case, going to index_url/project might even be allowed to redirect to an off-site project page. Of course, this couldn't be implemented with a static server, but it could still be a valuable option.
Can any of these pages contain other links?
All of them can contain download links. Index pages can link to other index pages. Index page links to anything else are ignored, unless we allow "external links", in which case a method of identifying them is required.
I think we want external links. We have them now. In fact, I think there is value in a project index that has no distributions or even version information but provides a central place to find project pages.

Note that, in a separate discussion, you pointed out that some considered it bad form to put interim project releases on PyPI. If PyPI could have links to remote project pages, then those sites could have different policies as needed by a project.
Currently, easy_install uses only two kinds of external links: the home page and the "download URL". These are identified via HTML snippets that PyPI uses. This is one of only two pieces of "screen scraping" (as opposed to URL inspection and link detection) that easy_install does. (The other is used to distinguish a page that lists links to projects from an actual project page, as sometimes PyPI can display the former at a URL that is nominally for the latter.)
This is a sufficient API to allow querying packages for downloading purposes, as long as all download links are found in the index's pages. Additional information is only needed to allow following external links to *other index pages*.
so, for example:
http://www.python.org/pypi/ZODB3/3.6.0
Has a link to http://www.zope.org/Products/ZODB3.6. Is this a download link? Or an off-site index link. I'm having a little trouble following the jargon.
It's an "external link", and thus only followed if it's seen to be the "home page" or "download URL" on a package version page.
Right, which is currently identified by sniffing the surrounding HTML.
Sure. I'm just saying we only need something beyond href="" links if they are intended to be followed by tools looking for package links.
The reason this is necessary is that it's not sufficient to just follow links that point outside the package index; PyPI has links on its pages that go to other parts of python.org, so there needs to be something that distinguishes "links that might help find downloads". Links that *are* downloads are detected via URL content.
Right. That's why I think the hrefs we care about should be marked with class attributes or some such.
Yes, as long as we care about supporting the external links. I'm not certain we do, at least for the "third-party index" case.
I think we do. I'm pretty sure we do for PyPI, and I sure as heck don't want a different API for PyPI and for other indexes. I'd really like to see a single index API. I would *like* to see the possibility of allowing off-site (off-index) projects, although I could live without this.

I have to say again that all of these details can get quite confusing. Maybe I'm alone in being confused by this, but I don't think so. I've spent a lot of time on and off over the last few months trying to leverage setuptools and now PyPI, and while I've had a lot of success, it has been harder than I think it should be. I think that this is an impediment to greater adoption of and benefit from setuptools.

I think we need to do a good job of documenting and explaining this API. I also think we need to write up some best practices or rationale to guide people toward better use of setuptools and PyPI together. I'm happy to help with this once we have agreement and once I understand what we agree to. :)

Jim
participants (2):
- Jim Fulton
- Phillip J. Eby