[Distutils] [Catalog-sig] Specification for package indexes?

Fri Jul 7 22:20:11 CEST 2006

At 02:52 PM 7/7/2006 -0400, Jim Fulton wrote:

>On Jul 7, 2006, at 2:31 PM, Phillip J. Eby wrote:
>
>>At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
>>>On 7/7/06, Jim Fulton <jim at zope.com> wrote:
>>> > > +1 on static pages.  I don't, however, see a reason to require
>>> > > valid XML.  Or rather, I don't expect to implement XML parsing in
>>> > > easy_install; if the spec is too complex to implement with regular
>>> > > expression matching, it's probably too complex for people to throw
>>> > > together an index with what's at hand.  In particular, I'd like it
>>> > > to be practical to put together a simple index just using Apache's
>>> > > built-in directory indexes, as long as they use the right URL
>>> > > hierarchy.  That means that class or rel attributes should only be
>>> > > required for links that are requesting non-index pages to be spidered.
>>> >
>>> > I would find parsing much easier with an XML parser than with
>>> > regular expressions.
>>> > I think it would be much more robust too.
>>>
>>>XHTML would be best, though I agree we shouldn't care about validity
>>>so much as just well-formedness (which is required).  I think it
>>>should be possible to do it with valid XHTML, though, since whether
>>>that's desired or not is a python.org policy concern.  (Not that I
>>>suspect we'll ever really care about that.)
>>>
>>>Of course, it should be possible to parse with htmllib and
>>>HTMLParser as well.
>>
>>I still think requiring even HTML validity or well-formedness is
>>YAGNI; one could indeed just pull all well-formed URLs from the
>>page.  EasyInstall uses this case-insensitive regular expression to
>>find only href'd urls:
>>
>>     href\s*=\s*['"]?([^'"> ]+)
>>
>>In the absence of a requirement for more information than this
>>(perhaps coupled with a "rel" attribute in the same element), I'm
>>wary of starting out by requiring even well-formedness, because
>>it's way overkill for the requirements as I understand them.
>
>But I thought we *were* talking about adding rel or class attributes so
>that we could determine information about the intended use of a URL.

Yes -- but they're only needed to support following second-order external 
links: i.e., links to non-index HTML pages.

>>One advantage of defining the URL layout as part of the API
>>is that it gives you enough contextual information to decide which
>>links should be followed, and which ones are purely informational.
>
>Perhaps someone should propose an API and we'll see. :)

I thought I already did.  :)  Here it is again:

baseURL/ should return a page containing href links to projects
baseURL/projectname should return a page containing href links to version pages
baseURL/projectname/version should return a page with download links 
(ideally with MD5 info)
Links are found via href="" attributes
URLs' trailing path components are used to identify distributions.

This is a sufficient API to allow querying packages for downloading 
purposes, as long as all download links are found in the index's 
pages.  Additional information is only needed to allow following external 
links to *other index pages*.
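
As a rough illustration of how little a client needs under this layout, here 
is a minimal Python sketch.  It reuses the href regular expression quoted 
above; everything else is an assumption made for the example: the 
example.com URLs, the in-memory PAGES dict standing in for HTTP fetches, the 
trailing slashes (so that relative links resolve), and the "#md5=" fragment 
as one possible way to carry MD5 info.

```python
import re
from urllib.parse import urljoin

# The case-insensitive href pattern quoted earlier in this message.
HREF = re.compile(r"""href\s*=\s*['"]?([^'"> ]+)""", re.I)


def find_links(page_url, html):
    """Return an absolute URL for every href found in html."""
    return [urljoin(page_url, m.group(1)) for m in HREF.finditer(html)]


# Hypothetical index contents keyed by URL; a real client would fetch these.
PAGES = {
    "http://example.com/index/": '<a href="Foo/">Foo</a>',
    "http://example.com/index/Foo/": '<a href="1.0/">1.0</a>',
    "http://example.com/index/Foo/1.0/": (
        '<a href="http://example.com/dl/Foo-1.0.tar.gz#md5=abc123">'
        "Foo-1.0.tar.gz</a>"
    ),
}


def download_links(base_url, project):
    """Walk the project page, then each version page, collecting links."""
    project_url = urljoin(base_url, project + "/")
    found = []
    for version_url in find_links(project_url, PAGES.get(project_url, "")):
        found.extend(find_links(version_url, PAGES.get(version_url, "")))
    return found
```

Calling download_links("http://example.com/index/", "Foo") walks the project 
page and then each version page; no attribute other than href is consulted.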

Coincidentally, easy_install is already mostly compatible with such an API; 
it would mostly be a matter of *removing* things from easy_install, rather 
than adding them.

>>Indeed, the only reason to look at anything *but* hrefs is to
>>indicate that an *external* (i.e. non-index) link should be
>>followed, to spider for other download links.  So if following
>>external links is out of scope for the API we want to define, then
>>*any* information other than the URLs in an API page is YAGNI.
>
>Who said following external links is out of scope?

Nobody; I was just saying that *if* it were out of scope, the class/rel 
stuff would become unnecessary.

>>Now, all of this is based on my assumption that the use case here
>>is somebody wants to throw together a rough-and-ready package index
>>that tools should be able to use to find *downloadable
>>distributions*.  If you and Jim have much more elaborate use cases
>>in mind, then of course some well-formedness might be useful.
>
>setuptools has a notion of an index.  That notion is not at all well
>defined.

It's mostly operationally defined in terms of what PyPI did when it was 
written.

>Currently, the index has links that are followed to find package
>links elsewhere.
>This seems reasonably useful.  I dunno.  I'm not sure I care.  What I
>do care about is that the index API should be well defined so that we
>can implement alternate indexes and alternate tools to read indexes.
>I'm not looking to satisfy use cases beyond what we have now.

Sure.  I'm just saying we only need something beyond href="" attributes for 
links that are intended to be followed by tools looking for package links.

Some marker is necessary because it's not sufficient to just follow every 
link that points outside the package index; PyPI has links on its pages 
that go to other parts of python.org, so there needs to be something that 
distinguishes "links that might help find downloads".  Links that *are* 
downloads are detected via URL content.
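
To make the distinction concrete, here is a hedged sketch of how a tool 
might sort the links on a page.  Nothing in it is settled API: the archive 
suffixes, the rel values "homepage" and "download", and the function name 
classify are all hypothetical choices for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical: filename suffixes that mark a URL as a downloadable distribution.
DIST_SUFFIXES = (".tar.gz", ".tgz", ".zip", ".egg")
# Hypothetical rel values meaning "follow this link to look for downloads".
FOLLOW_RELS = {"homepage", "download"}

TAG = re.compile(r"<a\s[^>]*>", re.I)
HREF = re.compile(r"""href\s*=\s*['"]?([^'"> ]+)""", re.I)
REL = re.compile(r"""rel\s*=\s*['"]?([^'"> ]+)""", re.I)


def classify(html):
    """Split a page's links into downloads, links to spider, and the rest."""
    downloads, follow, ignored = [], [], []
    for tag in TAG.findall(html):
        href = HREF.search(tag)
        if not href:
            continue
        url = href.group(1)
        rel = REL.search(tag)
        # Use only the path, so a fragment such as "#md5=..." is not in the way.
        path = urlparse(url).path.lower()
        if path.endswith(DIST_SUFFIXES):
            downloads.append(url)   # detected via URL content alone
        elif rel and rel.group(1).lower() in FOLLOW_RELS:
            follow.append(url)      # explicitly marked as worth spidering
        else:
            ignored.append(url)     # e.g. navigation to other parts of the site
    return downloads, follow, ignored
```

With this split, an unmarked link to another part of python.org lands in the 
ignored list, while a link carrying one of the assumed rel values is queued 
for spidering even though its URL does not look like a download.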