[Distutils] [Catalog-sig] Specification for package indexes?

Fri Jul 7 20:31:01 CEST 2006

At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
>On 7/7/06, Jim Fulton <jim at zope.com> wrote:
> > > +1 on static pages.  I don't, however, see a reason to require
> > > valid XML.  Or rather, I don't expect to implement XML parsing in
> > > easy_install; if the spec is too complex to implement with regular
> > > expression matching, it's probably too complex for people to throw
> > > together an index with what's at hand.  In particular, I'd like it
> > > to be practical to put together a simple index just using Apache's
> > > built-in directory indexes, as long as they use the right URL
> > > hierarchy.  That means that class or rel attributes should only be
> > > required for links that are requesting non-index pages to be spidered.
> >
> > I would find parsing much easier with an XML parser than with
> > regular expressions.
> > I think it would be much more robust too.
>
>XHTML would be best, though I agree we shouldn't care about validity
>so much as just well-formedness (which is required).  I think it
>should be possible to do it with valid XHTML, though, since whether
>that's desired or not is a python.org policy concern.  (Not that I
>suspect we'll ever really care about that.)
>
>Of course, it should be possible to parse with htmllib and HTMLParser as well.

I still think requiring even HTML validity or well-formedness is YAGNI; one 
could indeed just pull all well-formed URLs from the page.  EasyInstall 
uses this case-insensitive regular expression to find only href'd URLs:

     href\s*=\s*['"]?([^'"> ]+)

In the absence of a requirement for more information than this (perhaps 
coupled with a "rel" attribute in the same element), I'm wary of starting 
out by requiring even well-formedness, because it's way overkill for the 
requirements as I understand them.
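
To illustrate how little that expression needs from the page, here is a 
minimal sketch in Python. The find_links helper and the SAMPLE page are 
made up for illustration and are not EasyInstall's actual code; only the 
pattern itself is the one quoted above.

```python
import re

# The pattern quoted above, compiled case-insensitively.
HREF = re.compile(r"""href\s*=\s*['"]?([^'"> ]+)""", re.IGNORECASE)

def find_links(page):
    """Return every href target found in `page`, in document order."""
    return HREF.findall(page)

# Hypothetical, deliberately sloppy index page: unclosed tags, mixed
# quoting, mixed case.  It is neither valid nor well-formed.
SAMPLE = """<html><body>
<a href="Foo-1.0.tar.gz">Foo 1.0</a><br>
<A HREF='Foo-1.1.zip'>Foo 1.1
<a href=http://example.com/foo/ rel="homepage">home
"""

if __name__ == "__main__":
    for link in find_links(SAMPLE):
        print(link)
```

All three links come out even though an XML parser would reject the page.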

One of the advantages of defining the URL layout as part of the API is that 
it gives you enough contextual information to decide what links should be 
followed, and which ones are purely informational.
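
As a rough sketch of what I mean, a tool could classify each link from its 
URL alone. Everything specific here is an assumption for illustration: the 
INDEX_ROOT address, the classify helper, and the list of archive suffixes 
are not part of any agreed layout.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical index root; the thread has not fixed an actual layout.
INDEX_ROOT = "http://example.com/index/"

# Assumed archive suffixes, purely for illustration.
ARCHIVE_SUFFIXES = (".tar.gz", ".zip", ".egg")

def classify(page_url, href, index_root=INDEX_ROOT):
    """Classify a link found on `page_url` using only its URL."""
    url = urljoin(page_url, href)           # resolve relative links
    if urlparse(url).path.endswith(ARCHIVE_SUFFIXES):
        return "download"                   # looks like a distribution
    if url.startswith(index_root):
        return "index"                      # another index page: follow it
    return "info"                           # informational: do not follow

if __name__ == "__main__":
    page = INDEX_ROOT + "Foo/"
    for href in ("Foo-1.0.tar.gz", "../", "http://foo.example.org/"):
        print(href, "->", classify(page, href))
```

No markup beyond the href itself is consulted to make the decision.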

Indeed, the only reason to look at anything *but* hrefs is to indicate that 
an *external* (i.e. non-index) link should be followed, to spider for other 
download links.  So if following external links is out of scope for the API 
we want to define, then *any* information other than the URLs in an API 
page is YAGNI.
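
If external links do stay in scope, a "rel" attribute in the same element 
can still be picked out without a real parser. The following is only a 
sketch under assumptions: the rel values "homepage" and "download" and the 
links_to_spider helper are hypothetical, and the whitespace-before-rel 
trick is a shortcut that a stricter tool might not accept.

```python
import re

# Opening <a ...> tags only; closing </a> tags do not match.
TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
HREF = re.compile(r"""href\s*=\s*['"]?([^'"> ]+)""", re.IGNORECASE)
# Require whitespace before "rel" so that "?rel=..." in a URL is skipped.
REL = re.compile(r"""\srel\s*=\s*['"]?([^'"> ]+)""", re.IGNORECASE)

def links_to_spider(page, rel_values=("homepage", "download")):
    """Return hrefs whose element carries one of the assumed rel values."""
    found = []
    for tag in TAG.findall(page):
        href = HREF.search(tag)
        rel = REL.search(tag)
        if href and rel and rel.group(1).lower() in rel_values:
            found.append(href.group(1))
    return found

# Hypothetical page mixing a plain download link with two flagged links.
SAMPLE = (
    '<a href="Foo-1.0.tar.gz">Foo</a> '
    '<a href="http://foo.example.org/" rel="homepage">home</a> '
    '<A REL=download HREF=http://dl.example.org/foo/>dl</A>'
)

if __name__ == "__main__":
    print(links_to_spider(SAMPLE))
```

Links without a rel attribute are left to the plain href handling.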

Now, all of this is based on my assumption that the use case here is that 
somebody wants to throw together a rough-and-ready package index that tools 
should be able to use to find *downloadable distributions*.  If you and Jim 
have much more elaborate use cases in mind, then of course some 
well-formedness might be useful.

On the other hand, if such rigor is required, then it seems like we should 
just be using machine-readable data in the first place, rather than using a 
dual-purpose format like HTML or XHTML.  Just go with a specialized XML 
dialect or some kind of text format (ZConfig? ;) ) and be done with it.