[Distutils] [Catalog-sig] Specification for package indexes?
Phillip J. Eby
pje at telecommunity.com
Fri Jul 7 20:31:01 CEST 2006
At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
>On 7/7/06, Jim Fulton <jim at zope.com> wrote:
> > > +1 on static pages. I don't, however, see a reason to require
> > > valid XML. Or rather, I don't expect to implement XML parsing in
> > > easy_install; if the spec is too complex to implement with regular
> > > expression matching, it's probably too complex for people to throw
> > > together an index with what's at hand. In particular, I'd like it
> > > to be practical to put together a simple index just using Apache's
> > > built-in directory indexes, as long as they use the right URL
> > > hierarchy. That means that class or rel attributes should only be
> > > required for links that are requesting non-index pages to be spidered.
> >
> > I would find parsing much easier with an XML parser than with
> > regular expressions.
> > I think it would be much more robust too.
>
>XHTML would be best, though I agree we shouldn't care about validity
>so much as just well-formedness (which is required). I think it
>should be possible to do it with valid XHTML, though, since whether
>that's desired or not is a python.org policy concern. (Not that I
>suspect we'll ever really care about that.)
>
>Of course, it should be possible to parse with htmllib and HTMLParser as well.
I still think requiring even HTML validity or well-formedness is YAGNI; one
could indeed just pull all well-formed URLs from the page. EasyInstall
uses this case-insensitive regular expression to find only href'd URLs:
href\s*=\s*['"]?([^'"> ]+)
In the absence of a requirement for more information than this (perhaps
coupled with a "rel" attribute in the same element), I'm wary of starting
out by requiring even well-formedness, because it's way overkill for the
requirements as I understand them.
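
As a rough illustration of how little machinery that takes, here is a minimal Python sketch that applies the pattern quoted above to a page of HTML. The function name `find_links` and the sample markup are made up for this example and are not EasyInstall's actual code; only the regular expression itself comes from the message.

```python
import re

# The pattern quoted above, compiled case-insensitively (re.I).
# It captures whatever follows href= up to a quote, '>' or space.
HREF = re.compile(r'''href\s*=\s*['"]?([^'"> ]+)''', re.I)

def find_links(page):
    """Return every href value found in `page`, in document order."""
    return HREF.findall(page)

# Hypothetical markup: one quoted, upper-case HREF and one unquoted href,
# neither of which needs to be well-formed XML to be picked up.
sample = '<a HREF="foo-1.0.tar.gz">foo</a> <a href=bar/>bar</a>'
links = find_links(sample)
```

Note that the pattern makes no attempt to check that the surrounding markup is valid or even balanced; it only looks for things shaped like an href attribute.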
One of the advantages of defining the URL layout as part of the API is that
it gives you enough contextual information to decide what links should be
followed, and which ones are purely informational.
Indeed, the only reason to look at anything *but* hrefs is to indicate that
an *external* (i.e. non-index) link should be followed, to spider for other
download links. So if following external links is out of scope for the API
we want to define, then *any* information other than the URLs in an API
page is YAGNI.
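
To make the "URL layout as context" idea concrete, here is a small Python sketch of how a tool might sort links using nothing but their URLs. Everything in it is an assumption for illustration: the `classify` helper, the "index"/"external" labels, and the example.com addresses are hypothetical, and no particular URL hierarchy has actually been agreed on in this thread.

```python
from urllib.parse import urljoin

def classify(index_base, page_url, href):
    """Label one href as 'index' or 'external' and return its absolute URL.

    Assumption: anything that resolves to a URL under `index_base` is
    part of the index; everything else is an external link.
    """
    absolute = urljoin(page_url, href)  # resolve relative hrefs
    kind = "index" if absolute.startswith(index_base) else "external"
    return kind, absolute

# Hypothetical index root and project page.
base = "http://example.com/simple/"
page = "http://example.com/simple/foo/"

inside = classify(base, page, "foo-1.0.tar.gz")
outside = classify(base, page, "http://other.example.org/downloads/")
```

Under this assumption, a tool would follow "index" links unconditionally and would need extra markup (such as a rel attribute) only to decide whether an "external" link is worth spidering.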
Now, all of this is based on my assumption that the use case here is that
somebody wants to throw together a rough-and-ready package index that tools
should be able to use to find *downloadable distributions*. If you and Jim
have much more elaborate use cases in mind, then of course some
well-formedness might be useful.
On the other hand, if such rigor is required, then it seems like we should
just be using machine-readable data in the first place, rather than using a
dual-purpose format like HTML or XHTML. Just go with a specialized XML
dialect or some kind of text format (ZConfig? ;) ) and be done with it.