[Doc-SIG] URI schemes (was Re: [Docstring-develop] DPS - possible bugs/features)

Tony J Ibbs (Tibs) tony@lsl.co.uk
Tue, 25 Sep 2001 11:11:28 +0100


David Goodger wrote:
> [Again, of general interest. Especially: Does anyone know
> of a URI scheme registry or official list? (URI schemes
> are "http", "ftp", "mailto", etc.; the part of a URI before
> the ":".)]

Hmm. The best references I found for URIs were (taken from comments in
the stpy version of docutils/TextRE.py):

  * "An index of WWW addressing schemes",
    http://www.w3.org/Addressing/schemes.html - note
    that this is an evolving document!

  * "Regex for URLs",
    http://www.foad.org/~abigail/Perl/url2.html,
    which shows rather well both how to do it and also
    why just about no one does

The first is probably about as official as you're going to get, and
shows why I gave up on the idea of being all inclusive!

> [Tony]
> > Is ``a:b`` *really* likely to be a sensible URI, given that ``a`` is
> > entirely "local"?
>
> What do you mean by "local"?

Sorry - "relative" rather than "absolute". And I should have said ``b``,
not ``a``. And it doesn't make sense to say that for some schemes. Oh
well.

> > Should we be dealing with the whole possible gamut of URIs, or
> > restricting ourselves to those most likely?
>
> There are two approaches:
>
> 1. Recognize all possible URI schemes, based on the grammar from
>    RFC2396. This has the unwanted side effect that ``a:b`` is
>    accidentally recognized as a URI. The workaround is to use inline
>    literals (not always correct: "the signal:noise ratio") or escape
>    the colon (ugly).
>
> 2. Recognize only "registered" URI schemes. Accidents like ``a:b``
>    won't happen. The disadvantage is that new URI schemes need to be
>    added to the parser. I have yet to find a definitive registry of
>    URI schemes (anybody know of one?), and I don't want to spend the
>    rest of my life adding new schemes as they pop up.
>
> Currently the reStructuredText parser takes approach #1. I wouldn't
> want to attempt #2 without an official & complete URI scheme
> reference.

It sounds, from reading the first reference above, as if it is not
possible to have an inclusive and final list of all URI schemes (note
the example of registering "note:" with IE so that one can browse files
using Notepad - I could instead have called it
"supercalifragilisticexpialidocious" for all anyone else can tell). So
that kills proposal 2.

The third way (not that I'm recommending *it*, either) is to identify a
"common subset" of schemes that we recognise. That appears to be what
other people normally do - of course, I'm exactly the sort of person who
then comes along and wants my uncommon scheme to be added to said
"common" subset...
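
To make the difference concrete, here is a rough Python sketch of
approach 1 (accept anything shaped like a scheme) next to a fixed list
of schemes. It is only an illustration: the scheme pattern follows the
RFC 2396 grammar for the scheme part, but everything after the colon is
simplified to "a run of non-space characters", and the names
``ANY_SCHEME``, ``COMMON_ONLY`` and ``find_schemes`` are made up for
this example (they are not taken from the parser).

```python
import re

# Approach 1 (simplified): any RFC 2396-shaped scheme, i.e. a letter
# followed by letters, digits, "+", "." or "-", then ":" and some
# non-space text. The rest of the URI grammar is NOT checked here.
ANY_SCHEME = re.compile(r"\b([A-Za-z][A-Za-z0-9+.-]*):(\S+)")

# A fixed list of schemes. This particular list is an assumption,
# chosen only for the sake of the example.
COMMON_SCHEMES = ("http", "ftp", "file", "mailto", "news")
COMMON_ONLY = re.compile(r"\b(%s):(\S+)" % "|".join(COMMON_SCHEMES))


def find_schemes(pattern, text):
    """Return the scheme part of every match of pattern in text."""
    return [match.group(1) for match in pattern.finditer(text)]


sample = ("the signal:noise ratio, x[a:b], "
          "see http://www.w3.org/Addressing/schemes.html")

# Approach 1 also picks up "signal" and "a" as if they were schemes.
any_hits = find_schemes(ANY_SCHEME, sample)

# The fixed list only sees "http", but it also misses a privately
# registered scheme such as "note:".
common_hits = find_schemes(COMMON_ONLY, sample)
note_hits = find_schemes(COMMON_ONLY, "note:readme.txt")
```

With the sample text, the permissive pattern reports ``signal`` and
``a`` alongside ``http``, which is the accident described for approach
1. The fixed list avoids that, but finds nothing in
``note:readme.txt``.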

A fourth way (again, not necessarily one I'm advocating - but it's only
moderately yucky) would be to say "these 'common' schemes are recognised
as-is/inline, but if you want an 'odd' scheme, you need to delimit your
URI" - in the context of reST, I guess that would mean something like::

    :uri:`strange-scheme:hum-ti-hum`

(a role seems natural here, and *looks* a bit like one of the common
ways of indicating URIs in plaintext). Then we get to play with "which
schemes are 'common'" (the "obvious" answer is
[http,ftp,file,mailto,news], but that's only for my value of obvious).
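
A minimal sketch of how that fourth option might behave, in Python. The
``:uri:`` role syntax is the one suggested above; the list of "common"
schemes, the regular expressions and the function name ``find_uris``
are assumptions made for illustration, not existing parser code.

```python
import re

# Hypothetical "common" schemes that are recognised without delimiters.
COMMON_SCHEMES = ("http", "ftp", "file", "mailto", "news")

# Explicitly delimited form, :uri:`scheme:rest`, accepting any
# RFC 2396-shaped scheme name.
ROLE_PART = r":uri:`([A-Za-z][A-Za-z0-9+.-]*:[^`\s]+)`"

# Bare form, accepted for the common schemes only.
BARE_PART = r"\b((?:%s):[^\s`]+)" % "|".join(COMMON_SCHEMES)

# The role alternative comes first, so a delimited URI is never
# re-read as a bare one.
URI_PATTERN = re.compile(ROLE_PART + "|" + BARE_PART)


def find_uris(text):
    """Return the URIs recognised in text, in order of appearance."""
    # Exactly one of the two groups is set for each match.
    return [m.group(1) or m.group(2) for m in URI_PATTERN.finditer(text)]


text = ("Slices like x[a:b] stay plain, "
        "http://www.w3.org/Addressing/schemes.html is picked up, "
        "and so is :uri:`strange-scheme:hum-ti-hum` when delimited.")

uris = find_uris(text)
```

Here ``x[a:b]`` and "the signal:noise ratio" are left alone without any
escaping, while the odd scheme is still available to anyone prepared to
type the role. The bare pattern is deliberately crude and would, for
instance, swallow trailing punctuation.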

I *do* think that there might be some objection (as you say) to having
to escape colons within text in a Python context - slices are just so
important (it might not be as bad as Guido's objection to reserving "<"
and ">" as delimiters, but still pretty bad). So it *may* be that the
fourth option is our simplest bet...

Tibs

--
Tony J Ibbs (Tibs)      http://www.tibsnjoan.co.uk/
"Bounce with the bunny. Strut with the duck.
 Spin with the chickens now - CLUCK CLUCK CLUCK!"
BARNYARD DANCE! by Sandra Boynton
My views! Mine! Mine! (Unless Laser-Scan ask nicely to borrow them.)