[Doc-SIG] What counts as a url?

M.-A. Lemburg mal@lemburg.com
Sun, 18 Mar 2001 23:21:48 +0100

Edward Welbourne wrote:
> >>     ([a-zA-Z0-9-_.!~*'();/?:@&=+$,#] | %[0-9a-fA-F][0-9a-fA-F])+
> >       r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)'
> erm ...
> I'm fairly sure you're allowed at most one # and at most one ? in an
> URL: any others *must* be url-encoded as %[0-9A-Fa-f]{2} tokens.  I'm
> fairly sure you aren't allowed an & before the ? and that the # has to
> appear after the ? and all &
> Marc's regex doesn't mention = and ? explicitly, but they're definitely
> allowed in URLs.  Are () really allowed in URLs ? 


> How about {} and [] ?

No. See the RFC Appendix A for details.

> I'm fairly sure : and , are allowed in paths.  But I'd expect :,{}()[]*!
> all to be url-encoded, anyway, so they shouldn't appear in the regexen;
> they're covered by % and \w.
> There is an RFC for URIs, I mailed it to Edward recently;
> I guess that'd be
> >> and looked up "RFC 2396":http://www.w3.org/Addressing/rfc2396.txt .
> so go read the appendices (pedantically).

FYI, here's a working reference: http://sunsite.dk/RFC/rfc/rfc2396.html
> I know the relevant RFC has a helpful Appendix A giving BNF and Appendix
> B advising how to parse, complete with a regex for parsing (which
> presumes you *check* separately, based on the BNF).
> I really don't like that space between the URL and the full-stop (sorry,
> `period', to translate into North American Anglic); but, no, I can't see
> how to avoid it.  Other than to treat the end of a URL as `this may have
> been the end of a sentence', even if it isn't followed by a . so authors
> of doc-strings know they can treat the URL as sentence-end (unconvinced).

Note that the RE I mentioned was not supposed to parse all URLs
allowed by the different standards out there. The bug you found wasn't
intended either, BTW ;-) The RE is basically a very simple 
approximation of what is allowed and finds most instances of 
URLs in plain text.

> oh - Mark:
> >       r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)'
> did you really mean `from # to _ inclusive'   ^^^ or did you mean to say
> `#, - or _' ?  Hmm, I think you mean the latter: put - last in the [...]
> But the latter reading claims you've missed out / in the path, and the
> former claims most entries in your [...] are duplicates of ones in the
> #-_ range.  I'm confused.

It's a bug, just like the omission of "/=?" which was covered up
by re.compile() using the whole range #-_ of characters...
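The cover-up is easy to demonstrate (a small sketch; example.com is
just a placeholder): inside a character class, #-_ is a range from
"#" (0x23) to "_" (0x5F), and that range happens to contain "/", "="
and "?" among much else:

```python
import re

# Inside a character class, "#-_" denotes the range 0x23..0x5F, which
# happens to contain "/", "=" and "?" (and many other characters).
covered = [chr(c) for c in range(ord('#'), ord('_') + 1)]
assert all(ch in covered for ch in '/=?')

# So the buggy RE still matched URLs using those characters:
buggy = re.compile(r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)')
match = buggy.search('see http://example.com/a?b=c for details')
assert match.group(1) == 'http://example.com/a?b=c'
```
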
> If the - is last, and we mention = explicitly, we can phrase the
> character class as [...=;-] with its hair standing on end, which seems
> entirely appropriate.

Good idea ;) 

Oh, and please also add the slash and all the other characters in #-_
which could be useful in URLs.
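A corrected class might then read as follows (a sketch only, not the
final word on which characters belong in it): "/", "=" and "?" are
listed explicitly, and "-" is moved to the end of the class so the
regex engine treats it as a literal rather than a range operator.

```python
import re

# Sketch of a corrected RE: "/", "=" and "?" appear explicitly, and
# "-" sits last in the class so it is a literal, not a range operator.
url_re = re.compile(r"\b((?:http|ftp|https|mailto)://"
                    r"[\w@&#/=?.!~*'();:,+$%-]+\b/?)")

m = url_re.search("and looked up http://www.w3.org/Addressing/rfc2396.txt .")
assert m.group(1) == "http://www.w3.org/Addressing/rfc2396.txt"
```

The trailing \b/? keeps the match from swallowing sentence
punctuation after the URL, as in the example above.
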
> <sigh>.  Why do regexes always feel like the right answer to the wrong
> question - albethey useful - ?
> Edward: will working in EBNF spare us these messes ?
> (I'm assuming that's Extended Backus-Naur Form, subject to spelling.)

Appendix A of the RFC has a "Collected" BNF form -- doesn't look any
simpler than the RE, though, only less frightening.

> Tibs: would mxTextTools let us say this stuff less uglily ?

Not less ugly, but certainly with more certainty as to what passes
and what doesn't...
> I'm inclined to advise running with the way the RFC's appendices'
> approach the problem, though: first, parse according to Appendix B's
> regex, then (it explains better than I can here) take the fragments into
> which it's cut your putative URL text and check each fragment for
> validity according to the appropriate rules in appendix A, which depend
> on the scheme; if any fail their check, decide that this wasn't a URL
> anyway.  Albeit this means fully parsing the URL, so maybe the right
> function to add to urlparse is one which reads, from a string, the
> longest initial chunk which is a URL, returning a tuple whose first item
> is the length, remainder are urlparse.urlparse()'s answers (at least
> when the length is positive).

Seems overly complicated to me, but if you really care about
standards-conformant URI recognition, then I'd suggest going ahead
and writing a patch for urllib which defines a function for finding
URLs in text, e.g. findurl(text, start, end) -> (urlstart, urlend)
or None.
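Such a helper might look like this (a sketch only: "findurl" is the
name proposed above, not a function urllib actually has; modern
Python spells the module urllib.parse rather than urlparse, and
mailto is left out since mailto URLs carry no // network location):

```python
import re
from urllib.parse import urlparse   # in 2001: the urlparse module

# Loose first pass: anything that looks roughly like a URL.
_CANDIDATE = re.compile(r"\b(?:http|ftp|https)://[\w@&#/=?.!~*'();:,+$%-]+")

def findurl(text, start=0, end=None):
    """Return (urlstart, urlend) for the first URL in text[start:end],
    or None if nothing URL-like is found."""
    if end is None:
        end = len(text)
    match = _CANDIDATE.search(text, start, end)
    if match is None:
        return None
    # Stricter second pass: cross-check the candidate with urlparse
    # and insist on a scheme plus a network location.
    parts = urlparse(match.group(0))
    if parts.scheme and parts.netloc:
        return match.span()
    return None
```

Cross-checking the loose regex hit with urlparse is a cheap version
of the two-stage approach above: scan first, then apply a stricter
structural check before believing the match.
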

> > I don't think it makes sense to include schemes which are not
> > supported by your everyday browser, so only the most common ones
> > are included.
> I think it does make sense to include them, for two reasons:
>   i) we should *recognise that the text is a URL* even when we know not
>      what to do with it, if only so we can warn the user - the principle
>      of least surprise says that if you *have to* surprise the user
>      (whose browser does know about a scheme you're ignoring) you should
>      at least have the decency to warn.
>  ii) forward compatibility - someone may add a scheme that really does
>      deserve to be in there, and the tool should need minimal revision
>      to cope.
> and I thought most browsers *did* cope with gopher, which you omit ...
> Yes, I admit it, I'm an old fuddy-duddy.
> But the right answer is to use the urlparse module, not to ad-hock
> together your own; if you don't like how urlparse does things, fix it.
> (Note: I'm as guilty as anyone on this - I'd written a much longer
> version of this e-mail, complete with my own od-hack regex, before even
> thinking to look for a module, at which point I instantly *knew* the
> module was bound to exist - and not be to my liking.)

Marc-Andre Lemburg
Company & Consulting:                           http://www.egenix.com/
Python Pages:                           http://www.lemburg.com/python/