[Doc-SIG] What counts as a url?

Tony J Ibbs (Tibs) tony@lsl.co.uk
Mon, 19 Mar 2001 10:35:48 -0000

Edward Welbourne ("Eddy" for reasons of clarity) wrote:
> <sigh>.  Why do regexes always feel like the right answer to the wrong
> question - albethey useful - ?

Because they are, of course (well, actually, you have it exactly

> Tibs: would mxTextTools let us say this stuff less uglily ?

Well, in my opinion, yes, but that's because it's actually a proper
parser, so one takes a different approach. Not that I'm volunteering to
write it, mind you.

> But the right answer is to use the urlparse module, not to ad-hock
> together your own; if you don't like how urlparse does things, fix it.
> (Note: I'm as guilty as anyone on this - I'd written a much longer
> version of this e-mail, complete with my own od-hack regex,
> before even thinking to look for a module, at which point I instantly
> *knew* the module was bound to exist - and not be to my liking.)

There are two problems here:

1. Find the candidate (possible) URL
2. Validate it as such

The first is the one we're addressing proximately, and for once I would
argue that it is better to find *too many* matches, rather than too few.
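
As a minimal sketch of what "too many rather than too few" might look
like in Python: the pattern below is an assumption for illustration, not
what any existing tool uses, and the names CANDIDATE_URL and
find_candidate_urls are made up. It only looks for "scheme://something"
and cheerfully swallows trailing punctuation, on the grounds that a later
step can reject or trim what it does not like.

```python
import re

# Deliberately loose: a scheme, "://", then everything up to whitespace
# or an obvious closing delimiter. Over-matching is accepted at this stage.
CANDIDATE_URL = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://[^\s<>\"']+")

def find_candidate_urls(text):
    """Return every substring that might be a URL, erring on the generous side."""
    return CANDIDATE_URL.findall(text)
```

Note that a sentence-final full stop ends up inside the match; that is
the price of being generous here.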

The second is what Eddy appears to be talking about, with urlparse, etc.
It is optional (i.e., one would only do it if validation is selected).
It *may* be hard to "unstitch" the markup that has already been applied
by the time validation is done, so it is likely to get left until later.
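
A rough sketch of the optional validation step, assuming a modern Python
in which the urlparse module lives at urllib.parse. The helper name
looks_like_url and the allowed_schemes default are hypothetical, and
"has a known scheme and a network location" is only one possible
definition of valid.

```python
from urllib.parse import urlparse

def looks_like_url(candidate, allowed_schemes=("http", "https", "ftp")):
    """Return True if the candidate parses with a known scheme and a host part."""
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse rejects some malformed input, e.g. an unclosed "[" in the host.
        return False
    return parts.scheme in allowed_schemes and bool(parts.netloc)
```

Anything the candidate-finding step over-matched (for example a bare
"http://") would simply fail this check.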

Given a big problem, leave it until later...


Tony J Ibbs (Tibs)      http://www.tibsnjoan.co.uk/
"How fleeting are all human passions compared with the massive
continuity of ducks." - Dorothy L. Sayers, "Gaudy Night"
My views! Mine! Mine! (Unless Laser-Scan ask nicely to borrow them.)