regex for href substitution
Robin Becker
robin at jessikat.fsnet.co.uk
Wed Feb 19 04:07:43 EST 2003
In article <mailman.1045612196.24428.python-list at python.org>, Ian
Bicking <ianb at colorstudy.com> writes
.......
>
>Anyway, a regex like this will mostly work:
>
>import re
>
>href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
>
>def subber(match):
>    # rewrite_url() is assumed to be defined elsewhere
>    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)
>
>page = href_re.sub(subber, page)
>
>
.....
The trouble with the above approach is that it's a bit naive; href
attributes come in a lot of shapes.
<a href = 'my url'> fails the above, so we need to allow optional
whitespace and alternate quote characters. We don't actually need
quotes at all, e.g. <a href=/cgi-bin/bongo.cgi> should be legal in
older HTML. Also, <a> tags can now carry other attributes, so we can't
be sure it's always <a href=>; how about <a class="bingo" href="...">?
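
To make the cases above concrete, here is a small sketch that runs the quoted regex against each tag shape and compares it with a more tolerant pattern. The names tolerant_re and find_href are made up for illustration; they are not from the original post. As far as I can tell the quoted regex does cope with <a class="bingo" href="...">, because [^>]+ skips over the extra attribute, but it misses the single-quoted and unquoted forms. The tolerant pattern is still only a regex, so it can be fooled (for instance by an attribute such as data-href).

```python
import re

# The regex quoted above, unchanged.
href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)

# Hypothetical, more tolerant variant: optional whitespace around '=',
# and a double-quoted, single-quoted, or unquoted value.
tolerant_re = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''',
    re.I | re.S,
)


def find_href(tag):
    """Return the href value found by tolerant_re, or None if no match."""
    m = tolerant_re.search(tag)
    if m is None:
        return None
    # Exactly one of the three alternatives took part in the match.
    return next(g for g in m.groups() if g is not None)


samples = [
    '<a href="x.html">',
    "<a href = 'my url'>",
    '<a href=/cgi-bin/bongo.cgi>',
    '<a class="bingo" href="...">',
]

for tag in samples:
    # Columns: tag, whether the quoted regex matched, value found by the sketch
    print(tag, bool(href_re.search(tag)), find_href(tag))
```
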
The more I think about it, the more I seem to prefer the htmllib
approach.
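
For comparison, a parser-based version might look roughly like the sketch below. It is not htmllib itself: htmllib is a Python 2 module and was removed in Python 3, so this uses html.parser.HTMLParser from the standard library instead. The names HrefRewriter and rewrite_hrefs are hypothetical, and rewrite_url is whatever callback you supply. Tags other than <a> are copied through as written; <a> tags are rebuilt with double-quoted attributes, so the output is normalised rather than byte-for-byte identical.

```python
from html import escape
from html.parser import HTMLParser


class HrefRewriter(HTMLParser):
    """Re-emit a document, passing each <a href> value through a callback."""

    def __init__(self, rewrite_url):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.rewrite_url = rewrite_url
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # Other tags are copied through exactly as they appeared
            self.out.append(self.get_starttag_text())
            return
        parts = ["<a"]
        for name, value in attrs:
            if value is None:  # bare attribute with no value
                parts.append(" " + name)
                continue
            if name == "href":
                value = self.rewrite_url(value)
            # The parser unescapes attribute values, so escape them again
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        parts.append(">")
        self.out.append("".join(parts))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_hrefs(page, rewrite_url):
    """Return page with every <a href> value replaced by rewrite_url(value)."""
    parser = HrefRewriter(rewrite_url)
    parser.feed(page)
    parser.close()
    return "".join(parser.out)


page = ("<p><a href = 'my url'>x</a> <a href=/cgi-bin/bongo.cgi>y</a> "
        '<a class="bingo" href="z.html">z</a></p>')
# str.upper stands in for a real rewrite_url function
print(rewrite_hrefs(page, str.upper))
```

The parser deals with the whitespace, the quoting styles and the attribute order, so the callback only ever sees the bare URL.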
--
Robin Becker