regex for href substitution

Robin Becker robin at jessikat.fsnet.co.uk
Wed Feb 19 04:07:43 EST 2003


In article <mailman.1045612196.24428.python-list at python.org>, Ian
Bicking <ianb at colorstudy.com> writes
.......
>
>Anyway, a regex like this will mostly work:
>
>import re
>
>href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
>def subber(match):
>    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)
>page = href_re.sub(subber, page)
>
>
.....
The trouble with the above approach is that it's a bit naive; href
attributes come in a lot of shapes.

<a href  =   'my url'> fails the above, so we need to allow for optional
whitespace and for alternate quote characters. We don't actually need
quotes at all, e.g.

<a href=/cgi-bin/bongo.cgi> should be legal in older HTML. Also, there are
other possible attributes in <a> tags now, so we can't be sure it's
always <a href=...>. How about <a class="bingo" href="...">?
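
As a rough sketch of what "lots of white spaces and alternates" might look
like in practice, here is a more tolerant pattern. It is only an
illustration: rewrite_url() is a hypothetical stand-in for the function in
the quoted post, rewrite_hrefs() is a name made up here, and the pattern can
still be fooled (for example by an href inside a comment or a script).

```python
import re

def rewrite_url(url):
    # Hypothetical stand-in for the rewrite_url() in the quoted post.
    return '/out?u=' + url

href_re = re.compile(r'''
    (<a\b[^>]*?\shref\s*=\s*)   # <a, any earlier attributes, then href =
    (?: "([^"]*)"               # double-quoted value
      | '([^']*)'               # single-quoted value
      | ([^\s>]+)               # unquoted value
    )''', re.I | re.S | re.X)

def subber(match):
    prefix, dq, sq, bare = match.groups()
    # Keep whatever quoting style the original attribute used.
    if dq is not None:
        return prefix + '"' + rewrite_url(dq) + '"'
    if sq is not None:
        return prefix + "'" + rewrite_url(sq) + "'"
    return prefix + rewrite_url(bare)

def rewrite_hrefs(page):
    return href_re.sub(subber, page)
```

This copes with the three shapes mentioned above, but every new variation
means another alternative in the pattern.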

The more I think about it, the more I seem to prefer the htmllib
approach.
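
For illustration, a minimal sketch of the parser-based approach. Note the
swap: htmllib was removed in Python 3, so this uses html.parser.HTMLParser
from the standard library instead. The class name HrefRewriter, the helper
rewrite_hrefs_with_parser() and the rewrite_url() stand-in are all
hypothetical, and the output normalises <a> attributes to double quotes
rather than preserving the original spelling.

```python
from html import escape
from html.parser import HTMLParser

def rewrite_url(url):
    # Hypothetical stand-in for the rewrite_url() in the quoted post.
    return '/out?u=' + url

class HrefRewriter(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so entities reach the handlers unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            # Not an anchor: pass the original tag text through untouched.
            self.out.append(self.get_starttag_text())
            return
        parts = ['<a']
        for name, value in attrs:
            if value is None:          # attribute without a value
                parts.append(' ' + name)
                continue
            if name == 'href':
                value = rewrite_url(value)
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        parts.append('>')
        self.out.append(''.join(parts))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

    def handle_pi(self, data):
        self.out.append('<?%s>' % data)

def rewrite_hrefs_with_parser(page):
    parser = HrefRewriter()
    parser.feed(page)
    parser.close()
    return ''.join(parser.out)
```

The parser takes care of whitespace, quoting and attribute order, so the
only URL-specific code left is the call to rewrite_url().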
-- 
Robin Becker



