regex for href substitution
Ian Bicking
ianb at colorstudy.com
Tue Feb 18 18:48:11 EST 2003
On Tue, 2003-02-18 at 16:44, Robin Becker wrote:
> I'm sure this must have been done before, but has anyone got a regex for
> extracting/changing html href attributes. I've done this before with
> htmllib, but it's been suggested that we can do this with re.
>
> The application involves doing a special purpose forwarding proxy, so
> perhaps someone has already done something similar.
Sure, I've done this before, though I can't recall where the code is
now.
Anyway, a regex like this will mostly work:
    import re
    href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)

    def subber(match):
        # groups: (tag up to the opening quote)(URL)(rest of the tag)
        return match.group(1) + rewrite_url(match.group(2)) + match.group(3)

    page = href_re.sub(subber, page)
Where rewrite_url does whatever rewriting you want. You'll also want to
add a base href to the page, so that images, CSS, and other relative
references still work. Since the base href then points at the original
site, rewrite_url should return the complete URL. This regex won't
catch the cases where someone uses <a href=page>, i.e., without quotes.
You'll have to use the regex r'(<a[^>]+href=)([^"][^ >]*)(.*?>)' for
that.
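
As a rough sketch of how the two regexes might be combined in one pass
over a page, something like the following could work. It is written for
Python 3, and PROXY_PREFIX, rewrite_url's behaviour, and rewrite_links
are all hypothetical names chosen for illustration; substitute whatever
rewriting your proxy actually needs.

```python
import re
from urllib.parse import quote

# The two patterns from the message: quoted and unquoted href values.
quoted_href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
unquoted_href_re = re.compile(r'(<a[^>]+href=)([^"][^ >]*)(.*?>)', re.I | re.S)

# Hypothetical proxy endpoint, used only for this example.
PROXY_PREFIX = "http://proxy.example/fetch?url="

def rewrite_url(url):
    # Assumption: the proxy takes the target URL as an encoded query value.
    return PROXY_PREFIX + quote(url, safe="")

def subber(match):
    # Keep the tag text around the URL unchanged, swap only the URL.
    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)

def rewrite_links(page):
    # Quoted hrefs first; the unquoted pattern skips values starting with ".
    page = quoted_href_re.sub(subber, page)
    return unquoted_href_re.sub(subber, page)

sample = '<a href="http://a.com/x">A</a> <A class=c HREF=page.html>B</A>'
result = rewrite_links(sample)
print(result)
```

Note that neither pattern handles single-quoted values such as
href='page'; the unquoted pattern would treat the quote characters as
part of the URL, so that case would need its own regex.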
You may wish to use the urlparse module as well, for example
urlparse.urljoin to turn relative links into complete URLs.
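
For instance, resolving each link against the page's own URL before
rewriting it might look roughly like this. The sketch assumes Python 3,
where the urlparse functions live in urllib.parse; BASE_URL and
absolutize are made-up names for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page being proxied.
BASE_URL = "http://example.com/docs/index.html"

def absolutize(url, base=BASE_URL):
    # urljoin resolves relative references and leaves absolute URLs alone.
    return urljoin(base, url)

print(absolutize("page.html"))
print(absolutize("/img/logo.png"))
print(absolutize("http://other.org/x"))
```

A rewrite_url function could call something like absolutize first, so
that what it hands to the proxy is always a complete URL.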
--
Ian Bicking ianb at colorstudy.com http://colorstudy.com
4869 N. Talman Ave., Chicago, IL 60625 / 773-275-7241
"There is no flag large enough to cover the shame of
killing innocent people" -- Howard Zinn