regex for href substitution

Ian Bicking ianb at colorstudy.com
Tue Feb 18 18:48:11 EST 2003


On Tue, 2003-02-18 at 16:44, Robin Becker wrote:
> I'm sure this must have been done before, but has anyone got a regex for
> extracting/changing html href attributes. I've done this before with
> htmllib, but it's been suggested that we can do this with re.
> 
> The application involves doing a special purpose forwarding proxy, so
> perhaps someone has already done something similar.

Sure, I've done this before, though I can't recall where the code is
now.

Anyway, a regex like this will mostly work:

import re

def subber(match):
    # Keep the text around the URL (groups 1 and 3); rewrite only the URL.
    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)

href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
page = href_re.sub(subber, page)


Where rewrite_url does whatever rewriting you want.  You'll also want to
add a base href to the page, so that images, CSS, and other relative
references still resolve against the original site.  Because of that,
rewrite_url should return a complete (absolute) URL.  This regex won't
catch links written without quotes, e.g. <a href=page>.  For those
you'll have to use the regex r'(<a[^>]+href=)([^"][^ >]*)(.*?>)'.
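
To make the two-pattern idea concrete, here is a minimal sketch that runs
the quoted pattern first and the unquoted one second.  The rewrite_links
helper and the "/proxy?url=" prefix are made up for illustration; they
are not from any real proxy.  Neither pattern handles single-quoted
attributes properly, so treat this as "mostly works" at best.

```python
import re

# The two patterns from this message: double-quoted and unquoted href values.
quoted_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
unquoted_re = re.compile(r'(<a[^>]+href=)([^"][^ >]*)(.*?>)', re.I | re.S)

def rewrite_url(url):
    # Hypothetical placeholder: route every link through a made-up proxy path.
    return "/proxy?url=" + url

def subber(match):
    # Keep the text around the URL (groups 1 and 3); rewrite only the URL.
    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)

def rewrite_links(page):
    # Quoted links first; the unquoted pattern skips them because its URL
    # group refuses to start with a double quote.
    page = quoted_re.sub(subber, page)
    return unquoted_re.sub(subber, page)

page = '<a href="a.html">A</a> <a class=x href=b.html>B</a>'
print(rewrite_links(page))
```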

You may wish to use the urlparse module as well.
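
As a rough sketch of how that could look: in Python 3 the urlparse
module lives on as urllib.parse, and its urljoin function resolves a
relative link against the page's own URL.  BASE_URL and PROXY_PREFIX
below are invented values, and percent-encoding the target into a query
string is just one possible scheme, not necessarily what your proxy
needs.

```python
from urllib.parse import urljoin, quote

# Assumed values for illustration only.
BASE_URL = "http://example.com/docs/index.html"    # URL of the page being proxied
PROXY_PREFIX = "http://proxy.example/fetch?url="   # made-up proxy entry point

def rewrite_url(url, base=BASE_URL):
    # Resolve relative links against the page URL to get a complete URL.
    absolute = urljoin(base, url)
    # Percent-encode the whole target so it can sit inside a query string.
    return PROXY_PREFIX + quote(absolute, safe="")

print(rewrite_url("page.html"))
print(rewrite_url("/top.html"))
```

Links that are already absolute pass through urljoin unchanged, so the
same function covers both cases.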

-- 
Ian Bicking  ianb at colorstudy.com  http://colorstudy.com
4869 N. Talman Ave., Chicago, IL 60625  /  773-275-7241
"There is no flag large enough to cover the shame of 
 killing innocent people" -- Howard Zinn





