regex for href substitution

Ian Bicking ianb at
Wed Feb 19 00:48:11 CET 2003

On Tue, 2003-02-18 at 16:44, Robin Becker wrote:
> I'm sure this must have been done before, but has anyone got a regex for
> extracting/changing html href attributes. I've done this before with
> htmllib, but it's been suggested that we can do this with re.
> The application involves doing a special purpose forwarding proxy, so
> perhaps someone has already done something similar.

Sure, I've done this before, though I can't recall where the code is.

Anyway, a regex like this will mostly work:

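import re

def subber(match):
    # Keep the text around the URL; rewrite only the URL itself (group 2).
    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)

href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
page = href_re.sub(subber, page)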

Where rewrite_url does whatever rewriting you want.  You'll also want to
add a base href to the page, so that images, CSS, and other stuff will
work.  So the rewritten URL should give the complete URL.  This won't
catch the cases where someone uses <a href=page>, i.e., without quotes. 
You'll have to use the regex r'(<a[^>]+href=)([^"][^ >]*)(.*?>)' for that.
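
As a rough sketch of how the quoted and unquoted patterns might be combined, the following runs one pass for each. The rewrite_url body and the /proxy?url= prefix are placeholders invented for illustration, and rewrite_links is a hypothetical helper name. Single-quoted attributes (href='page') are not handled properly by either pattern.

```python
import re

# Both patterns are taken from the message: three groups, with the URL in group 2.
quoted_href_re = re.compile(r'(<a[^>]+href=")([^"]*)(".*?>)', re.I | re.S)
unquoted_href_re = re.compile(r'(<a[^>]+href=)([^"][^ >]*)(.*?>)', re.I | re.S)

def rewrite_url(url):
    # Placeholder rewriting rule (an assumption, not from the original post).
    return "/proxy?url=" + url

def subber(match):
    # Keep the text around the URL; rewrite only the URL itself.
    return match.group(1) + rewrite_url(match.group(2)) + match.group(3)

def rewrite_links(page):
    # Quoted hrefs first. The unquoted pattern requires a non-quote character
    # right after "href=", so it skips the links already rewritten here.
    page = quoted_href_re.sub(subber, page)
    return unquoted_href_re.sub(subber, page)
```

Calling rewrite_links(page) on the fetched HTML returns the page with both kinds of link rewritten.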

You may wish to use the urlparse module as well.
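
In Python 3 the urlparse module lives on as urllib.parse. A minimal sketch of a rewrite_url built on it, assuming the proxy wants the complete target URL passed as a query parameter; the base URL, the proxy prefix, and the make_absolute helper are all made up for the example:

```python
from urllib.parse import quote, urljoin

# Hypothetical values for illustration only.
BASE_URL = "http://example.com/docs/index.html"
PROXY_PREFIX = "http://proxy.example/fetch?url="

def make_absolute(url, base_url=BASE_URL):
    # urljoin resolves relative links against the page's own URL and
    # leaves links that are already absolute unchanged.
    return urljoin(base_url, url)

def rewrite_url(url, base_url=BASE_URL):
    # Percent-encode the complete URL so it survives as a query value.
    return PROXY_PREFIX + quote(make_absolute(url, base_url), safe="")
```

With this, a relative link such as page.html is resolved against the page it came from before being handed to the proxy, so the rewritten URL gives the complete URL.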

Ian Bicking  ianb at
4869 N. Talman Ave., Chicago, IL 60625  /  773-275-7241
"There is no flag large enough to cover the shame of 
 killing innocent people" -- Howard Zinn
