Need direction on mass find/replacement in HTML files

Peter Otten __peter__ at web.de
Fri Apr 30 16:27:39 EDT 2010


KevinUT wrote:

> Hello Folks:
> 
> I want to globally change the following:
> <a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">
> 
> into: <a href="pages/contacts.htm"><font color="#269BD5">
> 
> You'll notice that the match would be http://www.mysite.org/?page= but
> I also need to add a ".htm" to the end of "contacts" so it becomes
> "contacts.htm". This part of the URL is variable, so how can I use a
> combination of Python and/or a regular expression to replace the match
> above and also add a ".htm" to the end of that variable part?
> 
> Here are a few dummy URLs for example so you can see the pattern and
> the variable too.
> 
> <a href="http://www.mysite.org/?page=newsletter"><font
> color="#269BD5">
> 
> change to: <a href="pages/newsletter.htm"><font color="#269BD5">
> 
> <a href="http://www.mysite.org/?page=faq">
> 
> change to: <a href="pages/faq.htm">
> 
> So, again the script needs to replace all the full absolute URL links
> with nothing and replace the PHP "?page=" with just the variable page
> name (i.e. contacts) plus the ".htm"
> 
> Is there a combination of Python code and/or regex that can do this?
> Any help would be greatly appreciated!

I don't know whether the following will be more reliable in practice than a
simple regex, but here goes:

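# Python 2, BeautifulSoup 3
import sys
import urlparse
from BeautifulSoup import BeautifulSoup as BS

if __name__ == "__main__":
    # Parse the HTML file named on the command line.
    html = open(sys.argv[1]).read()
    bs = BS(html)
    for a in bs("a"):
        # Skip anchors that have no href attribute (e.g. <a name="...">).
        href = a.get("href")
        if href is None:
            continue
        url = urlparse.urlparse(href)
        if url.netloc == "www.mysite.org":
            qs = urlparse.parse_qs(url.query)
            # Rewrite only links that actually carry a "page" parameter.
            if u"page" in qs:
                a["href"] = "pages/" + qs[u"page"][0] + ".htm"
    # Write the modified document to stdout.
    print bs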
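
The URL-rewriting step can also be tried on its own, without an HTML parser.
The following is only a sketch for Python 3, where the urlparse module has
become urllib.parse; the function name rewrite_href and its host parameter are
made up for illustration. It leaves any href it does not recognise untouched.

```python
from urllib.parse import urlparse, parse_qs


def rewrite_href(href, host="www.mysite.org"):
    """Return "pages/<page>.htm" for links to host that carry ?page=<page>.

    Any other href is returned unchanged.
    """
    url = urlparse(href)
    if url.netloc != host:
        return href
    # parse_qs maps each parameter name to a list of values.
    pages = parse_qs(url.query).get("page")
    if not pages:
        return href
    return "pages/" + pages[0] + ".htm"


if __name__ == "__main__":
    print(rewrite_href("http://www.mysite.org/?page=contacts"))
```

Relative links and links to other hosts have a different netloc, so they pass
through as they are.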

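For comparison, the "simple regex" route might look roughly like the sketch
below (Python 3; the names LINK_RE and rewrite_links are made up for
illustration). It assumes every link is written exactly as in the examples:
double-quoted, on a single line, and with page as the only query parameter.

```python
import re

# Matches href="http://www.mysite.org/?page=NAME" and captures NAME.
LINK_RE = re.compile(r'href="http://www\.mysite\.org/\?page=([^"&]+)"')


def rewrite_links(html):
    """Replace every matching href with href="pages/NAME.htm"."""
    return LINK_RE.sub(r'href="pages/\1.htm"', html)


if __name__ == "__main__":
    sample = '<a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">'
    print(rewrite_links(sample))
```

Links that use single quotes, are wrapped across lines, or carry additional
parameters would be missed by this pattern, which is where the parser-based
version may turn out to be more forgiving.
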
Peter


