Strip HTML tags from downloaded files

Hans Nowak wurmy at earthlink.net
Wed Dec 5 18:54:27 EST 2001


Thomas Pham wrote:
> 
> When I use urlretrieve to download a file from the web, the raw text file has HTML tags embedded at the beginning and the end of the file.
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
>  <HEAD>
> 
> </PRE>
> </BODY></HTML>
> 
> Is there any way to strip all the HTML tags from the file?
> 
> Thanks,
> --

Look into the sgmllib module. Here's an example; you will probably
want to change it to suit your needs:

#---begin---

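import sgmllib

class HTMLStripper(sgmllib.SGMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.data = []  # text chunks, in document order

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.data.append(data)

    def getdata(self):
        # Join with an empty separator: string.join's default separator is
        # a single space, which would add spaces wherever the parser
        # happened to split the text into chunks.
        return "".join(self.data)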


if __name__ == "__main__":

    hs = HTMLStripper()
    hs.feed(open("filename.html", "rb").read())
    hs.close()  # make the parser process any text it is still buffering
    print hs.getdata()
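
sgmllib was removed in Python 3. There, roughly the same idea can be
sketched with html.parser.HTMLParser from the standard library. The
names TagStripper and strip_tags are made up for this illustration, and
the sketch assumes the page has already been decoded to a str:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps the text between tags and drops the tags (illustrative name)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into text
        # before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text that is not part of a tag.
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags(html_text):
    """Return html_text with tags, comments and the DOCTYPE removed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered in the parser
    return parser.get_text()
```

Whitespace and newlines between tags are kept as they are, and the
contents of <script> and <style> elements also come through as text, so
you may still want to filter the result.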


HTH,

--Hans


