Strip HTML tags from downloaded files
Hans Nowak
wurmy at earthlink.net
Wed Dec 5 18:54:27 EST 2001
Thomas Pham wrote:
>
> When I use urlretrieve to download a file from the web, the raw text file has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
> <HEAD>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?
>
> Thanks,
> --
Look into the sgmllib module. Here's an example; you will probably
want to change it to suit your needs:
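#---begin---
import sgmllib
import string

class HTMLStripper(sgmllib.SGMLParser):
    """Collects the text between tags and throws the tags away."""

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.data = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of a tag.
        self.data.append(data)

    def getdata(self):
        # string.join puts a single space between the collected pieces.
        return string.join(self.data)

if __name__ == "__main__":
    hs = HTMLStripper()
    hs.feed(open("filename.html", "rb").read())
    hs.close()  # flush any text still sitting in the parser's buffer
    print hs.getdata()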
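
Note that sgmllib exists only in Python 2; it was removed in Python 3. On Python 3, roughly the same approach can be sketched with html.parser.HTMLParser from the standard library. This is only a minimal sketch: the names strip_tags and get_text are made up for illustration, and the pieces are joined without a separator here, unlike string.join above.

```python
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of a tag.
        self.chunks.append(data)

    def get_text(self):
        # Join the pieces exactly as they appeared, with no extra separator.
        return "".join(self.chunks)


def strip_tags(html):
    """Hypothetical helper: return the text of `html` with tags removed."""
    parser = HTMLStripper()
    parser.feed(html)
    parser.close()  # flush any text still sitting in the parser's buffer
    return parser.get_text()
```

To use it on a downloaded file, read the file as text first, for example strip_tags(open("filename.html", encoding="utf-8", errors="replace").read()); the encoding is a guess and depends on the page. Text inside <script> and <style> elements is passed to handle_data too, so it would need filtering separately if the page contains any.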
HTH,
--Hans