Stripping HTML tags from a string
aleaxit at yahoo.com
Wed May 2 21:12:29 CEST 2001
"Colin Meeks" <colinmeeks at home.com> wrote in message
news:lrYH6.2444$2_.528918 at news3.rdc1.on.home.com...
> I know I've seen this somewhere before, but can't find it now I want it.
> Does anybody know how to strip all HTML tags from a string. I imagine I
> would use a regular expression, but am not fully up to speed on these yet.
You _can_ do it with regular expressions, but it's hard to get full
generality. Standard module sgmllib is SO much easier to use...
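To make the "hard to get full generality" point concrete, here is a small sketch of the obvious regular-expression approach and one input that defeats it. The names TAG_RE and naive_strip are made up for this illustration, and the pattern is only the simplest one a newcomer might try, not a recommended solution.

```python
import re

# Naive pattern: everything from '<' up to the next '>'.
TAG_RE = re.compile(r'<[^>]*>')

def naive_strip(text):
    """Remove anything that looks like <...> from text."""
    return TAG_RE.sub('', text)

# Works on simple markup.
print(naive_strip('<B>bold</B> text'))

# Breaks when an attribute value contains '>': the match stops at the
# first '>' and the rest of the tag leaks into the output.
print(naive_strip('<IMG ALT="a > b" SRC="x.gif">after'))
```

The second call leaves part of the tag behind, which is the kind of case a real parser handles for you.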
> i.e "<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
> it<BR>The End"
> would give me "Hello This is really cool isn't it The End"
> I would like to replace all <P> and <BR> with a space as this would result
> in something that is more readable.
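import sgmllib   # Python 2 standard library (not available in Python 3)
class Cleaner(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []              # text fragments, in document order
    def do_p(self, *junk):
        self.result.append(' ')       # <P> becomes a space
    def do_br(self, *junk):
        self.result.append(' ')       # <BR> becomes a space
    def handle_data(self, data):
        self.result.append(data)      # keep text; all other tags are ignored
if __name__ == '__main__':
    data = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
it<BR>The End"""
    parser = Cleaner()
    parser.feed(data)
    parser.close()
    print(''.join(parser.result))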
Running this produces:
Hello This is really cool isn't
it The End
which isn't QUITE what you asked for, but then there are contradictions
between some aspects of your specs -- e.g. you specifically asked for
all <P> tags to be "replaced with a space", yet your example string
starts with a <P> but the desired result does NOT start with a space.
Anyway, I hope this is clear enough to let you solve such contradictions
and get exactly the kind of processing that you DO really require!
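As one possible way to settle the contradiction, here is a hedged sketch that collapses every run of whitespace to a single space and trims the ends, so the leading <P> no longer produces a leading space. It assumes Python 3, where sgmllib is no longer available, and so swaps in the standard html.parser module; the names Cleaner and strip_html are chosen for this example only.

```python
from html.parser import HTMLParser

class Cleaner(HTMLParser):
    """Collect text content, turning <P> and <BR> into spaces."""

    def __init__(self):
        super().__init__()
        self.result = []  # text fragments, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this.
        if tag in ('p', 'br'):
            self.result.append(' ')

    def handle_data(self, data):
        self.result.append(data)

def strip_html(text):
    """Return text without tags, whitespace runs collapsed to one space."""
    parser = Cleaner()
    parser.feed(text)
    parser.close()
    # split() with no argument drops leading/trailing whitespace and
    # treats newlines like spaces, which resolves the leading-<P> issue.
    return ' '.join(''.join(parser.result).split())

if __name__ == '__main__':
    data = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
it<BR>The End"""
    print(strip_html(data))
```

Whether collapsing whitespace this aggressively is right depends on your data; if line breaks in the source text matter to you, normalize only the ends with strip() instead.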