Newby: How do I strip HTML tags?

Fri Jun 7 12:12:21 EDT 2002

netvegetable wrote:

> I'm mucking around with cgi, and I'm trying to work out a way to strip the
> html tags of a string. e.g, I want to convert this...
> 
> ><font size = 12><b><big>Really Big String</big></b></font>
> 
> to this ...
> 
> >Really Big String
> 
> ... and store it as a value.
> 
> I worked out a crude, but effective way of doing it (see code below), but I
> can't escape the feeling there must be a built in way of doing it more. If
> nothing else, I'm sure somebody who knows their regular expressions could
> neaten it up (please?).
> 
> def strip_html_tags(it):
> 	left = it[:(len(it)//2)]
> 	right = it[(len(it)//2):]
> 	final = left[left.rfind('>')+1:] + right[:right.find('<')]
> 	return final
> 

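If your HTML is reasonably well-formed, you can use something along the lines of the following quick-and-dirty program. It reads the file named on the command line and writes the text that follows each tag to tmp.tmp, one piece per line: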

import re
import sys

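# Read the whole HTML file named on the command line.
with open(sys.argv[1]) as f:
    s = f.read()

# Match a comment first; otherwise a tag followed by the text up to the next '<'.
r = re.compile(r'(<!--.*?-->)|(<[^>]*>)([^<]+)', re.DOTALL)
with open('tmp.tmp', 'w') as o:
    for x, y, z in r.findall(s):
        if z and not z.isspace():   # skip comments and whitespace-only content
            o.write(z + '\n')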
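
If you want the stripped text back as a value rather than in a file, one possible variation is to substitute the matches away instead of collecting the text after them. This is only a sketch, not the program above: it uses re.sub with a simplified pattern (the name TAG_OR_COMMENT is made up for illustration), and it assumes current Python 3.

```python
import re

# Comments are listed first so that a '>' inside a comment does not end the match early.
# TAG_OR_COMMENT is an illustrative name, not part of the original program.
TAG_OR_COMMENT = re.compile(r'<!--.*?-->|<[^>]*>', re.DOTALL)

def strip_html_tags(text):
    """Return text with HTML comments and tags removed."""
    return TAG_OR_COMMENT.sub('', text)

print(strip_html_tags('<font size = 12><b><big>Really Big String</big></b></font>'))
# prints: Really Big String
```

Unlike the findall version, this keeps any text that appears before the first tag, but it is just as easily fooled by badly formed HTML.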

Note that the pattern has to test for HTML comments first, because a comment can contain a '>' character, which the plain tag pattern would otherwise treat as the end of a tag.
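
A small comparison may make the point about comments concrete. This is a sketch assuming Python 3; the sample string and the variable names are invented for illustration. It runs the pattern from the program above next to the same pattern with the comment alternative removed.

```python
import re

# Invented sample: a comment containing '>' sits between two paragraphs.
html = "<p>before</p><!-- a > b --><p>after</p>"

# The pattern from the program above: the comment alternative is tried first.
comment_first = re.compile(r'(<!--.*?-->)|(<[^>]*>)([^<]+)', re.DOTALL)
# The same pattern without the comment alternative.
tag_only = re.compile(r'(<[^>]*>)([^<]+)', re.DOTALL)

with_comment_test = [z for _, _, z in comment_first.findall(html)
                     if z and not z.isspace()]
without_comment_test = [z for _, z in tag_only.findall(html)
                        if z and not z.isspace()]

print(with_comment_test)     # ['before', 'after']
print(without_comment_test)  # ['before', ' b -->', 'after']
```

Without the comment test, the tag pattern stops at the '>' inside the comment and the rest of the comment leaks into the output as if it were content.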
