Newby: How do I strip HTML tags?

Harvey Thomas hst at empolis.co.uk
Fri Jun 7 12:17:44 EDT 2002


Harvey Thomas wrote

> netvegetable wrote
> 
> > I'm mucking around with cgi, and I'm trying to work out a way 
> > to strip the 
> > html tags from a string. E.g., I want to convert this...
> > 
> > ><font size = 12><b><big>Really Big String</big></b></font>
> > 
> > to this ...
> > 
> > >Really Big String
> > 
> > ... and store it as a value.
> > 
> > I worked out a crude, but effective way of doing it (see code 
> > below), but I 
> > can't escape the feeling there must be a built-in way of 
> > doing it. If 
> > nothing else, I'm sure somebody who knows their regular 
> > expressions could 
> > neaten it up (please?).
> > 
> > def strip_html_tags(it):
> > 	left = it[:(len(it)//2)]
> > 	right = it[(len(it)//2):]
> > 	final = left[left.rfind('>')+1:] + right[:right.find('<')]
> > 	return final
> > 
> 
> If your HTML is reasonably legal, then you can use something 
> along the lines of the following very quick and very dirty program:
> 
> import re
> import sys
> 
> s = open(sys.argv[1]).read()
> o = open('tmp.tmp', 'w')
> r = re.compile('(<!--.*?-->)|(<[^>]*>)([^<]+)', re.DOTALL)
> for x, y, z in r.findall(s):
>     # don't use comments, tags and white-space only content
>     if z and not z.isspace():
>         print >>o, z
> 
> Note that you have to test first for HTML comments as a 
> comment can contain a '>' character.
> 
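
To illustrate the point about comments, here is a minimal sketch (the sample string and the names TAG_ONLY and COMMENT_FIRST are made up for the illustration, not part of the program above). A tag-only pattern stops at the first '>' inside the comment and leaves the rest of the comment behind, whereas trying the comment alternative first consumes the whole comment:

```python
import re

# Hypothetical sample: a comment that contains a '>' character.
html = "<!-- a > b --><p>text</p>"

# Tag-only pattern: treats "<!-- a >" as a complete tag.
TAG_ONLY = re.compile(r'<[^>]*>')

# Comment alternative first, then ordinary tags.
COMMENT_FIRST = re.compile(r'<!--.*?-->|<[^>]*>', re.DOTALL)

print(repr(TAG_ONLY.sub('', html)))       # ' b -->text'
print(repr(COMMENT_FIRST.sub('', html)))  # 'text'
```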

Sorry, the RE should read
r = re.compile('(<!--.*?-->)|(<[^>]*>)|([^<]+)', re.DOTALL)
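
For anyone who wants the result as a string rather than a file, the corrected RE could be wrapped up roughly as follows. This is only a sketch in current Python 3 syntax; the function name is borrowed from the original question, TOKEN_RE is a name invented here, and joining the text runs with no separator is an assumption about what is wanted:

```python
import re

# Corrected pattern: comment | tag | text, tried in that order.
TOKEN_RE = re.compile(r'(<!--.*?-->)|(<[^>]*>)|([^<]+)', re.DOTALL)

def strip_html_tags(s):
    """Return the non-blank text runs of s, joined with no separator."""
    pieces = []
    for comment, tag, text in TOKEN_RE.findall(s):
        # Only the third group holds text; skip whitespace-only runs.
        if text and not text.isspace():
            pieces.append(text)
    return ''.join(pieces)

print(strip_html_tags('<font size = 12><b><big>Really Big String</big></b></font>'))
# prints: Really Big String
```

As before, this assumes reasonably legal HTML; it is not a substitute for a real parser on messy input.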

Harvey

More information about the Python-list mailing list