Newby: How do I strip HTML tags?
Harvey Thomas
hst at empolis.co.uk
Fri Jun 7 12:17:44 EDT 2002
Harvey Thomas wrote
> netvegetable wrote
>
> > I'm mucking around with CGI, and I'm trying to work out a way to strip
> > the HTML tags from a string. E.g., I want to convert this...
> >
> > ><font size = 12><b><big>Really Big String</big></b></font>
> >
> > to this ...
> >
> > >Really Big String
> >
> > ... and store it as a value.
> >
> > I worked out a crude but effective way of doing it (see code below),
> > but I can't escape the feeling there must be a built-in way of doing
> > it. If nothing else, I'm sure somebody who knows their regular
> > expressions could neaten it up (please?).
> >
> > def strip_html_tags(it):
> >     left = it[:(len(it)/2)]
> >     right = it[(len(it)/2):]
> >     final = left[left.rfind('>')+1:] + right[:right.find('<')]
> >     return final
> >
>
> If your HTML is reasonably legal, then you can use something
> along the lines of the following very quick and very dirty program:
>
> import re
> import sys
>
> s = open(sys.argv[1]).read()
> o = open('tmp.tmp', 'w')
> r = re.compile('(<!--.*?-->)|(<[^>]*>)([^<]+)', re.DOTALL)
> for x, y, z in r.findall(s):
>     if z and not z.isspace():  # don't use comments, tags and white-space-only content
>         print >>o, z
>
> Note that you have to test first for HTML comments as a
> comment can contain a '>' character.
>
Sorry, the RE should read (with a '|' before the last group, so that text is matched as its own alternative):
r = re.compile('(<!--.*?-->)|(<[^>]*>)|([^<]+)', re.DOTALL)
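Put together with the corrected RE, the approach can be sketched as below. This is only a rough sketch under some assumptions: it is written for Python 3 rather than the Python of this thread, it returns a joined string instead of printing each text run to a file, and the names TAG_OR_TEXT and strip_html_tags are just illustrative.

```python
import re

# The comment alternative comes first because a comment body may contain '>',
# which would otherwise end the generic tag alternative too early.
TAG_OR_TEXT = re.compile(r'(<!--.*?-->)|(<[^>]*>)|([^<]+)', re.DOTALL)


def strip_html_tags(html):
    """Return the text runs of html, skipping comments, tags and blank runs."""
    pieces = []
    for comment, tag, text in TAG_OR_TEXT.findall(html):
        # Exactly one group is non-empty per match; keep only real text.
        if text and not text.isspace():
            pieces.append(text)
    return ''.join(pieces)


print(strip_html_tags('<font size = 12><b><big>Really Big String</big></b></font>'))
```

As before, this only behaves sensibly on reasonably legal HTML; a stray '<' in the text will confuse it.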
Harvey
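
On the original question of a built-in way: the standard library's HTML parser can do the tag handling instead of a regular expression. The following is a minimal sketch, assuming Python 3 (where the module is html.parser); it is not what was used above, and TextExtractor and strip_tags_with_parser are made-up names for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content; tags and comments are simply not recorded."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip white-space-only runs.
        if not data.isspace():
            self.chunks.append(data)


def strip_tags_with_parser(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return ''.join(parser.chunks)


print(strip_tags_with_parser('<font size = 12><b><big>Really Big String</big></b></font>'))
```

The parser is generally more forgiving of odd markup than the RE, at the cost of a little more code.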