how to get rid of html tags
kokohh at hotmail.com
Fri Oct 4 00:07:34 CEST 2002
Thanks a lot. It works well when a tag sits entirely on one line.
But if a tag is broken across several lines, it does not work.
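A likely reason, as far as I can tell: by default "." in a regular expression does not match a newline, so r'<.*?>' never sees the closing ">" of a tag that wraps onto the next line. Below is a minimal sketch of one way around that, passing re.DOTALL so "." also matches newlines. The names TAG_RE and strip_tags are only illustrative, not from the suggestion quoted below.

```python
import re

# re.DOTALL makes "." match newlines too, so a tag that is split
# across several lines is still matched as one unit.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags(page):
    """Remove anything that looks like <...>, even when it spans lines."""
    return TAG_RE.sub('', page)

if __name__ == '__main__':
    sample = '<a\n   href="x.html"\n>link</a> text'
    print(strip_tags(sample))
```

Writing the pattern as r'<[^>]*>' should behave the same way without the flag, since a negated character class matches newlines. Either form will still be confused by a literal ">" inside an attribute value or by unquoted "<" in the text.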
"Cameron Laird" <claird at lairds.org> wrote in message
news:anhj3t$mg9$1 at lairds.org...
> In article <mailman.1033619587.32128.python-list at python.org>,
> Ian Bicking <ianb at colorstudy.com> wrote:
> >The easy answer:
> >page = re.sub(r'<.*?>', '', page)
> >There may be more Correct answers, though. (Some HTML has unquoted <>
> >characters, which browsers accept even though it's super annoying to
> >parse -- but I don't know that htmllib parses improper HTML either)
> >On Wed, 2002-10-02 at 20:04, koko wrote:
> >> I am trying to retrieve a web page.
> >> But I only want to keep the content of the webpage without the html tags.
> >> How can I parse the webpage to get rid of the tags?
> People answer this question in *dozens* of different
> ways. Perhaps the most satisfying to koko will be
> dialectically. Does, for example, command-line
> lynx -dump $URL > $RESULT
> meet all your requirements?
> Cameron Laird <Cameron at Lairds.com>
> Business: http://www.Phaseit.net
> Personal: http://starbase.neosoft.com/~claird/home.html
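On the "more Correct answers" point quoted above: a parser-based approach avoids the line-break problem altogether. This is only a sketch, and it assumes Python 3, where the standard library module is html.parser rather than the htmllib mentioned in the quote. TextExtractor and html_to_text are made-up names for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.chunks.append(data)

def html_to_text(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

if __name__ == '__main__':
    print(html_to_text('<p\n  class="x">Hello <b>world</b></p>'))
```

Note that this keeps the contents of script and style elements as text as well, so a real page would probably need extra handling for those.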