how to get rid of html tags

R.Marquez ny_r_marquez at yahoo.com
Thu Oct 3 13:52:39 EDT 2002


"koko" <kokohh at hotmail.com> wrote in message news:<AIMm9.1440$XX3.895043 at newssrv26.news.prodigy.com>...
> I am trying to retrieve a web page.
> But I only want to keep the content of the webpage without the html tags.
> How can I  parse the webpage to get rid of the tags?

The WeaselWeb program has a Python module called htm2txt.py that may be
useful to you.
To test it, simply type at the command line:

    python htm2txt.py "Some Page.htm"

The module WeaselWeb.py has a couple of very simple methods for
downloading the page (one with IE + COM and the other with urllib and
urlparse).
Download the source version of WeaselWeb to get at them.

http://sourceforge.net/project/showfiles.php?group_id=9595&release_id=105094

(But, if you have a Palm Pilot you may enjoy the binary one ;).
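
If you would rather not pull in another program, something along these lines may be enough for simple pages. This is only a rough sketch using the standard library of current Python 3 (html.parser and urllib.request); it is not the code from htm2txt.py or WeaselWeb.py, and the names TextExtractor, strip_tags and fetch_text are made up for illustration. It assumes you want the text of <script> and <style> blocks dropped, and that the page can be decoded as UTF-8 unless you say otherwise.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML string with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space, then collapse runs of whitespace.
    # Simplification: a tag in the middle of a word will split that word.
    return " ".join(" ".join(parser.parts).split())


def fetch_text(url, encoding="utf-8"):
    """Download a page and return its text (encoding is an assumption)."""
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return strip_tags(html)
```

For example, strip_tags("<p>Hello &amp; <b>world</b></p>") gives "Hello & world". Real-world pages with broken markup or odd encodings will likely need more care than this.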

-Ruben


