Q: how to extract only the text from an HTML file?

Alex Martelli aleaxit at yahoo.com
Wed Nov 1 16:16:57 EST 2000


"Gerrit Holl" <gerrit at NOSPAM.nl.linux.org> wrote in message
news:slrn90061d.11i.gerrit at stopcontact.palga.uucp...
> On Tue, 31 Oct 2000 13:50:54 -0600, Hwanjo Yu wrote:
> > Could someone please tell me how to get rid of all the tags in an HTML file?
> > It seems that htmllib.HTMLParser is not helpful for doing it.
>
> Maybe you should have a look at regular expressions, the re module.
> A great deal is possible with it. Have you had a look at it?

I think htmllib (a solution based on it has already been posted)
is a much better way to handle HTML than trying to do it with
re's.  HTML syntax is not parsable with re's (nested elements,
comments, and quoted attribute values containing '>' all defeat
a simple pattern), while htmllib does a decent job of it.
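
A minimal sketch of the parser-based approach, for illustration only.
It is not the solution posted earlier in the thread. htmllib was
removed in Python 3, so this assumes the standard-library html.parser
module instead. The names TextExtractor and html_to_text are
hypothetical, and dropping the contents of <script> and <style> is an
added assumption about what "only text" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; ignore tags and script/style contents."""

    # Elements whose contents are code, not readable text (assumption).
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns &amp; etc. into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to one space.
        return " ".join("".join(self._chunks).split())


def html_to_text(markup):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b> &amp; co.</p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

The parser hands each run of text to handle_data, so tags never have
to be matched with a pattern at all.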


Alex





