Q: how to extract only text from a html ?

Moshe Zadka moshez at math.huji.ac.il
Thu Nov 2 03:09:48 EST 2000


On Thu, 2 Nov 2000, Fredrik Lundh wrote:

> Alex wrote:
> > I think htmllib (a solution based on which has already been
> > posted) is a much better idea to handle HTML, than trying to
> > do it with re's.  HTML syntax is not parsable with re's,  while
> > htmllib does a decent job of it, I think.
> 
> footnote: htmllib (or rather, sgmllib) uses regular expressions
> to parse HTML (SGML).  maybe you meant "cannot be parsed
> with a single re"?
> 
> (on the other hand, you can parse XML with a single RE, and
> I don't see why you cannot use a similar technique to parse
> HTML...)
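
For the original question (pulling only the text out of an HTML page), here is a minimal sketch of the parser-based approach discussed above. It assumes a modern Python, where htmllib no longer exists and html.parser.HTMLParser plays that role. The names TextExtractor and extract_text are made up for illustration, and skipping script/style contents is my own choice, not something from the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}  # assumption: these never count as "text"

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the text of an HTML string with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(extract_text("<p>Hello <b>world</b> &amp; more</p>"))
```

Collapsing whitespace is a simplification; text inside <pre> would need different handling.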

You can probably tokenize (not parse) XML with a single RE. There are
probably fewer than 3 guys in the world who can write that RE *correctly*.
Perhaps it's now easier with everything you added to SRE, though.
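
To make the tokenize/parse distinction concrete, here is a rough sketch of a single-RE tokenizer. It is deliberately NOT one of the *correct* ones: it knows nothing about DOCTYPE declarations, entity references, or XML name rules, and it raises ValueError on input it cannot match. TOKEN_RE and tokenize are names invented for this example.

```python
import re

# One alternation, one named group per token kind. Simplified on purpose:
# no DOCTYPE, no entity handling, no validation of tag or attribute names.
TOKEN_RE = re.compile(
    r"""
    (?P<comment><!--.*?-->)
  | (?P<pi><\?.*?\?>)
  | (?P<cdata><!\[CDATA\[.*?\]\]>)
  | (?P<end_tag></[^>]+>)
  | (?P<start_tag><[^<>!?/](?:"[^"]*"|'[^']*'|[^'">])*>)
  | (?P<text>[^<]+)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(markup):
    """Yield (kind, lexeme) pairs; raise ValueError on unmatched input."""
    pos = 0
    for match in TOKEN_RE.finditer(markup):
        if match.start() != pos:  # finditer silently skips what it cannot match
            raise ValueError("cannot tokenize at offset %d" % pos)
        yield match.lastgroup, match.group()
        pos = match.end()
    if pos != len(markup):
        raise ValueError("cannot tokenize at offset %d" % pos)


if __name__ == "__main__":
    for kind, lexeme in tokenize('<a href="x>y">hi<!-- c --></a>'):
        print(kind, repr(lexeme))
    # Mismatched tags tokenize without complaint: checking nesting is the
    # parser's job, and a plain RE cannot do it.
    print([kind for kind, _ in tokenize("<a></b>")])
```

Note that the quoted-attribute alternatives are what keep the ">" inside href="x>y" from ending the tag early; that is the sort of detail that is easy to get wrong.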
--
Moshe Zadka <moshez at math.huji.ac.il> -- 95855124
http://advogato.org/person/moshez




