How do I convert escaped HTML into a string?
Stefan Behnel
stefan.behnel-n05pAM at web.de
Sat Nov 24 00:58:42 EST 2007
Just Another Victim of the Ambient Morality wrote:
> I've done a google search on this but, amazingly, I'm the first guy to
> ever need this!
You cannot infer that from a Google search.
> So, how do I convert HTML to plaintext? Something like this:
>
> <div>This is a string.</div>
>
> ...into:
>
> This is a string.
>
> Actually, the ideal would be a function that takes an HTML string and
> converts it into a string that the HTML would correspond to. For instance,
> converting:
>
> <div>This &amp; that
> or the other thing.</div>
>
> ...into:
>
> This & that or the other thing.
>
> ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).
So what you want to do is parse HTML and extract the text content. There are
quite a few ways to do that, including lxml.html:
http://codespeak.net/lxml/dev/lxmlhtml.html
>>> htmldata = """<div>This &amp; that
... or the other thing.</div>"""
>>> from lxml import html
>>> print(html.fragment_fromstring(htmldata).text_content())
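If installing lxml is not an option, roughly the same result can be had with
the standard library. The following is only a minimal sketch, not the lxml
approach shown above: it assumes Python 3 and uses html.parser.HTMLParser,
and the names TextExtractor and html_to_text are made up for illustration.
It also collapses whitespace runs into single spaces, as asked for in the
question. Note that it keeps the contents of <script> and <style> elements
too, so it is only suitable for simple fragments.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes and ignore all tags."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)


def html_to_text(fragment):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    # split() with no arguments splits on any whitespace run, including newlines
    return " ".join("".join(parser.chunks).split())


print(html_to_text("<div>This &amp; that\nor the other thing.</div>"))
```

The whitespace step is plain string handling, so the same " ".join(text.split())
idiom could be applied to whatever text_content() returns from lxml as well.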
Stefan