How do I convert escaped HTML into a string?

Sergio Correia sergio.correia at gmail.com
Sat Nov 24 07:02:17 CET 2007


This may help:

http://effbot.org/zone/re-sub.htm#strip-html

Keep in mind that going from HTML to plain text involves several separate issues:

1) <p> What should <b>we</b>do about<br />this?</p>
You need to strip all the tags.

2) &quot;, &amp;, &lt;, &gt; and I could keep going. All of those
entities need to be converted back to the characters they stand for.

3) Runs of whitespace (tabs, newlines, etc.) need to be collapsed.
(Maybe breaks should be treated as newlines in the new text?)

The link above solves several of these issues, so it can serve as a good
starting point.
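
To make the three steps above concrete, here is a minimal sketch using
only the standard library. It assumes Python 3 (html.parser with
convert_charrefs), and the names html_to_text and _TextExtractor are made
up for illustration. It is not the code from the link above. Treating
<br> as a line break that then collapses into a single space is one
possible choice, not the only reasonable one.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode &amp;, &lt;, &quot;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: a <br> separates words, so record it as a newline.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags, decode entities, and collapse whitespace runs."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(html_to_text("<div>This &amp;    that\nor the other thing.</div>"))
    print(html_to_text("<p> What should <b>we</b>do about<br />this?</p>"))
```

Note that tags other than <br> are removed without inserting a space, so
"<b>we</b>do" comes out as "wedo". Whether that is what you want depends
on the markup you are dealing with.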

Best,
Sergio


On Nov 24, 2007 12:42 AM, Just Another Victim of the Ambient Morality
<ihatespam at hotmail.com> wrote:
>     I've done a Google search on this but, amazingly, I'm the first guy to
> ever need this!  Everyone else seems to need the reverse of this.  Actually,
> I did find some people who complained about this and rolled their own
> solution but I refuse to believe that Python doesn't have a built-in
> solution to what must be a very common problem.
>     So, how do I convert HTML to plaintext?  Something like this:
>
>
> <div>Thisisastring.</div>
>
>
>     ...into:
>
>
> This is a string.
>
>
>     Actually, the ideal would be a function that takes an HTML string and
> converts it into the string that the HTML would correspond to.  For instance,
> converting:
>
>
> <div>This &    that
> or the other thing.</div>
>
>
>     ...into:
>
>
> This & that or the other thing.
>
>
>     ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).
>     Surely, Python can already do this, right?
>     Thank you...
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>


