How to convert markup text to plain text in python?
Tim Chase
python.list at tim.thechases.com
Fri Feb 1 11:54:19 EST 2008
>> Well, if all you want to do is remove everything from a "<" to a
>> ">", you can use
>>
>> >>> s = "<B>Today</B> is <U>Friday</U>"
>> >>> import re
>> >>> r = re.compile('<[^>]*>')
>> >>> print(r.sub('', s))
>> Today is Friday
>>
[Tim's ramblings about pathological cases snipped]
>
> The real answer to this question is "learn how to use Beautiful Soup" --
> see http://www.crummy.com/software/BeautifulSoup/
Yes, for more pathological cases, Beautiful Soup (BS) does a great
job of parsing junk :)
However, BS isn't batteries-included. [Aside: BS and pyparsing
are two common solutions to problems that would make great
additions to the standard library.] Using a regular expression to
make a best-effort guess is a good first approximation of a
solution without needing to download extra packages--no matter
how useful those extra packages may be.
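
For what it's worth, here is a rough stdlib-only sketch of that
best-effort approach, written for Python 3. The names strip_tags and
TAG_RE are made up for illustration, and decoding entities with
html.unescape is an extra step I'm assuming is wanted; it is not part
of the one-liner quoted above. The last call shows one of the
pathological cases where the regex guesses wrong.

```python
import re
from html import unescape

# Matches anything from "<" up to the next ">".
# This is a best-effort guess, not a real parser.
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(markup):
    """Drop tag-like spans, then decode entities such as &amp;."""
    return unescape(TAG_RE.sub('', markup))

# Hypothetical helper name; the simple cases work as expected.
print(strip_tags("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
print(strip_tags("Fish &amp; <i>chips</i>"))        # Fish & chips

# Pathological case: a ">" inside an attribute value ends the match
# early, so part of the tag leaks into the output.
print(repr(strip_tags('<a title="a > b">link</a>')))  # ' b">link'
```

If input like that last line is likely, that is the point where a real
parser starts to earn its download.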
-tkc