How to convert markup text to plain text in Python?

Tim Chase python.list at tim.thechases.com
Fri Feb 1 11:54:19 EST 2008


>> Well, if all you want to do is remove everything from a "<" to a
>> ">", you can use
>>
>>   >>> s = "<B>Today</B> is <U>Friday</U>"
>>   >>> import re
>>   >>> r = re.compile(r'<[^>]*>')
>>   >>> print(r.sub('', s))
>>   Today is Friday
>>
[Tim's ramblings about pathological cases snipped]
>
> The real answer to this question is "learn how to use Beautiful Soup" -- 
> see http://www.crummy.com/software/BeautifulSoup/

Yes, for more pathological cases, BS does a great job of parsing
junk :)

However, BS isn't batteries-included, so using a regex to make a
best-effort guess is a good first approximation of a solution
that needs no extra downloads, no matter how useful those extra
packages may be. [Aside: BS and pyparsing both solve common
problems and would make great additions to the standard library.]
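
For the archives, here is a minimal sketch of that best-effort
approach using only the standard library. It reuses the pattern
from the quoted example; the names TAG_RE and strip_tags are
hypothetical, chosen just for illustration. The second call shows
one kind of input where the guess goes wrong.

```python
import re

# Same pattern as the quoted example: a "<", any run of
# characters other than ">", then the closing ">".
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Best-effort removal of anything that looks like a tag."""
    return TAG_RE.sub('', text)

# Well-behaved markup comes out clean.
print(strip_tags("<B>Today</B> is <U>Friday</U>"))

# A ">" inside an attribute value ends the match too early,
# so part of the tag leaks into the output.
print(strip_tags('<a title="x > y">link</a>'))
```

The first call prints "Today is Friday"; the second leaves a
fragment of the opening tag behind, which is the sort of input
where a real parser earns its keep.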

-tkc





