How to convert markup text to plain text in python?
steve at holdenweb.com
Fri Feb 1 17:43:55 CET 2008
Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>> I've looked around a bit but failed to find anything, any tips?
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print(r.sub('', s))
> Today is Friday
> It should even work for semi-pathological cases such as
> s = """You can find my <a
> > online"""
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
> s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
> in which case you get what you deserve for making such
> pathological conditions ;-)
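To make the behaviour described above concrete, here is a small sketch of the same regular expression applied to the three kinds of input mentioned. The function name strip_tags_regex and the multi-line sample string are made up for illustration; only the pattern and the other two strings come from the post.

```python
import re

# Same pattern as in the post: a "<", any run of non-">" characters, then ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags_regex(text):
    """Remove everything from a "<" up to the next ">" (hypothetical helper)."""
    return TAG.sub("", text)


# The simple case from the original question.
print(strip_tags_regex("<B>Today</B> is <U>Friday</U>"))
# Today is Friday

# A tag split across two lines (sample string invented for this sketch).
# [^>] also matches newlines, so the tag is still removed in one piece.
multiline = 'my <a\nhref="page.html">home page</a> online'
print(strip_tags_regex(multiline))
# my home page online

# The malformed case from the post: the stray ">" characters survive and
# "<odd<ly>" is swallowed as if it were a single tag.
malformed = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
print(strip_tags_regex(malformed))
# This has a > sign in it and -nested> tags
```

The last result shows why the regex is only safe when the markup is known to be well-formed.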
The real answer to this question is "learn how to use Beautiful Soup".
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
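The advice above is to use a real parser instead of a regular expression. Beautiful Soup is a third-party package, so the sketch below swaps in the standard library's html.parser module to show the same idea with nothing to install. It is not Beautiful Soup's API, and the names TextExtractor and strip_tags are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep the text the parser reports and ignore every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves never reach here.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)


print(strip_tags("<B>Today</B> is <U>Friday</U>"))
# Today is Friday
```

Because the parser tokenizes the markup itself, it also decodes character references, which the regular expression leaves untouched.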