How to convert markup text to plain text in Python?
Steve Holden
steve at holdenweb.com
Fri Feb 1 11:43:55 EST 2008
Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
>
>
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
>
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print r.sub('', s)
> Today is Friday
>
> it should even work for semi-pathological cases such as
>
> s = """You can find my <a
> href='http://example.com'>thesis</a
> > online"""
>
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
>
> s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
>
> in which case you get what you deserve for making such
> pathological conditions ;-)
>
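For anyone trying this on a current interpreter, here is a minimal sketch of the same regex approach in Python 3 syntax, run against the three strings quoted above. The names TAG_RE and strip_tags are made up for illustration; they are not from Tim's post.

```python
import re

# "<", then any run of characters that are not ">" (newlines included), then ">".
# TAG_RE and strip_tags are hypothetical names used only for this sketch.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Remove everything from a "<" up to the next ">"."""
    return TAG_RE.sub("", text)


simple = "<B>Today</B> is <U>Friday</U>"
print(strip_tags(simple))      # Today is Friday

# Tags split across lines still match, because [^>] also matches "\n".
multiline = """You can find my <a
href='http://example.com'>thesis</a
> online"""
print(strip_tags(multiline))   # You can find my thesis online

# Malformed markup: the stray ">" survives and part of the text is lost.
odd = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
print(strip_tags(odd))         # This has a > sign in it and -nested> tags
```

The last case shows the limit of the pattern: it has no notion of nesting or of a literal ">" in the text, so it simply deletes the shortest "<...>" runs it finds.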
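The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/
It is a parser built to cope with badly formed markup, so it handles
the pathological cases that a regular expression cannot.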
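Beautiful Soup is a third-party package, so as a rough stand-in here is a sketch of the same parser-based idea using only the standard library's html.parser module (Python 3). This is a swapped-in technique, not Beautiful Soup itself, and the names TextExtractor and to_plain_text are invented for the example. With Beautiful Soup 4 installed, something like BeautifulSoup(markup, "html.parser").get_text() should do a similar job.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into "&".
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)


def to_plain_text(markup):
    """Return the markup with tags dropped and character references decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(to_plain_text("<B>Today</B> is <U>Friday</U>"))   # Today is Friday
print(to_plain_text("<p>Tom &amp; Jerry</p>"))          # Tom & Jerry
```

Unlike the regex, a parser also decodes character references, which matters if the plain text is meant to be read by people.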
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/