How to convert markup text to plain text in Python?

Steve Holden steve at holdenweb.com
Fri Feb 1 11:43:55 EST 2008


Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
> 
> 
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use:
> 
>   >>> s = "<B>Today</B> is <U>Friday</U>"
>   >>> import re
>   >>> r = re.compile('<[^>]*>')
>   >>> print(r.sub('', s))
>   Today is Friday
> 
> It should even work for semi-pathological cases such as
> 
>   s = """You can find my <a
>    href='http://example.com'>thesis</a
>    > online"""
> 
> where the tag contents are split across lines.  There are more
> pathological cases where tags aren't well-formed, e.g.
> 
>   s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
> 
> in which case you get what you deserve for making such
> pathological conditions ;-)
> 
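To make the three cases above concrete, here is a minimal sketch that wraps the same pattern in a function and runs it on each string. It assumes Python 3 syntax, and the names TAG, strip_tags, simple, multiline and malformed are made up for illustration; they are not from the original message.

```python
import re

# Same pattern as in the quoted message: "<", any run of non-">" characters
# (newlines included), then ">".
TAG = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Remove every "<...>" run from text. Naive: this is not a real parser."""
    return TAG.sub('', text)

simple = "<B>Today</B> is <U>Friday</U>"
multiline = """You can find my <a
   href='http://example.com'>thesis</a
   > online"""
malformed = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

print(strip_tags(simple))     # Today is Friday
print(strip_tags(multiline))  # You can find my thesis online
print(strip_tags(malformed))  # This has a > sign in it and -nested> tags
```

The last line shows where the regex gives up: the stray ">" survives, and "<odd<ly>" is swallowed as if it were a single tag, leaving "-nested>" behind.
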
The real answer to this question is "learn how to use Beautiful Soup" -- 
see http://www.crummy.com/software/BeautifulSoup/
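
Beautiful Soup is a third-party package, so as a rough illustration of the parser-based approach here is a sketch that uses only the standard library's html.parser module instead. This is a stand-in, not Beautiful Soup's API, and it assumes Python 3; the names TextExtractor and to_plain_text are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.parts.append(data)

def to_plain_text(markup):
    """Return the text content of markup with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(to_plain_text("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
print(to_plain_text("Fish &amp; chips"))               # Fish & chips
```

Unlike the regex, a parser also decodes character entities, which is usually what you want when the goal is readable plain text.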

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



