How to convert markup text to plain text in python?

Tim Chase python.list at tim.thechases.com
Fri Feb 1 11:27:28 EST 2008


> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> 
> I've looked around a bit but failed to find anything, any tips?
> 
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


Well, if all you want to do is remove everything from a "<" to a
">", you can use

  >>> s = "<B>Today</B> is <U>Friday</U>"
  >>> import re
  >>> r = re.compile(r'<[^>]*>')
  >>> print(r.sub('', s))
  Today is Friday
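
If you want that as a reusable helper, a minimal sketch in Python 3 might look like the following. The names TAG_RE and strip_tags are made up for illustration, and it makes no attempt to decode entities such as &amp; or to cope with malformed markup.

```python
import re

# Anything from a "<" up to the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Return text with every <...> run removed (hypothetical helper)."""
    return TAG_RE.sub("", text)


print(strip_tags("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
```

Compiling the pattern once at module level keeps repeated calls cheap.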

It even works for semi-pathological cases such as

  s = """You can find my <a
    href='http://example.com'>thesis</a
    > online"""

where the tag contents are split across lines, because [^>] matches
newlines too.  There are more pathological cases where tags aren't
well-formed, e.g.

  s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

in which case stray ">" characters and tag fragments are left
behind, and you get what you deserve for making such pathological
conditions ;-)
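
To see what that looks like in practice, here is a rough sketch (Python 3) that runs the same regex over both strings above, plus an alternative that leans on the standard library's html.parser instead. The names TextOnly and strip_with_parser are invented for illustration, and how the parser copes with badly broken markup can vary between Python versions, so treat it as a starting point rather than a guarantee.

```python
import re
from html.parser import HTMLParser

TAG_RE = re.compile(r"<[^>]*>")

split_tag = """You can find my <a
   href='http://example.com'>thesis</a
   > online"""
malformed = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

# Tags split across lines are removed cleanly.
print(TAG_RE.sub("", split_tag))   # You can find my thesis online

# The malformed input leaves a stray ">" and a tag fragment behind.
print(TAG_RE.sub("", malformed))   # This has a > sign in it and -nested> tags


class TextOnly(HTMLParser):
    """Collect only the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_parser(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return "".join(parser.parts)


print(strip_with_parser("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
```

Unlike the regex, the parser also decodes entities (e.g. &amp; becomes &) by default in Python 3.5 and later.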

-tkc





