convert xhtml back to html

Thu Apr 24 12:46:15 EDT 2008

Arnaud Delobelle wrote:
> "Tim Arnold" <tim.arnold at sas.com> writes:
> 
>> Hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
>> create CHM files. That application really hates xhtml, so I need to convert
>> self-closing tags (e.g. <br />) to plain html (e.g. <br>).
>>
>> Seems simple enough, but I'm having some trouble with it. Regexps trip up
>> because I also have to take into account 'img', 'meta', 'link' tags, not
>> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
>> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
>> enough of a regexp pro to figure out that lookahead stuff.
> 
> Hi, I'm not sure if this is very helpful but the following works on
> the very simple example below.
> 
> >>> import re
> >>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
> >>> xtag = re.compile(r'<([^>]*?)/>')
> >>> xtag.sub(r'<\1>', xhtml)
> '<p>hello <img src="/img.png"> spam <br> bye </p>'
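
For what it's worth, <img[^(/>)]+/> fails because parentheses are literal inside [...]: the class excludes '(', ')', '/' and '>' one character at a time, so the '/' in src="/img.png" stops the match. The substitution quoted above avoids that, but it rewrites every self-closed tag, not just the empty ones. Here is a minimal sketch that limits the rewrite to the elements HTML leaves empty. The tag list and the name xhtml_to_html are my own assumptions, not from the thread, and like any regexp approach it assumes no '>' inside an attribute value:

```python
import re

# Assumed list of HTML elements that never have content or an end tag.
VOID_TAGS = ("area", "base", "br", "col", "hr", "img",
             "input", "link", "meta", "param")

# Match <tag ...attributes... /> for the tags above only.
# Group 1 is the tag name, group 2 the attributes (possibly empty).
SELF_CLOSING = re.compile(
    r"<(%s)\b([^>]*?)\s*/>" % "|".join(VOID_TAGS),
    re.IGNORECASE,
)


def xhtml_to_html(text):
    """Drop the trailing slash from self-closed void elements."""
    return SELF_CLOSING.sub(r"<\1\2>", text)


if __name__ == "__main__":
    print(xhtml_to_html('<p>hello <img src="/img.png"/> spam <br /> bye </p>'))
```

Anything outside the list, for example a self-closed <div/>, is left untouched, so those cases would still need separate handling.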

You might try XIST (http://www.livinglogic.de/Python/xist), which parses
the XHTML into a tree and can write it back out as plain HTML.

The code looks like this:

from ll.xist import parsers
from ll.xist.ns import html

xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

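# Parse the XHTML string into an XIST tree.
doc = parsers.parsestring(xhtml)
# xhtml=0 serializes as plain HTML, so empty elements come out
# as <br> rather than <br/>.
print doc.bytes(xhtml=0)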

This outputs:

<p>hello <img src="/img.png"> spam <br> bye </p>

(and a warning that the alt attribute is missing in the img ;))
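
If installing XIST is not an option, the same parse-and-reserialize idea can be sketched with the standard library alone. This is not XIST's API. It is a rough Python 3 illustration built on html.parser, and the class name HtmlWriter, the helper to_plain_html and the VOID set are names I made up for it. It handles only tags, text, entities, comments and doctype declarations:

```python
from html import escape
from html.parser import HTMLParser

# Assumed set of elements that HTML writes without an end tag.
VOID = {"area", "base", "br", "col", "hr", "img",
        "input", "link", "meta", "param"}


class HtmlWriter(HTMLParser):
    """Re-emit parsed markup, writing void elements as <tag>."""

    def __init__(self):
        # Keep entity and character references as written in the input.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # attribute given without a value
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Called for <tag ... />; non-void elements get an explicit end tag.
        self._open(tag, attrs)
        if tag not in VOID:
            self.out.append("</%s>" % tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def to_plain_html(xhtml):
    writer = HtmlWriter()
    writer.feed(xhtml)
    writer.close()
    return "".join(writer.out)


if __name__ == "__main__":
    print(to_plain_html('<p>hello <img src="/img.png"/> spam <br/> bye </p>'))
```

Note that html.parser lowercases tag and attribute names and is forgiving about malformed input, so unlike XIST it will not warn about things like the missing alt attribute.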

Servus,
    Walter