convert xhtml back to html

Tim Arnold tim.arnold at
Thu Apr 24 18:46:17 CEST 2008

"Gary Herron" <gherron at> wrote in message 
news:mailman.130.1209053543.12834.python-list at
> Tim Arnold wrote:
>> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop 
>> to create CHM files. That application really hates xhtml, so I need to 
>> convert self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>> Seems simple enough, but I'm having some trouble with it. regexps trip up 
>> because I also have to take into account 'img', 'meta', 'link' tags, not 
>> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to 
>> do that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. 
>> I'm not enough of a regexp pro to figure out that lookahead stuff.
>> I'm not sure where to start now; I looked at BeautifulSoup and 
>> BeautifulStoneSoup, but I can't see how to modify the actual tag.
>> thanks,
>> --Tim Arnold
>> --
> Whether or not you can find an application that does what you want, I 
> don't know, but at the very least I can say this much.
> You should not be reading and parsing the text yourself!  XHTML is valid 
> XML, and there are lots of ways to read and parse XML with Python. 
> (ElementTree is what I use, but other choices exist.) Once you use an 
> existing package to read your files into an internal tree structure 
> representation, it should be a relatively easy job to traverse the tree to 
> emit the tags and text you want.
> Gary Herron
I agree, and I'd really rather not parse it myself. However, ET will clean up 
the file, dropping the comments that in my case are required as metadata, so 
that won't work as is. Oh, I could get ET to read it and write a new parser; 
I see what you mean. I think I'd need to subclass the parser so that ET 
honors those comments too.
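
A minimal sketch of that route, under some assumptions: it needs Python 3.8 or
later, where xml.etree.ElementTree.TreeBuilder accepts insert_comments=True
(that option did not exist when this thread was written, so no subclass is
needed here). The function name xhtml_to_html is made up for illustration.
The sketch strips the XHTML namespace from tag names so that the "html"
output method recognizes void elements such as br, img, meta and link, and
writes them without a closing slash. Comments outside the root element and
undeclared entities such as &nbsp; are not handled.

```python
import xml.etree.ElementTree as ET


def xhtml_to_html(xhtml_text):
    """Parse XHTML, keep comments, and serialize as plain HTML (sketch)."""
    # insert_comments=True keeps comment nodes in the tree (Python 3.8+).
    builder = ET.TreeBuilder(insert_comments=True)
    parser = ET.XMLParser(target=builder)
    root = ET.fromstring(xhtml_text, parser=parser)

    # Drop the "{namespace}" prefix from element tags; comment nodes have a
    # non-string tag, so they are skipped by the isinstance check.
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]

    # method="html" writes void elements as <br>, <img ...>, and so on.
    return ET.tostring(root, encoding="unicode", method="html")


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        "<!-- meta: keep --><p>a<br />b<img src=\"x.png\" /></p>"
        "</body></html>"
    )
    print(xhtml_to_html(sample))
```

The serializer decides formatting details (attribute quoting, escaping), so
the output should be checked against whatever the CHM compiler expects.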
That's one way to go; I was just hoping for something easier.
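
For the "something easier" route, a regexp can work if the input is well
behaved. The pattern in the original question, <img[^(/>)]+/>, fails because
[^(/>)] is a character class: it excludes each of the characters (, /, > and )
individually, so the match stops at the first / inside a src path. The sketch
below is one possible alternative, not a full parser. The names VOID_TAGS,
SELF_CLOSING and unclose_void_tags are made up for illustration, and it
assumes that attribute values never contain < or >. It leaves comments alone
unless a comment itself contains one of the listed self-closing tags.

```python
import re

# Tags named in the question; extend the tuple as needed.
VOID_TAGS = ("br", "hr", "img", "meta", "link")

# <tag, then any attributes (no angle brackets), optional whitespace, then />
SELF_CLOSING = re.compile(
    r"<(%s)\b([^<>]*?)\s*/>" % "|".join(VOID_TAGS),
    re.IGNORECASE,
)


def unclose_void_tags(text):
    """Rewrite <br /> style tags as <br>, keeping attributes unchanged."""
    return SELF_CLOSING.sub(r"<\1\2>", text)


if __name__ == "__main__":
    sample = '<!-- meta --><p>a<br />b<img src="pics/x.png" alt="x" /></p>'
    print(unclose_void_tags(sample))
    # prints: <!-- meta --><p>a<br>b<img src="pics/x.png" alt="x"></p>
```

The lazy [^<>]*? keeps the match inside a single tag, and \b prevents a
longer tag name that merely starts with "br" or "hr" from matching.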
