convert xhtml back to html
tim.arnold at sas.com
Thu Apr 24 18:46:17 CEST 2008
"Gary Herron" <gherron at islandtraining.com> wrote in message
news:mailman.130.1209053543.12834.python-list at python.org...
> Tim Arnold wrote:
>> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop
>> to create CHM files. That application really hates xhtml, so I need to
>> convert self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>> Seems simple enough, but I'm having some trouble with it. regexps trip up
>> because I also have to take into account 'img', 'meta', 'link' tags, not
>> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to
>> do that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work.
>> I'm not enough of a regexp pro to figure out that lookahead stuff.
>> I'm not sure where to start now; I looked at BeautifulSoup and
>> BeautifulStoneSoup, but I can't see how to modify the actual tag.
>> --Tim Arnold
> Whether or not you can find an application that does what you want, I
> don't know, but at the very least I can say this much.
> You should not be reading and parsing the text yourself! XHTML is valid
> XML, and there are lots of ways to read and parse XML with Python.
> (ElementTree is what I use, but other choices exist.) Once you use an
> existing package to read your files into an internal tree structure
> representation, it should be a relatively easy job to traverse the tree to
> emit the tags and text you want.
> Gary Herron
I agree and I'd really rather not parse it myself. However, ET will clean up
the file which in my case includes some comments required as metadata, so
that won't work. Oh, I could get ET to read it and write a new parser--I see
what you mean. I think I need to subclass so I could get ET to honor those
comments.
That's one way to go, I was just hoping for something easier.
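
A minimal sketch of the ElementTree route, assuming Python 3.8 or later, where
TreeBuilder(insert_comments=True) keeps comments in the tree so no subclass is
needed. The function name xhtml_to_html is hypothetical. Stripping the XHTML
namespace before serializing is an assumption about what HTML Workshop wants,
and method="html" is what makes ET write <br> rather than <br />.

```python
import xml.etree.ElementTree as ET

# Namespace that ElementTree prepends to every tag of an XHTML document.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_html(source: str) -> str:
    """Parse XHTML text and re-serialize it as plain HTML (hypothetical helper)."""
    # insert_comments=True (Python 3.8+) keeps comment nodes inside the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(source, parser=parser)

    # Drop the namespace so tags serialize as <br>, not as a prefixed name.
    for elem in root.iter():
        # Comment nodes have a function as their tag, so skip non-strings.
        if isinstance(elem.tag, str) and elem.tag.startswith(XHTML_NS):
            elem.tag = elem.tag[len(XHTML_NS):]

    # method="html" omits the end tag for empty elements such as
    # br, hr, img, meta and link.
    return ET.tostring(root, encoding="unicode", method="html")


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
        '<meta name="a" content="b" /></head><body>'
        '<!-- meta: keep --><p>x<br />y</p>'
        '<img src="a.png" alt="" /></body></html>'
    )
    print(xhtml_to_html(sample))
```

Known limits of this sketch: comments that sit outside the root element, the
DOCTYPE and the XML declaration are not written back, and the parser raises
ParseError on named entities such as &nbsp; that it has no definition for.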