[XML-SIG] Parsing malformed XHTML
Lars Kellogg-Stedman
lars at larsshack.org
Sat May 20 03:09:36 CEST 2006
Hello all,
There is a document out there on the 'net that appears to be an XHTML document:
<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en"
"http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:v="urn:schemas-microsoft-com:vml">
Great, right? But unfortunately it's malformed in a number of ways
(mismatched tags, tag case problems, unescaped '&' in URLs, etc).
Neither minidom.parseStream() nor
xml.dom.ext.reader.Sax2.Reader.fromStream() will parse it correctly:
xml.sax._exceptions.SAXParseException: foo.html:2:0: syntax error
And even if one gets rid of the bogus doctype declaration, the rest of
the document just makes the parsers fall over:
xml.sax._exceptions.SAXParseException: foo.html:14:53: not
well-formed (invalid token)
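
For anyone who wants to reproduce both kinds of failure without the original file, here is a minimal sketch. It assumes the standard library's xml.sax rather than the PyXML readers named above, and the three snippets and the try_parse helper are made-up stand-ins, not the real document. The exact error text may differ between expat versions.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def try_parse(text):
    """Return None if text is well-formed XML, else the SAXParseException."""
    try:
        # A bare ContentHandler ignores every event; only the errors matter here.
        xml.sax.parseString(text.encode("utf-8"), ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


# Hypothetical stand-ins for the real page, which is not reproduced here.
lowercase_doctype = (
    '<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" '
    '"http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)
unescaped_amp = '<html><a href="page?a=1&b=2">link</a></html>'
escaped_amp = '<html><a href="page?a=1&amp;b=2">link</a></html>'

for name, text in [
    ("lowercase doctype", lowercase_doctype),
    ("unescaped ampersand", unescaped_amp),
    ("escaped ampersand", escaped_amp),
]:
    err = try_parse(text)
    if err is None:
        print(name, "-> parsed")
    else:
        print(name, "-> line", err.getLineNumber(), err.getMessage())
```

The first two snippets should be rejected and the third accepted. XML requires the keyword to be spelled `<!DOCTYPE` in upper case, and a literal '&' in an attribute value has to be written as `&amp;`.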
My next thought was to parse this with
xml.dom.ext.reader.HtmlLib...but HtmlLib doesn't like the namespace
declarations:
xml.dom.NamespaceErr: Invalid or illegal namespace operation
I need to parse this document into a DOM, make some changes, and then
spit back out the modified file as (X?)HTML (ideally well-formed). Am
I going to be able to do this with PyXML? If not, I'd love to hear
your suggestions for the appropriate tools.
Thanks!
-- Lars
--
Lars Kellogg-Stedman <lars at larsshack.org>