[XML-SIG] Parsing malformed XHTML

Sat May 20 03:09:36 CEST 2006

Hello all,

There's a document out there on the 'net that appears to be an XHTML document:

<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en"
  "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:v="urn:schemas-microsoft-com:vml">

Great, right?  But unfortunately it's malformed in a number of ways
(mismatched tags, tag case problems, unescaped '&' in URLs, etc.).
Neither minidom.parse() nor
xml.dom.ext.reader.Sax2.Reader.fromStream() will parse it correctly:

  xml.sax._exceptions.SAXParseException: foo.html:2:0: syntax error

And even if one gets rid of the bogus doctype declaration, the rest of
the document just makes the parsers fall over:

  xml.sax._exceptions.SAXParseException: foo.html:14:53: not well-formed (invalid token)
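
For reference, the second failure can be reproduced in isolation. This is only a minimal sketch: the fragment is a made-up stand-in for the real document, try_parse() is a hypothetical helper, and it uses the standard library's xml.dom.minidom on a current Python rather than the PyXML readers named above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical fragment, not the actual document: the '&' in the URL
# should have been written as '&amp;'.
SNIPPET = '<p><a href="page?a=1&b=2">link</a></p>'


def try_parse(text):
    """Return None if text is well-formed XML, else the parser's message."""
    try:
        minidom.parseString(text)
    except ExpatError as exc:
        return str(exc)
    return None


# The raw fragment is rejected; escaping the ampersand makes it parse.
print(try_parse(SNIPPET))
print(try_parse(SNIPPET.replace("&", "&amp;")))
```

Escaping by hand like this only helps with the '&' problem; it does nothing for the mismatched tags or the tag case issues.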

My next thought was to parse this with
xml.dom.ext.reader.HtmlLib...but HtmlLib doesn't like the namespace
declarations:

  xml.dom.NamespaceErr: Invalid or illegal namespace operation

I need to parse this document into a DOM, make some changes, and then
spit back out the modified file as (X?)HTML (ideally well-formed).  Am
I going to be able to do this with PyXML?  If not, I'd love to hear
your suggestions for the appropriate tools.
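
To make the goal concrete, here is a rough sketch of the kind of tolerant round trip described above. It is not PyXML: it assumes Python 3's standard library (html.parser and xml.dom.minidom), and DomBuilder, VOID, SAMPLE and parse_tolerant() are hypothetical names invented for the illustration. The recovery rules are deliberately crude (close whatever is still open when an end tag arrives, ignore stray end tags), so treat it as a starting point, not a full HTML repair tool.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never have content (a partial list, assumed for this sketch).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class DomBuilder(HTMLParser):
    """Build a minidom tree from sloppy (X)HTML, guessing at missing end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        self.stack = [self.doc]  # currently open nodes, innermost last

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lower-cased the tag and attribute names.
        if len(self.stack) == 1 and self.doc.documentElement is not None:
            return  # ignore anything after the root element has closed
        elem = self.doc.createElement(tag)
        for name, value in attrs:
            # Attributes without a value (e.g. "checked") repeat their name.
            elem.setAttribute(name, name if value is None else value)
        self.stack[-1].appendChild(elem)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, plus anything
        # left open inside it; ignore end tags that match nothing.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        # A Document node cannot hold text, so skip top-level whitespace.
        if len(self.stack) > 1:
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_tolerant(text):
    builder = DomBuilder()
    builder.feed(text)
    builder.close()
    return builder.doc


# Made-up input with upper-case tags, an unclosed <b>, and a bare '&'.
SAMPLE = ('<HTML xmlns="http://www.w3.org/1999/xhtml"><BODY>'
          '<p>Hi <b>there</p><a HREF="x?a=1&b=2">link</a>'
          '</BODY></html>')

doc = parse_tolerant(SAMPLE)
print(doc.toxml())
```

The resulting Document can be modified with the usual DOM calls before serializing, and toxml() takes care of escaping the '&' on the way out. Namespace declarations such as xmlns:v are carried along as plain attributes here, with no namespace checking at all.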

Thanks!

-- Lars

-- 
Lars Kellogg-Stedman <lars at larsshack.org>