[XML-SIG] How to get SAX to parse not well formed HTML doc?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 18 Jul 2001 11:02:33 +0200


> I've used the attached script to turn html into xml for minidom, and it 
> seems to work fairly well so long as the html doesn't contain text cut 
> and pasted from Microsoft Word. 

Hi Douglas,

Please note that your approach has several problems. In particular, the
converter does not take the HTML DTD into account, so it cannot tell
where an omitted end tag belongs. E.g. converting

<html>
<head>
<title>Hallo
<body>
</html>

will give you

<!-- Yay!, successfully parsed -->
<html>
<head>
<title>Hallo
<body>
</body></title></head></html>

While this is well-formed XML, it is not valid XHTML; it should read

<html>
<head>
<title>Hallo</title>
</head>
<body>
</body></html>

instead (i.e. title and head must close before body opens). Another
thing I noticed is that it mangles character entity references, e.g.

<html>
<head>
<title>Hall&ouml;chen</title>
</head>
<body>
</body></html>

is converted to

<html>
<head>
<title>Hall&amp;ouml;chen</title>
</head>
<body>
</body></html>
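
A minimal sketch of how both points could be handled, assuming a tiny
hand-written table in place of the real HTML DTD. The names IMPLIED_END,
TOKEN and to_xhtml are made up for this illustration, the tokenizer is a
deliberately naive regular expression that drops attributes, and entity
references are resolved to plain characters via html.unescape rather than
kept as references. It is not the script under discussion, only one way
the idea might look:

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical, heavily abridged stand-in for the HTML DTD's content model:
# opening the key element implicitly closes any listed element still open.
IMPLIED_END = {"body": {"title", "head"}, "p": {"p"}, "li": {"li"}}

# A token is either a start/end tag (attributes are ignored) or a text run.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def to_xhtml(source):
    """Return source with omitted end tags inserted and entities resolved."""
    out, stack = [], []

    def close_top():
        out.append("</%s>" % stack.pop())

    for match in TOKEN.finditer(source):
        slash, name, text = match.groups()
        if text is not None:
            # Resolve &ouml; and friends to characters first, then escape
            # only the XML specials, so "&amp;ouml;" can never be produced.
            out.append(escape(html.unescape(text)))
            continue
        name = name.lower()
        if slash:
            if name in stack:  # stray end tags are ignored
                while stack[-1] != name:
                    close_top()
                close_top()
        else:
            # Close every open element that this start tag implicitly ends.
            while stack and stack[-1] in IMPLIED_END.get(name, ()):
                close_top()
            stack.append(name)
            out.append("<%s>" % name)
    while stack:  # close whatever is still open at end of input
        close_top()
    return "".join(out)


print(to_xhtml("<html>\n<head>\n<title>Hallo\n<body>\n</html>\n"))
```

A real converter would need the full set of omissible end tags from the
DTD, plus attribute and empty-element handling, which this sketch leaves out.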


> Another thing I've done is put tohtml() and writehtml() methods in my 
> version of minidom. They're the same as toxml & writexml, except they 
> test empty elements against a tuple: br, img, link and so forth are 
> rendered <br /> (note the space) while other empty tags are written the 
> long way - <td></td>, <p></p> etc. It's really simple. Would this be of 
> any use to anyone else, or would it just clutter up minidom.py?

I don't think it should go into minidom. Instead, it might be useful
to have such a function as a stand-alone library, which prints
arbitrary XHTML DOM trees. In fact, the best thing may be to extend
the XHTML pretty printer with such a feature.
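
As a rough sketch of what such a stand-alone function might look like,
assuming only the standard minidom API: EMPTY_ELEMENTS and serialize_xhtml
are hypothetical names, the tuple lists the elements I believe are declared
EMPTY in the XHTML 1.0 DTDs, and comments, processing instructions and the
doctype are simply skipped.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr

# Elements declared EMPTY in the XHTML 1.0 DTDs (assumed list, for illustration).
EMPTY_ELEMENTS = ("area", "base", "basefont", "br", "col", "frame", "hr",
                  "img", "input", "isindex", "link", "meta", "param")


def _write(node, out):
    if node.nodeType == node.DOCUMENT_NODE:
        for child in node.childNodes:
            _write(child, out)
    elif node.nodeType == node.TEXT_NODE:
        out.append(escape(node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        # Sort attributes so the output is stable from run to run.
        attrs = "".join(" %s=%s" % (name, quoteattr(value))
                        for name, value in sorted(node.attributes.items()))
        if node.tagName in EMPTY_ELEMENTS and not node.hasChildNodes():
            # Note the space before the slash, for older HTML browsers.
            out.append("<%s%s />" % (node.tagName, attrs))
        else:
            # Everything else is written the long way, even when empty.
            out.append("<%s%s>" % (node.tagName, attrs))
            for child in node.childNodes:
                _write(child, out)
            out.append("</%s>" % node.tagName)
    # Other node types (comments, PIs, doctype) are ignored in this sketch.


def serialize_xhtml(node):
    """Return node serialized with browser-friendly empty elements."""
    out = []
    _write(node, out)
    return "".join(out)


doc = minidom.parseString('<p>a<br/><img src="x.png"/><td/></p>')
print(serialize_xhtml(doc))
```

Keeping this outside minidom.py means it can be applied to any DOM tree
that exposes the same node interface.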

Regards,
Martin