[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Walter Dörwald walter at livinglogic.de
Fri Aug 27 19:52:16 CEST 2004


Daniel Veillard wrote:

 > On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
 >
 >> [...]
 >>There are already tools that make sense of broken HTML: browsers.
 >>
 >>Is there any way to reuse that functionality from Python? I.e.
 >>something like:
 >>
 >>
 >> >>> import mozilla
 >> >>> x = mozilla.parse("http://www.python.org")
 >>
 >>I don't care whether I get a DOM or a string parsable by an
 >>XML parser.
 >
 >   libxml2 HTML parser is part of libxml2 Python bindings.
 >
 >   import libxml2
 >
 >   doc = libxml2.htmlParseFile(URI, None)

This looks great. When I dump the DOM again, the resulting
files look much better than those generated by HTMLParser
from the standard library or my own HTML parser.
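
For reference, a minimal sketch of such a parse-and-dump round trip,
working from an in-memory string instead of a URL. It assumes the
libxml2 Python bindings are installed. BROKEN_HTML and parse_and_dump
are names made up for illustration, and the guarded import is only
there so the snippet degrades gracefully where the bindings are
missing. As far as I can tell, documents created through the bindings
have to be released explicitly with freeDoc().

```python
# Sketch only: assumes the libxml2 Python bindings are installed.
try:
    import libxml2
except ImportError:  # the bindings are not part of the standard library
    libxml2 = None

# Hypothetical sample input: unclosed <p>, missing </body> and </html>.
BROKEN_HTML = "<html><head><title>Example</title></head><body><p>unclosed<br>text"


def parse_and_dump(html_text):
    """Parse tag soup; return (title, serialized document) or None."""
    if libxml2 is None:
        return None
    # The second argument is the encoding; None leaves detection to libxml2.
    doc = libxml2.htmlParseDoc(html_text, None)
    try:
        titles = doc.xpathEval("//title")
        title = titles[0].content if titles else None
        # serialize() dumps the repaired tree back to markup.
        return title, doc.serialize()
    finally:
        # Release the underlying C document explicitly.
        doc.freeDoc()


result = parse_and_dump(BROKEN_HTML)
if result is not None:
    print(result[0])
```

The title is read before freeDoc() is called, because nodes returned
by xpathEval() belong to the document and should not be touched after
it has been freed.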

BTW, I wonder why libxml2 complains about the following:

 >>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>

I think the next version of XIST will use libxml2 instead
of uTidyLib for parsing HTML.

Bye,
    Walter Dörwald



