[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
Walter Dörwald
walter at livinglogic.de
Fri Aug 27 19:52:16 CEST 2004
Daniel Veillard wrote:
> On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
>
>> [...]
>>There are already tools that make sense of broken HTML: browsers.
>>
>>Is there any way to reuse that functionality from Python? I.e.
>>something like:
>>
>>
>> >>> import mozilla
>> >>> x = mozilla.parse("http://www.python.org")
>>
>>I don't care whether I get a DOM or a string parsable by an
>>XML parser.
>
> libxml2 HTML parser is part of libxml2 Python bindings.
>
> import libxml2
>
> doc = libxml2.htmlParseFile(URI, None)
This looks great. When I dump the DOM again, the resulting
files look much better than those generated by HTMLParser
from the standard library or by my own HTML parser.
BTW, I wonder why libxml2 complains about the following:
>>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
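The line being flagged is an XML processing instruction rather than an HTML element, which may be why a parser expecting a tag name after "<" rejects it. As a rough comparison only (this says nothing about libxml2 internals), here is a minimal sketch of how the standard library parser classifies the same line. It assumes Python 3's html.parser module; PICollector and collect are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper: records processing instructions and start tags."""

    def __init__(self):
        super().__init__()
        self.pis = []   # data passed to handle_pi()
        self.tags = []  # names of start tags seen

    def handle_pi(self, data):
        # Called for "<?...>" constructs; for the XML-style "<?...?>"
        # form the trailing "?" is included in data.
        self.pis.append(data)

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect(markup):
    """Parse markup and return (processing instructions, start tag names)."""
    parser = PICollector()
    parser.feed(markup)
    parser.close()
    return parser.pis, parser.tags


if __name__ == "__main__":
    line = '<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>'
    pis, tags = collect(line + "\n<html><head></head></html>")
    print(pis)
    print(tags)
```

Here the stylesheet line is reported through handle_pi and never shows up as a start tag, so it does not disturb the rest of the parse.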
I think the next version of XIST will use libxml2 instead
of uTidyLib for parsing HTML.
Bye,
Walter Dörwald