[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Thu Aug 26 22:24:38 CEST 2004

Chuck Bearden wrote:

> [...]
> I haven't browsed through the dependencies to see what of the other
> Twisted pieces the microdom requires, so I can't say if it is extricable
> from the wider framework.
> 
> One possibility I didn't try was to use tidy to generate real XHTML from
> the crappy HTML.  It might then be possible to use something more
> common like the minidom implementation to navigate the HTML.
> 
> For me, extracting data from malformed but consistent HTML is a 
> necessary task, so I do sometimes have to make some compromises
> in my selection and use of tools.

There are already tools that make sense of broken HTML: browsers.

Is there any way to reuse that functionality from Python? I.e.
something like:

 >>> import mozilla
 >>> x = mozilla.parse("http://www.python.org")

I don't care whether I get a DOM or a string parsable by an
XML parser.

Bye,
    Walter Dörwald