[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
veillard at redhat.com
Thu Aug 26 23:19:00 CEST 2004
On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> Chuck Bearden wrote:
> >I haven't browsed through the dependencies to see what of the other
> >Twisted pieces the microdom requires, so I can't say if it is extricable
> >from the wider framework.
> >One possibility I didn't try was to use tidy to generate real XHTML from
> >the crappy HTML. It might then be possible to use something more
> >common like the minidom implementation to navigate the HTML.
> >For me, extracting data from malformed but consistent HTML is a
> >necessary task, so I do sometimes have to make some compromises
> >in my selection and use of tools.
> There are already tools that make sense of broken HTML: browsers.
> Is there any way to reuse that functionality from Python? I.e.
> something like:
> >>> import mozilla
> >>> x = mozilla.parse("http://www.python.org")
> I don't care whether I get a DOM or a string parsable by an
> XML parser.
The libxml2 HTML parser is part of the libxml2 Python bindings:

    import libxml2
    doc = libxml2.htmlParseFile(URI, None)

At that point doc is a DOM tree, like you would have if you had
parsed XML: you can use XPath, navigate, extract and reserialize.
You may get a bunch of errors and warnings, but you will get a
tree even if the HTML is really bizarre.

    ctxt = doc.xpathNewContext()
    res = ctxt.xpathEval("//head/title")
    # xpathEval returns a list of nodes; fall back when there is no <title>
    if res:
        title = res[0].content
    else:
        title = "Page %s" % (resource)

is the kind of code I use to index HTML pages and feed an
SQL database for searches on xmlsoft.org. I also do

    # We are not interested in parsing errors here
    def callback(ctx, str):
        return
    libxml2.registerErrorHandler(callback, None)

to ignore all errors and warnings, since I run it in cron batches.
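
A minimal sketch pulling these pieces together, assuming the libxml2 Python bindings are installed. The names page_title and quiet are hypothetical helpers, and htmlParseDoc is used in place of htmlParseFile only so the example can run on an in-memory string. The explicit xpathFreeContext() and freeDoc() calls are there because the bindings do not release documents or XPath contexts on their own, which matters in a long-running batch. If the import fails, the sketch just returns the fallback title.

```python
try:
    import libxml2  # libxml2 Python bindings, not part of the standard library
except ImportError:
    libxml2 = None  # assumption for this sketch: degrade to the fallback title


def quiet(ctx, msg):
    """Swallow parser errors and warnings (hypothetical helper)."""
    return


if libxml2 is not None:
    # Route all libxml2 error and warning messages to the no-op handler.
    libxml2.registerErrorHandler(quiet, None)


def page_title(html, resource):
    """Return the <title> text of an HTML string, or "Page <resource>"."""
    fallback = "Page %s" % (resource,)
    if libxml2 is None:
        return fallback
    doc = libxml2.htmlParseDoc(html, None)
    try:
        ctxt = doc.xpathNewContext()
        try:
            # xpathEval returns a list of matching nodes (possibly empty).
            res = ctxt.xpathEval("//head/title")
            if res:
                return res[0].content
            return fallback
        finally:
            ctxt.xpathFreeContext()
    finally:
        doc.freeDoc()


# Broken markup on purpose: the body elements are never closed.
print(page_title("<html><head><title>Hello</title><body><p>unclosed<b>text", "index.html"))
```

Freeing in finally blocks keeps the tree from lingering when the XPath evaluation raises, at the cost of a little nesting.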
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard at redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/