[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Chuck Bearden cbearden at hal-pc.org
Wed Aug 25 22:56:39 CEST 2004


On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
>
> Honestly, I don't think DOM is the way I would personally go about
> processing HTML, which is why I was trying to get at whether there was
> another way for you to meet your needs.

I think I understand what you are getting at, but personally I have
found twisted.web.microdom with 'beExtremelyLenient=True', optionally
preceded by an mx.Tidy cleanup pass, to be invaluable for mining data
from database-generated web pages built with crappy HTML.  Consider the
pages displaying individual patent records at the USPTO, e.g. [1].  If
you need to treat such pages as if they were XML records to be parsed
and loaded into a database, something like twisted.web.microdom is a
big help.

Chuck Bearden

[1] http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6295859.WKU.&OS=PN/6295859&RS=PN/6295859

