[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Uche Ogbuji uche.ogbuji at fourthought.com
Thu Aug 26 20:38:09 CEST 2004


On Wed, 2004-08-25 at 14:56, Chuck Bearden wrote:
> On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
> >
> > Honestly, I don't think DOM is the way I would personally go about
> > processing HTML, which is why I was trying to get at whether there was
> > another way for you to meet your needs.
> 
> I think I understand what you are getting at, but personally I have
> found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps
> an mx.Tidying stage beforehand, to be invaluable in mining data from
> database-generated webpages built with crappy HTML.  Consider the pages
> displaying individual patent records at the USPTO, e.g. [1].  If you 
> need to treat such pages as if they were XML records to be parsed and
> loaded into a database, something like twisted.web.microdom is a big 
> help.

Is this available without installing all of Twisted?


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/


