[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Wed Aug 25 22:32:31 CEST 2004

Quoting Uche Ogbuji <uche.ogbuji at fourthought.com>:

> On Fri, 2004-08-20 at 00:52, xmlsig at codeweld.com wrote:
> > Quoting Uche Ogbuji <uche.ogbuji at fourthought.com>:
> > > On Tue, 2004-08-17 at 05:59, xmlsig at codeweld.com wrote:
> > > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
> > > > >
> > > > > This code leaks substantially
> > > > >
> > > > > from xml.dom.ext.reader.HtmlLib import FromHtml
> > > > > import urllib
> > > > > from xml.dom import ext
> > > > > s = urllib.urlopen( 'http://www.google.com' ).read()
> > > > > while True:
> > > > >     root = FromHtml( s )
> > > > >     ext.ReleaseNode( root )
> > > > >
> > > > > However, this does not (or only very slightly)
> > > > >
> > > > > from xml.dom.ext.reader.Sax2 import Reader
> > > > > import urllib
> > > > > from xml.dom import ext
> > > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > > > > while True:
> > > > >     reader = Reader()
> > > > >     root = reader.fromString( s )
> > > > >     ext.ReleaseNode( root )
> > > > >
> > > > > Any suggestions?
> > > >
> > > > Could anybody reproduce the leak?
> > > > Any suggestions what I do wrong?
> > >
> > > I haven't done much work in HtmlLib since it was rewritten to use
> > > sgmlop.  It will take some heavy digging to find the precise memory
> > > leak.  What's your overall problem?  Could you use Python 2.3's
> > > HTMLParser library instead?
> >
> > The overall problem is that the FromHtml call (in this example) allocates
> > some 100-200 KB per loop that is not freed for the lifetime of the process.
> > The leak is bigger when no ReleaseNode call is made.
>
> By "overall problem" I mean what are you actually trying to do/achieve.
> Since no one has been able to step up to diagnose the memory leak, I'm
> looking to see whether there is another solution that would work for
> you.
>
> > I could of course use other means of extracting information from html, but
> > I thought it would not be needed to reinvent the wheel if somebody has
> > already written a html parser that spits out dom.
>
> Honestly, I don't think DOM is the way I would personally go about
> processing HTML, which is why I was trying to get at whether there was
> another way for you to meet your needs.
>
> I'm sorry that my workload is so heavy that there is no chance I could
> work on figuring out a 4DOM memory leak right now.
>
> Best of luck.

Thanks. Hm, the general task that got me started on this is to continually
extract some information from a website. Specifying the location of this
information with XPath is just a very nice convenience. Can I use XPath
expressions with other parsing techniques too?

Apart from that, I just think a DOM is invaluable when there is a need to
process rather complex markup with all its leaves, for example when you
implement a browser of sorts. Dom-view springs to mind. Use it on a few big
websites for a while and the process starts to lag your computer because it
grows into the hundreds of megabytes.
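
On the XPath question above, here is a rough sketch of one possible answer,
not something taken from PyXML or 4DOM. It assumes a later Python 3 standard
library, where html.parser builds a small xml.etree.ElementTree tree and
ElementTree's limited XPath subset (not full XPath 1.0) is used to query it.
The names TreeParser, parse_html, VOID_TAGS and the "document" wrapper element
are made up for illustration, and the handling of unclosed tags is deliberately
simplistic.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag (illustrative subset).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Hypothetical helper: collects parser events into an ElementTree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Synthetic wrapper so fragments without a single root still work.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. <input disabled>) arrive as None.
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: (v or "") for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the element, never push it.
        ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_html(text):
    """Parse an HTML string and return the synthetic root element."""
    parser = TreeParser()
    parser.feed(text)
    parser.close()
    return parser.root


if __name__ == "__main__":
    page = ('<html><body><div id="main"><a href="/a">one</a><br>'
            '<a href="/b">two</a></div></body></html>')
    tree = parse_html(page)
    for link in tree.findall(".//div[@id='main']/a"):
        print(link.get("href"), link.text)
```

Since the tree is made of plain Python objects, it is released once the last
reference to the root goes away, so no explicit ReleaseNode-style call should
be needed. Whether that keeps memory flat in a long-running loop would still
have to be measured.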