[XML-SIG] Re: xml.dom.ext.reader.HtmlLib memory leak?

Thu Aug 26 17:01:35 CEST 2004

<xmlsig at codeweld.com> wrote:

> Apart from that, I just think a "dom" is invaluable when there is a need to
> process a rather complex markup with all leaves, say for example when you
> implement a browser of sorts. Dom-view springs to mind. Use it on a few big
> websites for a while and the process starts to lag your computer because it
> grows in the hundreds of megabytes.

Does the leak have any relation to the size of the page you're parsing?

The sgmlop parser in PyXML is a fork of the pythonware/effbot.org version, and I don't
think it supports cyclic garbage collection.  (Version 1.1 of the pythonware/effbot.org
version does.)

This means that code using it *must* make sure to explicitly kill the parser object when
parsing is done.

I don't have PyXML on this machine, but Google found this page:

    http://aspn.activestate.com/ASPN/Mail/Message/xml-checkins/678664

which contains this initialization code:

    def initParser(self, parser):
        self._parser = parser
        self._parser.register(self)
        return

which creates a cycle: self contains a reference to the parser, which contains
references to bound methods, which contain references back to self.

To break the cycle, you must arrange for the code to do e.g.

        self._parser = None

when you're done parsing.
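
Here's a minimal sketch of the cycle and the fix, in case it helps.  It is not
PyXML code: FakeParser is a hypothetical pure-Python stand-in for the sgmlop
parser, Reader and close() are made-up names, and gc.disable() is only there to
imitate a parser type that the cycle collector cannot handle.  It assumes
CPython's reference counting.

```python
import gc
import weakref


class FakeParser:
    """Hypothetical stand-in for the sgmlop parser object."""

    def register(self, target):
        # Storing a bound method keeps a strong reference to target.
        self.handle_data = target.handle_data


class Reader:
    """Hypothetical reader using the initParser pattern quoted above."""

    def initParser(self, parser):
        self._parser = parser
        self._parser.register(self)  # reader -> parser -> bound method -> reader

    def handle_data(self, data):
        pass

    def close(self):
        # Break the cycle by dropping the reference to the parser.
        self._parser = None


def is_freed_after_del(call_close):
    """Return True if the reader is deallocated once its last name is deleted."""
    reader = Reader()
    reader.initParser(FakeParser())
    probe = weakref.ref(reader)
    if call_close:
        reader.close()
    del reader
    return probe() is None


gc.disable()  # imitate a type without cycle-collection support
try:
    leaked = not is_freed_after_del(call_close=False)
    freed = is_freed_after_del(call_close=True)
finally:
    gc.enable()
    gc.collect()  # clean up the reader leaked on purpose above

print("leaked without close():", leaked)
print("freed with close():", freed)
```

With a real extension type that lacks collector support, re-enabling the
collector would not rescue the leaked object, which is why the explicit reset
matters.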

Alternatively, you could probably switch to the effbot.org version of sgmlop:

    http://effbot.org/downloads#sgmlop

(I haven't tested this with PyXML, but it might work.  Or not.)

</F>