[Python-bugs-list] [ python-Bugs-535474 ] xml.sax memory leak with ExpatParser

noreply@sourceforge.net noreply@sourceforge.net
Wed, 03 Apr 2002 23:14:06 -0800


Bugs item #535474, was opened at 2002-03-26 23:24
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=535474&group_id=5470

Category: XML
Group: Python 2.1.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Danny Yoo (dyoo)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: xml.sax memory leak with ExpatParser

Initial Comment:
I've isolated a memory leak in the ExpatParser that
deals with the destruction of ContentHandlers.  I'm
including my test program test_memory_leak.py that
tests the behavior: I generate a bunch of
ContentHandlers and see if they get destroyed reliably.


This appears to affect Python 2.1.1 and 2.1.2. 
Thankfully, the leak appears to be fixed in 2.2.1c. 
Here's some of the test runs:

### Python 2.1.1:
[dyoo@tesuque dyoo]$ /opt/Python-2.1.1/bin/python test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:



Test2:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
###



### Python 2.1.2:
[dyoo@tesuque dyoo]$ /opt/Python-2.1.2/bin/python test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:
TestParser destructed.
TestParser destructed.



Test2:
###


### Python 2.2.1c
[dyoo@tesuque dyoo]$ /opt/Python-2.2.1c2/bin/python test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.



Test2:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
###





----------------------------------------------------------------------

>Comment By: Danny Yoo (dyoo)
Date: 2002-04-04 07:14

Message:
Logged In: YES 
user_id=49843

Martin is right, and I need to retract part of my bug report. I
just remembered that instances of classes with a __del__ method
aren't automatically cleaned up by gc.collect() when they are
part of a reference cycle.  I triggered a pseudo-Heisenbug
during my testing.

After removing the __del__ method from my test class and by
using parseString() instead of feed(), I've verified that
the XML parsing isn't the source of my memory leak.  (Test
file test_memory_leak_2.py included.)
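
One way to observe handler lifetime without a __del__ method is a weak reference. The following is a minimal sketch written for Python 3, not the attached test_memory_leak_2.py; CountingHandler and parse_once are hypothetical names, and weakref is swapped in here as the observation technique.

```python
import gc
import weakref
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Hypothetical handler with no __del__ method."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def parse_once(data):
    """Parse data with a fresh handler; return a weak reference and the count."""
    handler = CountingHandler()
    xml.sax.parseString(data, handler)
    return weakref.ref(handler), handler.elements


handler_ref, element_count = parse_once(b"<doc><item/><item/></doc>")
gc.collect()  # sweep up any reference cycles left behind by the parse
print(element_count, handler_ref() is None)
```

If the weak reference is dead after the collection pass, the handler was not leaked by the parse.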

However, after further investigation, I did find the true
source of my problems, in MySQLdb:

http://sourceforge.net/tracker/index.php?func=detail&aid=536624&group_id=22307&atid=374932

Thank you again for looking into this.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-04 07:05

Message:
Logged In: YES 
user_id=21627

Also, when parsing the large XML file: if you invoke
gc.collect() after each iteration, memory consumption will
go down, and not grow over time. The reason that GC does not
trigger automatically is that you allocate all the space
through strings. GC will be invoked after 1000 new container
objects have been allocated, but you exhaust the memory
before that - so either set the GC threshold down, or invoke
GC on your own.
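
Both options can be sketched as follows, assuming Python 3. Node is a hypothetical class that exists only to create cyclic garbage, and the threshold value of 100 is an arbitrary illustration, not a recommendation.

```python
import gc


class Node:
    """Hypothetical container that forms a reference cycle with itself."""

    def __init__(self):
        self.me = self


gc.disable()  # stand-in for "automatic GC has not triggered yet"
collected_per_iteration = []
for iteration in range(3):
    for _ in range(100):
        Node()  # each instance becomes cyclic garbage immediately
    # Option 1: run an explicit collection at the end of each iteration.
    collected_per_iteration.append(gc.collect())
gc.enable()

# Option 2: lower the first-generation threshold so automatic passes run sooner.
old_threshold = gc.get_threshold()
gc.set_threshold(100)
new_threshold = gc.get_threshold()
gc.set_threshold(*old_threshold)  # restore the interpreter's previous settings
```

gc.collect() returns the number of unreachable objects it found, so the list shows that every iteration's garbage was reclaimed rather than accumulating.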

For the specific application, it would be sufficient if
xml.sax.__init__.parse would invoke
parser.setContentHandler(None) after parsing has completed;
this should already break the cycle.
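
From the caller's side, the same idea might look like the sketch below (Python 3). parse_and_release is a hypothetical helper, not part of xml.sax, and it only detaches the handler after a parse that completed without raising.

```python
import io
import xml.sax


def parse_and_release(source, handler):
    """Parse source, then drop the parser's reference to the handler."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    # Parsing has completed: detach the handler from the parser.
    parser.setContentHandler(None)
    return parser


handler = xml.sax.ContentHandler()
parser = parse_and_release(io.BytesIO(b"<doc/>"), handler)
print(parser.getContentHandler())
```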

To solve the general problem, I like your suggestion of
using a separate locator.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-04 06:21

Message:
Logged In: YES 
user_id=21627

I think the problem is elsewhere. Danny's demo script
clearly is buggy; if you use the IncrementalParser
interface, you *must* invoke .close() at the end of the
parse run; else you get cyclic garbage.
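
A minimal sketch of the feed()/close() pattern, assuming Python 3; NameCollector is a hypothetical handler used only for illustration.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


collector = NameCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)

# Feed the document in pieces, as an incremental caller would.
for chunk in (b"<doc><a/>", b"<b/></doc>"):
    parser.feed(chunk)

# Always finish an incremental parse run with close().
parser.close()
print(collector.names)
```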

The cyclic garbage collector will pick up that garbage; just
invoke gc.collect() after test1 and test2 to see all
TestParsers destroyed.

So I don't think any action on Python code is necessary as a
bug fix; if there are remaining problems, then they must be
in pyexpat.c. I'll investigate 2.52 and 2.54 as candidates
for backporting.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-04 05:35

Message:
Logged In: YES 
user_id=3066

I'll note that the patch is against the release21-maint
branch of Python, and I've only tried it there.  It may need
changes for more recent versions of Python, but that branch
appears most critical since we're looking at a 2.1.3 release
next week.

OK, enough.  I'm heading to bed.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-04 05:25

Message:
Logged In: YES 
user_id=3066

I've attached a patch.  I think this meets all the backward
compatibility requirements and is low-risk, and it removes
the circular reference.  So far I've only tested it against
the standard tests for Python 2.1.*; I'll try it tomorrow
with the sample test code, and think about a test that can
be added to the test suite.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-04 05:07

Message:
Logged In: YES 
user_id=3066

Looking at the code, it's not quite so trivial as I'd
thought, but not entirely difficult either.  I started by
creating a locator that had a reference to the parser object
from xml.parser.expat, but that of course has references to
the ExpatParser, so the cycle still exists.

As long as we're trying to solve the problem for Python 2.1
and newer, though, we can use a locator object that has a
weakref to the ExpatParser object, thereby breaking the
cycle.  I like that.  ;-)
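
The weakref idea can be sketched as below. This is not the attached patch: WeakLocator, SketchParser and SketchHandler are hypothetical stand-ins, the code is written for Python 3, and it relies on CPython's reference counting to free the parser as soon as the last strong reference is dropped.

```python
import gc
import weakref


class WeakLocator:
    """Hypothetical locator that only weakly references its parser."""

    def __init__(self, parser):
        self._ref = weakref.ref(parser)

    def getLineNumber(self):
        parser = self._ref()
        # Report None once the parser has gone away.
        return None if parser is None else parser.lineno


class SketchHandler:
    """Hypothetical content handler that stores its locator."""

    locator = None


class SketchParser:
    """Hypothetical stand-in for ExpatParser; not the real class."""

    def __init__(self, handler):
        self.lineno = 1
        self.handler = handler                # strong: parser -> handler
        handler.locator = WeakLocator(self)   # weak: handler -> parser


gc.disable()  # so that only reference counting can free the parser
handler = SketchHandler()
parser = SketchParser(handler)
line_while_alive = handler.locator.getLineNumber()
del parser  # no cycle exists, so the parser is freed right here
line_after_delete = handler.locator.getLineNumber()
gc.enable()
```

Because the handler never holds a strong reference back to the parser, nothing is left for the cycle detector to find.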

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-04 04:50

Message:
Logged In: YES 
user_id=3066

I don't remember if the cycle detector was enabled by
default in 2.1.* -- that all seems so long ago!

The content handler ends up being part of a circular
reference cycle, with the ExpatParser acting as its own
locator object.  This happens because the parser references
the content handler, and hands a reference to itself for the
content handler to squirrel away as the locator.
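
The shape of that cycle can be reproduced with two toy classes. This is a sketch for Python 3 under stated assumptions: SketchParser and SketchHandler are hypothetical and stand in for ExpatParser and a ContentHandler, and automatic collection is disabled only to make the effect visible.

```python
import gc
import weakref


class SketchHandler:
    """Hypothetical content handler that keeps whatever locator it is given."""

    def setDocumentLocator(self, locator):
        self.locator = locator


class SketchParser:
    """Hypothetical stand-in for a parser that acts as its own locator."""

    def __init__(self, handler):
        self.handler = handler              # parser -> handler
        handler.setDocumentLocator(self)    # handler -> parser


def build():
    """Create a parser/handler pair and return a weak reference to the handler."""
    handler = SketchHandler()
    SketchParser(handler)
    return weakref.ref(handler)


gc.disable()  # leave only reference counting active
handler_ref = build()
alive_before_collect = handler_ref() is not None
gc.collect()  # the cycle detector reclaims the pair
alive_after_collect = handler_ref() is not None
gc.enable()
```

The handler outlives build() even though no variable refers to it, and it is only reclaimed by the explicit collection pass.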

I see two approaches to removing this dependency.  The first
is simply to call setDocumentLocator(None) after calling
endDocument(), but that's fragile; it assumes the parse gets
that far.  The second is to use a separate object to provide
the locator to the content handler; this seems more robust
as it doesn't assume that the parse succeeds.

I'll start on a patch that uses the second approach.

Martin, do you see any other alternatives?  There will be a
2.1.3 release for other reasons, BTW, so this might make it in.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-03-28 22:48

Message:
Logged In: YES 
user_id=31435

Assigned to Fred, after he begged me to <wink>.

----------------------------------------------------------------------

Comment By: Danny Yoo (dyoo)
Date: 2002-03-28 22:37

Message:
Logged In: YES 
user_id=49843

Hi Martin,

Yikes, sorry about that.  I've attached the file.

---


I did some more experimentation with xml.sax, and there does
appear to be a serious problem with object destruction, even
with Python 2.2.1c.

I'm working with a fairly large XML file located on the TIGR
(The Institute for Genomic Research) ftp site.  A sample
file would be something like:

ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PSEUDOCHROMOSOMES/chr1.xml

(60 MBs)

and I noticed that my scripts were leaking memory.  I've
isolated the problem to what looks like a garbage collection
problem: it looks like my ContentHandlers are not getting
recycled.  Here's a simplified program:

###
import xml.sax
import glob
from cStringIO import StringIO


class FooParser(xml.sax.ContentHandler):
    def __init__(self):
        self.bigcontent = StringIO()

    def startElement(self, name, attrs):
        pass

    def endElement(self, name):
        pass

    def characters(self, chars):
        self.bigcontent.write(chars)


filename = '/home/arabidopsis/bacs/20020107/PSEUDOCHROMOSOME/chr1.xml'
i = 0
while 1:
    print "Iteration %d" % i
    xml.sax.parse(open(filename), FooParser())
    i = i + 1
###

I've watched 'top', and the memory usage continues growing.
Any suggestions?  Thanks!

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-03-27 12:24

Message:
Logged In: YES 
user_id=21627

Also, what kind of action do you expect?  Chances are minimal
that there will be a 2.1.3 release, so why bother?

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-03-27 12:23

Message:
Logged In: YES 
user_id=21627

There's no uploaded file!  You have to check the
checkbox labeled "Check to Upload & Attach File"
when you upload a file.

Please try again.

(This is a SourceForge annoyance that we can do
nothing about. :-( )

----------------------------------------------------------------------
