[ python-Bugs-1208304 ] urllib2's urlopen() method causes a memory leak

SourceForge.net noreply at sourceforge.net
Fri Oct 14 06:13:50 CEST 2005


Bugs item #1208304, was opened at 2005-05-25 05:20
Message generated for change (Comment added) made by holdenweb
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1208304&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Extension Modules
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Petr Toman (manekcz)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib2's urlopen() method causes a memory leak

Initial Comment:
It seems that the urlopen(url) method of the urllib2 module 
leaves some uncollectable objects in memory.

Please try the following code:
==========================
if __name__ == '__main__':
  import urllib2
  a = urllib2.urlopen('http://www.google.com')
  del a # or a = None or del(a)
  
  # check for memory leaks
  import gc
  gc.set_debug(gc.DEBUG_SAVEALL)
  gc.collect()
  for it in gc.garbage:
    print it
==========================

In our code, we call urlopen() many times in a loop, and 
the number of unreachable objects grows beyond all 
limits :) We also tried a.close(), but it didn't help.
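
A minimal sketch of that loop (the URL and iteration count
are placeholders; on affected builds len(gc.garbage) keeps
growing):
==========================
if __name__ == '__main__':
  import gc
  import urllib2

  gc.set_debug(gc.DEBUG_SAVEALL)
  for i in range(10):
    a = urllib2.urlopen('http://www.google.com')
    a.close()
    del a
    gc.collect()
    # DEBUG_SAVEALL keeps every collected object in gc.garbage
    print i, len(gc.garbage)
==========================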

You can also try the following:
==========================
def print_unreachable_len():
  # check for memory leaks
  import gc
  gc.set_debug(gc.DEBUG_SAVEALL)
  gc.collect()
  # gc.garbage holds everything collected while DEBUG_SAVEALL is set;
  # the length of its repr is a crude measure of how much has leaked
  return len(str(gc.garbage))
  
if __name__ == '__main__':
  print "at the beginning", print_unreachable_len()

  import urllib2
  print "after import of urllib2", print_unreachable_len()

  a = urllib2.urlopen('http://www.google.com')
  print 'after urllib2.urlopen', print_unreachable_len()

  del a
  print 'after del', print_unreachable_len()
==========================

We're using Windows XP with the latest patches, Python 2.4
(ActivePython 2.4 Build 243 (ActiveState Corp.) based on
Python 2.4 (#60, Nov 30 2004, 09:34:21) [MSC v.1310 
32 bit (Intel)] on win32).

----------------------------------------------------------------------

Comment By: Steve Holden (holdenweb)
Date: 2005-10-14 00:13

Message:
Logged In: YES 
user_id=88157

The Windows 2.4.1 build doesn't show this error, but the
Cygwin 2.4.1 build does still have uncollectable objects
after a urllib2.urlopen(), so there may be a platform
dependency here. There is no 2.4.2 build for Cygwin yet,
and lsof isn't available there, so nothing conclusive.

----------------------------------------------------------------------

Comment By: Brian Wellington (bwelling)
Date: 2005-08-15 14:13

Message:
Logged In: YES 
user_id=63197

The real problem we were seeing wasn't the memory leak, it
was a file descriptor leak.  Leaking references within the
interpreter is bad, but the garbage collector will
eventually notice and clean them up.  Leaking file
descriptors is much worse: gc won't be triggered when the
process reaches its descriptor limit, and the process will
start failing with "Too many open files".

To easily show this problem, run the following from an
interactive python interpreter:

import urllib2
f = urllib2.urlopen('http://www.google.com')
f.close()

and from another window, run "lsof -p <pid of interpreter>".
It should show a TCP socket in CLOSE_WAIT, which means the
file descriptor is still open.  I'm seeing weirdness on
Fedora Core 4 today that I didn't see last week: after a
few seconds, the file descriptor is listed as "can't
identify protocol" instead of TCP.  That's not too
relevant, since it's still open.
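
If lsof isn't handy, a rough fd count is available from
inside the process; a Linux-only sketch (it reads
/proc/self/fd, the same data lsof uses):
==========================
import os

def open_fd_count():
  # each entry in /proc/self/fd is one open file descriptor
  return len(os.listdir('/proc/self/fd'))
==========================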

Repeating the urllib2.urlopen()/close() pair of statements
in the interpreter will leak more fds, which can also be
seen with lsof.
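
Since the leaked objects are plain reference cycles (nothing
involved defines __del__), one stopgap is to force a
collection after each request so the descriptor is released
promptly; a minimal sketch, assuming the response is read to
completion:
==========================
import gc
import urllib2

def fetch(url):
  f = urllib2.urlopen(url)
  try:
    return f.read()
  finally:
    f.close()
    # collecting the 'r.recv = r.read' cycle frees the
    # HTTPResponse, whose socket file object closes the
    # underlying descriptor when it is deallocated
    gc.collect()
==========================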

----------------------------------------------------------------------

Comment By: Sean Reifschneider (jafo)
Date: 2005-08-12 18:30

Message:
Logged In: YES 
user_id=81797

I've just tried it again using the current CVS version as
well as the version installed with Fedora Core 4, and in
both cases I was able to run over 100,000 retrievals of
http://127.0.0.1/test.html and http://127.0.0.1/google.html.
test.html is just "it works" and google.html was generated
with "wget -O google.html http://google.com/".

I was able to reproduce this before, but now I can't.  My
urllib2.py includes the r.recv = r.read line.  I have
upgraded from FC3 to FC4; could this be related to an OS or
library interaction?  I was going to try to confirm the
last message, but now I can't reproduce the failure.

----------------------------------------------------------------------

Comment By: Brian Wellington (bwelling)
Date: 2005-08-11 22:22

Message:
Logged In: YES 
user_id=63197

We just ran into this same problem, and worked around it by
simply removing the 'r.recv = r.read' line in urllib2.py,
and creating a recv alias to the read function in
HTTPResponse ('recv = read' in the class).

Not sure if this is the best solution, but it seems to work.
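
For reference, the class-level half of that workaround can
also be applied at runtime without editing httplib; a sketch
(the 'r.recv = r.read' line itself still has to be removed
from urllib2.py, or do_open overridden, for the cycle to go
away):
==========================
import httplib

# class-level alias, equivalent to adding 'recv = read' inside
# HTTPResponse; unlike the per-instance 'r.recv = r.read' in
# urllib2's do_open, a class attribute creates no reference cycle
httplib.HTTPResponse.recv = httplib.HTTPResponse.read
==========================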

----------------------------------------------------------------------

Comment By: Sean Reifschneider (jafo)
Date: 2005-06-28 23:52

Message:
Logged In: YES 
user_id=81797

I give up; this code is kind of a maze of twisty little
passages.  I did try doing "a.fp.close()", and that didn't
seem to help at all.  I couldn't really make any progress
there.  I also tried "if a.headers.fp:
a.headers.fp.close()", which didn't do anything noticeable.

----------------------------------------------------------------------

Comment By: Sean Reifschneider (jafo)
Date: 2005-06-28 23:27

Message:
Logged In: YES 
user_id=81797

I can reproduce this in both the python.org 2.4 RPM and a
freshly built copy from CVS.  If I run a few thousand
urlopen()s, I get:

Traceback (most recent call last):
  File "/tmp/mt", line 26, in ?
  File "/tmp/python/dist/src/Lib/urllib2.py", line 130, in urlopen
  File "/tmp/python/dist/src/Lib/urllib2.py", line 361, in open
  File "/tmp/python/dist/src/Lib/urllib2.py", line 379, in _open
  File "/tmp/python/dist/src/Lib/urllib2.py", line 340, in _call_chain
  File "/tmp/python/dist/src/Lib/urllib2.py", line 1026, in http_open
  File "/tmp/python/dist/src/Lib/urllib2.py", line 1001, in do_open
urllib2.URLError: <urlopen error (24, 'Too many open files')>

This happens even if I do a.close().  I'll investigate a bit further.

Sean

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2005-06-01 19:13

Message:
Logged In: YES 
user_id=11375

Confirmed.  The objects involved seem to be an HTTPResponse and the 
socket._fileobject wrapper; the assignment 'r.recv = r.read' around 
line 1013 of urllib2.py seems to be critical to creating the cycle.
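
The cycle is easy to demonstrate in isolation; a minimal
sketch with a stand-in class:
==========================
import gc

class Response:
  def read(self):
    return ''

r = Response()
r.recv = r.read  # the bound method references r: a cycle
gc.set_debug(gc.DEBUG_SAVEALL)
del r
gc.collect()
print gc.garbage  # the Response instance shows up as cyclic garbage
==========================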


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1208304&group_id=5470

