[ python-Bugs-1772481 ] urllib2 hangs with some documents.

SourceForge.net noreply at sourceforge.net
Mon Aug 13 05:07:44 CEST 2007


Bugs item #1772481, was opened at 2007-08-12 06:52
Message generated for change (Comment added) made by orsenthil
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1772481&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Creature (acreature)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib2 hangs with some documents.

Initial Comment:
While working on a web spider I encountered a page that causes the read() call on a urllib2 response to hang: it uses 100% of the CPU and never seems to return. I see this behaviour on Python 2.4.4, and several people on 2.5.1 have tried the code below and reported the same behaviour.

By the way, the page it uses is a porn site, but please don't get hung up on that fact. This is a data processing issue, not a subject matter issue. 

This test case is attached as a file, but is also available at http://pastebin.com/d6f98618f . Please note that the user-agent masquerading is present to rule out any issues with the server returning different data to different clients; commenting out the line so Python sends the standard headers still results in the issue occurring.

----------------------------------------------------------------------

Comment By: O.R.Senthil Kumaran (orsenthil)
Date: 2007-08-13 08:37

Message:
Logged In: YES 
user_id=942711
Originator: NO

Yes, I could verify the issue as well as the fix.
Please submit a patch to the patches tracker, or someone with trunk access
can make the change immediately.

----------------------------------------------------------------------

Comment By: Creature (acreature)
Date: 2007-08-12 07:02

Message:
Logged In: YES 
user_id=1407924
Originator: YES

It seems that a fix for this issue is to add "or line == ''" to the condition
on line 525 of httplib.py in Python 2.4.4:

    # read and discard trailer up to the CRLF terminator
    ### note: we shouldn't have any trailers!
    while True:
        line = self.fp.readline()
        # readline() returns '' at EOF, so stop there as well
        if line == '\r\n' or line == '':
            break

I'm told that the same loop is found on line 574 in Python 2.5.
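
A minimal sketch of why the extra check matters, assuming the server closes
the connection before sending the final CRLF. The names discard_trailer,
stop_at_eof and max_reads are hypothetical and not part of httplib; the cap
exists only so the unpatched loop can be observed without hanging.

```python
import io


def discard_trailer(fp, stop_at_eof, max_reads=10000):
    """Discard trailer lines up to the blank CRLF line.

    Returns the number of readline() calls made, or None if the
    hypothetical max_reads cap was hit (the real loop has no cap and
    would spin forever in that case).
    """
    reads = 0
    while reads < max_reads:
        line = fp.readline()
        reads += 1
        # At EOF readline() keeps returning an empty string, which never
        # equals CRLF, so only the stop_at_eof check can end the loop.
        if line == b"\r\n" or (stop_at_eof and line == b""):
            return reads
    return None


# Truncated stream: one trailer line, then EOF with no terminating CRLF.
truncated = b"X-Trailer: 1\r\n"

print(discard_trailer(io.BytesIO(truncated), stop_at_eof=True))
print(discard_trailer(io.BytesIO(truncated), stop_at_eof=False))
```

With stop_at_eof=True the loop ends on the first empty read; with
stop_at_eof=False it only stops because of the cap, which matches the
100% CPU behaviour reported above.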

----------------------------------------------------------------------
