[Python-bugs-list] [ python-Bugs-508157 ] urllib.urlopen results.readline is slow

Sun, 17 Mar 2002 23:05:45 -0800

Bugs item #508157, was opened at 2002-01-24 13:48
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=508157&group_id=5470

Category: Python Library
Group: Python 2.2
>Status: Closed
>Resolution: Invalid
Priority: 5
Submitted By: Keith Davidson (kbdavidson)
>Assigned to: Greg Stein (gstein)
Summary: urllib.urlopen results.readline is slow

Initial Comment:
The socket file object underlying the return value of
urllib.urlopen() is opened without any buffering,
resulting in very slow performance of
results.readline(). The specific problem is in the
httplib.HTTPResponse constructor. It calls
sock.makefile() with a 0 for the buffer size. Forcing
the buffer size to 4096 reduces the time for calling
readline() on a 60K character line from 16 seconds to
0.27 seconds (there is other processing going on here,
but the magnitude of the difference is correct).

I am using Python 2.0, so I cannot easily submit a
patch, but the problem appears to still be present in
the 2.2 source. The fix is to change the 0 passed to
sock.makefile() to 4096 or some other reasonable
buffer size:

class HTTPResponse:
    def __init__(self, sock, debuglevel=0):
        self.fp = sock.makefile('rb', 0)    # <= change 0 to 4096
        self.debuglevel = debuglevel
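
A minimal sketch of why an unbuffered readline() is so costly. It
assumes Python 3's io module rather than the 2.x httplib code under
discussion, and CountingRaw and reads_needed are hypothetical names:
a stand-in for the unbuffered socket file that counts low-level reads,
each of which models one recv() on the socket.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for an unbuffered socket file."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._pos = 0
        self.calls = 0  # number of low-level reads performed

    def readable(self):
        return True

    def readinto(self, buf):
        # Each call models one recv() on the underlying socket.
        self.calls += 1
        chunk = self._data[self._pos:self._pos + len(buf)]
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def reads_needed(data, bufsize):
    """Return how many low-level reads a single readline() costs."""
    raw = CountingRaw(data)
    fp = raw if bufsize == 0 else io.BufferedReader(raw, buffer_size=bufsize)
    fp.readline()
    return raw.calls


line = b"x" * 59999 + b"\n"          # a 60K character line, as in the report
unbuffered = reads_needed(line, 0)   # one low-level read per byte
buffered = reads_needed(line, 4096)  # a handful of 4096-byte reads
print(unbuffered, buffered)
```

Without a buffer, readline() has no way to look ahead for the newline,
so it has to fetch the line one byte at a time.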

----------------------------------------------------------------------

>Comment By: Greg Stein (gstein)
Date: 2002-03-17 23:05

Message:
Logged In: YES 
user_id=6501

Andrew is correct. The buffering was turned off
(specifically) so that the reading of one response will not
consume a portion of the next response.
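
A minimal sketch of the over-reading problem, assuming Python 3 and a
local socket.socketpair() in place of a real HTTP connection (both are
assumptions; read_one_line is a hypothetical helper, not httplib code).
Two "responses" are sent back to back, one line is read through
makefile(), and the socket is then checked for what remains.

```python
import socket


def read_one_line(buffering):
    """Read one line via makefile(); return (line, bytes left on the socket)."""
    server, client = socket.socketpair()
    try:
        # Two "responses" arrive back to back, as on a persistent connection.
        server.sendall(b"response-1\nresponse-2\n")
        server.close()  # so the recv() below sees EOF instead of blocking

        fp = client.makefile('rb', buffering)
        line = fp.readline()
        # What a second reader, starting from the socket, would still find.
        leftover = client.recv(4096)
        fp.close()
        return line, leftover
    finally:
        client.close()
        server.close()


print(read_one_line(0))     # unbuffered: the second response is still there
print(read_one_line(4096))  # buffered: the buffer has swallowed it
```

With buffering on, the second response sits in the first file object's
private buffer, where a new response object created from the socket
cannot reach it.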

Jeremy first found the over-reading problem a couple years
ago, and we solved the problem then. To read the thread:
http://mail.python.org/pipermail/python-dev/2000-June/004409.html

After the HTTP response's headers have been read, then it
can be determined whether the connection will be closed at
the end of the response, or whether it will stay open for
more requests to be performed. If it is going to be closed,
then it is possible to use buffering. Of course, that is
*after* the headers, so you'd actually need to do a second
dup/makefile and turn on buffering. This also means that you
wouldn't get the buffering benefits while reading headers.
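
A rough sketch of that two-phase approach, assuming Python 3 and a
deliberately simplified rule: the connection is treated as closing only
when a "Connection: close" header is present, whereas a real
implementation would also consider the HTTP version and other headers.
open_body_reader is a hypothetical helper, not part of httplib.

```python
import socket


def open_body_reader(sock):
    """Parse headers unbuffered, then pick a reader for the body.

    Returns (will_close, fp), where fp is positioned at the body.
    """
    raw = sock.makefile('rb', 0)           # unbuffered: cannot over-read
    raw.readline()                         # status line, ignored in this sketch
    will_close = False
    while True:
        line = raw.readline()
        if line in (b"\r\n", b"\n", b""):  # blank line (or EOF) ends the headers
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"connection":
            will_close = value.strip().lower() == b"close"
    if will_close:
        # Nothing follows this response, so a second, buffered file is safe.
        raw.close()
        return True, sock.makefile('rb', 4096)
    # The connection stays open: keep reading unbuffered.
    return False, raw
```

The headers are still read without buffering in both cases; only the
body of a closing connection gets the faster path.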

It could be possible to redesign the connection/response
classes to keep a buffer in the connection object, but that
is quite a bit more involved. It also complicates the
passing of the socket to the response object in some cases.

I'm going to close this as "invalid" since the proposed fix
would break the code.

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2002-03-14 15:32

Message:
Logged In: YES 
user_id=11375

Greg Stein originally wrote it;  I'll ping him.

I suspect it might be because of HTTP pipelining; if multiple
responses will be returned over a socket, you probably can't
use buffering because the buffer might consume the end of
response #1 and the start of response #2.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-01-25 06:12

Message:
Logged In: YES 
user_id=6380

I wonder why the author explicitly turned off buffering.
There probably was a reason? Without knowing why, we can't
just change it.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-01-24 13:54

Message:
Logged In: NO 

What platform?

--Guido (not logged in)

----------------------------------------------------------------------
