urllib performance issue on FreeBSD 4.x
I've been following up a thread on python-list about lousy performance of urllib.urlopen(...).read() on FreeBSD 4.x compared to using wget to retrieve the same file.

I've determined that the following patch (against 2.2.2) makes an enormous difference in throughput:

-----8<-----8<-----8<-----
*** Lib/httplib.py.orig Mon Oct 7 11:18:17 2002
--- Lib/httplib.py Sun Nov 24 14:44:16 2002
***************
*** 210,216 ****
      # See RFC 2616 sec 19.6 and RFC 1945 sec 6 for details.

      def __init__(self, sock, debuglevel=0, strict=0):
!         self.fp = sock.makefile('rb', 0)
          self.debuglevel = debuglevel
          self.strict = strict

--- 210,216 ----
      # See RFC 2616 sec 19.6 and RFC 1945 sec 6 for details.

      def __init__(self, sock, debuglevel=0, strict=0):
!         self.fp = sock.makefile('rb', -1)
          self.debuglevel = debuglevel
          self.strict = strict

-----8<-----8<-----8<-----

Without this patch, d/l a 4MB file from localhost gets a bit over 110kB/s, with the patch gets 4-5.5MB/s on the same system (FBSD 4.4 SMP, 2xC300A, 128MB RAM, ATA66 HD).

My question:
- why is the socket.fp being set to unbuffered?

I can't check the FBSD library source at the moment (and can't get to the RFCs mentioned above either at the moment, for that matter), and can only speculate that fread() is resorting to reading from the socket a character at a time. So I'm not sure whether this should be treated as a FreeBSD issue and/or a Python issue.

Another poster in the same thread mentions seeing somewhat similar performance problems on Win2k, although not nearly as bad.

FWIW, my test script is

-----8<-----8<-----8<-----
import time
import urllib

t1 = time.time()
u = urllib.urlopen("http://localhost/big_file").read()
t2 = time.time()
# len(u) is in bytes, so divide by 1024 to report kB/s
print 'throughput: %f kB/s' % (len(u) / 1024.0 / (t2 - t1))
-----8<-----8<-----8<-----

Reactions?

--
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: andymac@bullseye.apana.org.au | Snail: PO Box 370
        andymac@pcug.org.au           |        Belconnen ACT 2616
Web:    http://www.andymac.org/       |        Australia
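The speculation about character-at-a-time reads can be illustrated, though not confirmed for FreeBSD's stdio, with a small sketch. It assumes modern Python 3, whereas the thread concerns Python 2.2.2, so it shows the general effect of an unbuffered file object rather than the exact code path. `CountingRaw` and `count_reads` are hypothetical helpers that count how many reads reach the socket while lines are pulled through an unbuffered or a buffered file object.

```python
import io
import socket


class CountingRaw(io.RawIOBase):
    """Hypothetical helper: counts reads that reach the wrapped raw stream."""

    def __init__(self, raw):
        super().__init__()
        self.raw = raw
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buf):
        self.calls += 1
        return self.raw.readinto(buf)


def count_reads(buffered, payload):
    """Return (lines_read, raw_read_calls) for one pass over payload."""
    receiver, sender = socket.socketpair()
    try:
        sender.sendall(payload)
        sender.close()  # the receiver sees EOF once the payload is drained
        raw = CountingRaw(receiver.makefile("rb", buffering=0))
        fp = io.BufferedReader(raw) if buffered else raw
        lines = 0
        while fp.readline():
            lines += 1
        return lines, raw.calls
    finally:
        sender.close()
        receiver.close()


# 100 lines of 40 bytes each; small enough to fit in a socket buffer.
PAYLOAD = (b"x" * 39 + b"\n") * 100

if __name__ == "__main__":
    print("unbuffered:", count_reads(False, PAYLOAD))
    print("buffered:  ", count_reads(True, PAYLOAD))
```

In this sketch the unbuffered file object issues roughly one read per byte, while the buffered one needs only a handful. That is the same kind of gap the patch above closes.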
Andrew MacIntyre wrote:
> - why is the socket.fp being set to unbuffered?
I believe it prevents deadlocks. In HTTP/1.1, the server may not close the connection, but may refuse to send more data until it receives the next command. So you must be very careful to not read more data from the socket than the protocol guarantees you to be present.

I believe stdio would not apply the necessary care: if it wants to fill the buffer, it will block. It won't see EOF because there is none, but there won't be any more data, because the server won't send any until we send the next command. We won't send the next command, since we are blocked.

Regards,
Martin
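The care Martin describes can be sketched as follows, assuming modern Python 3 rather than the 2.2.2 code under discussion. On a connection that the server keeps open, the client reads only as many body bytes as the Content-Length header promises. `read_response` and `demo` are hypothetical names, and the sketch deliberately ignores chunked encoding and responses without a Content-Length.

```python
import socket


def read_response(fp):
    """Read one HTTP response from fp without reading past its end."""
    status = fp.readline().rstrip(b"\r\n")
    headers = {}
    while True:
        line = fp.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # blank line ends the header block
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    # Ask for exactly the number of bytes the server promised, no more.
    length = int(headers.get(b"content-length", b"0"))
    body = fp.read(length)
    return status, headers, body


def demo():
    client, server = socket.socketpair()
    try:
        # The "server" answers but keeps the connection open (keep-alive).
        server.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
        client.settimeout(5)  # safety net so a mistake cannot hang forever
        with client.makefile("rb") as fp:
            return read_response(fp)
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(demo())
```

A call such as `fp.read()` with no length would instead wait for an EOF that a keep-alive server never sends, which is the blocked state described above.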
> I've been following up a thread on python-list about lousy performance of urllib.urlopen(...).read() on FreeBSD 4.x compared to using wget to retrieve the same file.
>
> I've determined that the following patch (against 2.2.2) makes an enormous difference in throughput:
>
> -----8<-----8<-----8<-----
> *** Lib/httplib.py.orig Mon Oct 7 11:18:17 2002
> --- Lib/httplib.py Sun Nov 24 14:44:16 2002
> ***************
> *** 210,216 ****
>       # See RFC 2616 sec 19.6 and RFC 1945 sec 6 for details.
>
>       def __init__(self, sock, debuglevel=0, strict=0):
> !         self.fp = sock.makefile('rb', 0)
>           self.debuglevel = debuglevel
>           self.strict = strict
>
> --- 210,216 ----
>       # See RFC 2616 sec 19.6 and RFC 1945 sec 6 for details.
>
>       def __init__(self, sock, debuglevel=0, strict=0):
> !         self.fp = sock.makefile('rb', -1)
>           self.debuglevel = debuglevel
>           self.strict = strict
>
> -----8<-----8<-----8<-----
>
> Without this patch, d/l a 4MB file from localhost gets a bit over 110kB/s, with the patch gets 4-5.5MB/s on the same system (FBSD 4.4 SMP, 2xC300A, 128MB RAM, ATA66 HD).
>
> My question:
> - why is the socket.fp being set to unbuffered?
I can't make time for a full essay on the issue, but I believe that it must be unbuffered because some applications want to read until the end of the headers and then pass the file descriptor to a subprocess or to code that uses the socket directly.

--Guido van Rossum (home page: http://www.python.org/~guido/)
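The hazard Guido describes can be shown with a short sketch, again assuming modern Python 3 rather than 2.2.2. A buffered file object may pull body bytes into its own buffer while reading the headers, so code that then uses the socket directly never sees them. `leftover_after_headers` is a hypothetical helper; the `buffering` values 0 and -1 mirror the two sides of the patch.

```python
import socket


def leftover_after_headers(buffering):
    """Read headers through a file object, then read the raw socket.

    Returns whatever bytes are still available on the socket itself.
    """
    reader, writer = socket.socketpair()
    try:
        writer.sendall(b"X-Demo: 1\r\n\r\nBODY")
        writer.close()  # so recv() below returns b"" instead of blocking
        with reader.makefile("rb", buffering=buffering) as fp:
            # Consume header lines up to and including the blank line.
            while fp.readline() not in (b"\r\n", b""):
                pass
            # Hand-off point: from here on the socket is used directly.
            return reader.recv(100)
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    print("unbuffered leaves:", leftover_after_headers(0))
    print("buffered leaves:  ", leftover_after_headers(-1))
```

With `buffering=0` the body is still on the socket after the headers. With `buffering=-1` it has already been swallowed by the file object's buffer, which is why a subprocess handed the bare descriptor would miss it.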
Guido wrote:
> > Without this patch, d/l a 4MB file from localhost gets a bit over 110kB/s, with the patch gets 4-5.5MB/s on the same system
> >
> > - why is the socket.fp being set to unbuffered?
>
> I can't make time for a full essay on the issue, but I believe that it must be unbuffered because some applications want to read until the end of the headers and then pass the file descriptor to a subprocess or to code that uses the socket directly.

sounds like it would be a good idea to provide a subclass (or option) for applications that don't need that feature.

</F>
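One possible shape for such a subclass is sketched below, assuming modern Python 3. `Response` is a toy stand-in and not httplib's real response class, and `BufferedResponse` and `demo` are likewise hypothetical names. The idea is only that the buffering mode becomes a class attribute that a subclass can override.

```python
import io
import socket


class Response:
    """Toy stand-in for an HTTP response class (not the real httplib API)."""

    buffering = 0  # unbuffered, as in the code under discussion

    def __init__(self, sock):
        self.fp = sock.makefile("rb", self.buffering)

    def read(self, amt):
        return self.fp.read(amt)

    def close(self):
        self.fp.close()


class BufferedResponse(Response):
    """Opt-in variant for callers that never use the socket directly."""

    buffering = -1  # let makefile() choose its default buffer size


def demo(response_class):
    """Return (is_buffered, data) for a 7-byte payload."""
    client, server = socket.socketpair()
    try:
        server.sendall(b"payload")
        server.close()
        resp = response_class(client)
        try:
            return isinstance(resp.fp, io.BufferedReader), resp.read(7)
        finally:
            resp.close()
    finally:
        server.close()
        client.close()


if __name__ == "__main__":
    print(demo(Response))
    print(demo(BufferedResponse))
```

Keeping the unbuffered behaviour as the default would preserve the hand-off use case, while callers that only ever call `read()` could choose the buffered variant.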
On Sun, 24 Nov 2002, Fredrik Lundh wrote:
> > > Without this patch, d/l a 4MB file from localhost gets a bit over 110kB/s, with the patch gets 4-5.5MB/s on the same system
> > >
> > > - why is the socket.fp being set to unbuffered?
> >
> > I can't make time for a full essay on the issue, but I believe that it must be unbuffered because some applications want to read until the end of the headers and then pass the file descriptor to a subprocess or to code that uses the socket directly.
>
> sounds like it would be a good idea to provide a subclass (or option) for applications that don't need that feature.
Thanks for the info. I'll add preparing a patch for this to my projects list...

--
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: andymac@bullseye.apana.org.au | Snail: PO Box 370
        andymac@pcug.org.au           |        Belconnen ACT 2616
Web:    http://www.andymac.org/       |        Australia
participants (4)
- Andrew MacIntyre
- Fredrik Lundh
- Guido van Rossum
- martin@v.loewis.de