[ python-Bugs-1016880 ] urllib.urlretrieve silently truncates downloads

Fri Dec 24 15:30:05 CET 2004

Bugs item #1016880, was opened at 2004-08-26 15:58
Message generated for change (Comment added) made by irmen
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1016880&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 6
Submitted By: David Abrahams (david_abrahams)
Assigned to: Johannes Gijsbers (jlgijsbers)
Summary: urllib.urlretrieve silently truncates downloads

Initial Comment:
The following script appears to be unreliable on all 
versions of Python we can find.  The file being 
downloaded is approximately 34 MB.  Browsers such as 
IE and Mozilla have no problem downloading the whole 
thing.

----

import urllib
import os

os.chdir('/tmp')
urllib.urlretrieve(
    'http://cvs.sourceforge.net/cvstarballs/boost-cvsroot.tar.bz2',
    'boost-cvsroot.tar.bz2')
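
Illustrative sketch, not part of the original report: since
urlretrieve returns a (filename, headers) pair, a caller can compare
the size of the saved file with the Content-Length header after the
call returns. The helper name download_is_complete is hypothetical,
and the sketch assumes only that the headers object has a dict-style
get() method.

```python
import os


def download_is_complete(path, headers):
    """Compare the size of the file at `path` with Content-Length.

    Returns True or False when the header is present, and None when
    there is no Content-Length to compare against.
    (Hypothetical helper, not part of urllib.)
    """
    expected = headers.get("Content-Length")
    if expected is None:
        return None  # nothing to check against
    # A file shorter than the advertised length was truncated.
    return os.path.getsize(path) >= int(expected)
```

With the script above, this could be used as
`filename, headers = urllib.urlretrieve(url, target)` followed by
`download_is_complete(filename, headers)`.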

----------------------------------------------------------------------

Comment By: Irmen de Jong (irmen)
Date: 2004-12-24 15:30

Message:
Logged In: YES 
user_id=129426

Suggested addition to the doc of urllib (liburllib.tex, if
I'm not mistaken):

"""

urlretrieve will raise IOError when it detects that the amount of
data available was less than the expected amount (which is the size
reported by a Content-Length header). This can occur, for example,
when the download is interrupted.

The Content-Length is treated as a lower bound (just like tools such
as wget and Firefox appear to do): if there's more data to read,
urlretrieve reads more data, but if less data is available, it raises
IOError.

If no Content-Length header was supplied, urlretrieve cannot check
the size of the data it has downloaded, and just returns it. In this
case you just have to assume that the download was successful.
"""

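The behaviour described in the suggested documentation can be
sketched roughly as follows. This is not the code from the patch; it
is a minimal illustration written against file-like objects so that
it runs without a network connection. The name copy_checked and its
parameters are assumptions made for the example.

```python
import io


def copy_checked(source, dest, expected_size=None, block_size=8192):
    """Copy `source` to `dest` block by block until EOF.

    `expected_size` plays the role of Content-Length and is treated
    as a lower bound: extra data is accepted, missing data raises
    IOError. With expected_size=None no check is possible.
    """
    read = 0
    while True:
        block = source.read(block_size)
        if not block:  # empty read: EOF, or the peer closed early
            break
        dest.write(block)
        read += len(block)
    if expected_size is not None and read < expected_size:
        raise IOError("retrieval incomplete: got only %d out of %d bytes"
                      % (read, expected_size))
    return read


# A source that ends after 50 bytes although 100 were announced.
try:
    copy_checked(io.BytesIO(b"x" * 50), io.BytesIO(), expected_size=100)
except IOError as exc:
    print(exc)
```

The check happens only after the read loop ends, because an empty
read() looks the same whether the server finished or dropped the
connection.
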
----------------------------------------------------------------------

Comment By: Irmen de Jong (irmen)
Date: 2004-11-07 21:17

Message:
Logged In: YES 
user_id=129426

A patch is at 1062060 (it raises IOError when the download is
incomplete).

----------------------------------------------------------------------

Comment By: Irmen de Jong (irmen)
Date: 2004-11-07 20:47

Message:
Logged In: YES 
user_id=129426

Confirmed here (Mandrakelinux 10.0, Python 2.4b2).
However, I doubt it is a problem in urllib.urlretrieve,
because I tried downloading the file with wget, and got the
following:

[irmen@isengard tmp]$ wget -S http://cvs.sourceforge.net/cvstarballs/boost-cvsroot.tar.bz2
--20:38:11--  http://cvs.sourceforge.net/cvstarballs/boost-cvsroot.tar.bz2
           => `boost-cvsroot.tar.bz2.1'
Resolving cvs.sourceforge.net... 66.35.250.207
Connecting to cvs.sourceforge.net[66.35.250.207]:80... connected.
HTTP request sent, awaiting response...
 1 HTTP/1.1 200 OK
 2 Date: Sun, 07 Nov 2004 19:38:15 GMT
 3 Server: Apache/2.0.40 (Red Hat Linux)
 4 Last-Modified: Sat, 06 Nov 2004 15:11:39 GMT
 5 ETag: "b63d5b-25c3808-687d80c0"
 6 Accept-Ranges: bytes
 7 Content-Length: 39598088
 8 Content-Type: application/x-bzip2
 9 Connection: close

31% [=======================>                ] 12,665,616    60.78K/s    ETA 03:55

20:40:07 (111.60 KB/s) - Connection closed at byte 12665616. Retrying.

--20:40:08--  http://cvs.sourceforge.net/cvstarballs/boost-cvsroot.tar.bz2
  (try: 2) => `boost-cvsroot.tar.bz2.1'
Connecting to cvs.sourceforge.net[66.35.250.207]:80... connected.
HTTP request sent, awaiting response...

....... so the remote server just closed the connection
halfway through! I suspect that a successful download is sheer
luck.

Also, the download loop in urllib looks fine to me. It only
stops when the read() returns an empty result, and that
means EOF. 

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-08-26 22:04

Message:
Logged In: YES 
user_id=80475

Followed the same procedure (no chdir, add a hook) but
bombed out at 9.1Mb:

 . . .
(1117, 8192, 34520156)
('boost-cvsroot.tar.bz2', <httplib.HTTPMessage instance at 0x00B1E4B8>)

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-08-26 20:52

Message:
Logged In: YES 
user_id=31435

Hmm.  I don't know anything about this, but thought I'd just 
try it.  Didn't chdir(), did add a reporthook:

def hook(*args):
    print args

WinXP Pro SP1, current CVS Python, cable modem over a 
wireless router.  Output looked like this:

(0, 8192, 34520156)
(1, 8192, 34520156)
(2, 8192, 34520156)
...
(4213, 8192, 34520156)
(4214, 8192, 34520156)
(4215, 8192, 34520156)

Had the whole file when it ended:

> wc boost-cvsroot.tar.bz2
 125368  765656 34520156 boost-cvsroot.tar.bz2

*Maybe* adding the reporthook changed timing in some 
crucial way.  Don't know.
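
For reading the hook output: the tuples printed above are (block
count, block size, total size), so block count times block size gives
an estimate of the bytes transferred so far. It is an upper bound,
since an individual read may return fewer than block size bytes. The
sketch below is illustrative only; the name hook_progress is
hypothetical, and it assumes a total size of -1 or 0 means "unknown".

```python
def hook_progress(blocknum, blocksize, totalsize):
    """Turn reporthook arguments into (bytes_estimate, percent).

    percent is None when the total size is unknown (<= 0).
    """
    transferred = blocknum * blocksize
    if totalsize <= 0:
        return transferred, None
    # The last block is usually partial, so clamp to the total.
    transferred = min(transferred, totalsize)
    return transferred, 100.0 * transferred / totalsize


# Final tuple from the complete download above.
print(hook_progress(4215, 8192, 34520156))
# Final tuple from the run that stopped at about 9.1 MB.
print(hook_progress(1117, 8192, 34520156))
```

For the truncated run, 1117 * 8192 = 9150464 bytes, a little over a
quarter of the 34520156 bytes announced.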

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-08-26 19:09

Message:
Logged In: YES 
user_id=80475

Confirmed.  On Py2.4 (current CVS), I got 12.7 Mb before the
connection closed.

----------------------------------------------------------------------
