[ python-Bugs-1682241 ] Problems with urllib2 read()

SourceForge.net noreply at sourceforge.net
Tue Mar 20 09:59:48 CET 2007


Bugs item #1682241, was opened at 2007-03-16 17:00
Message generated for change (Comment added) made by lucas_malor
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1682241&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Lucas Malor (lucas_malor)
Assigned to: Nobody/Anonymous (nobody)
Summary: Problems with urllib2 read()

Initial Comment:
urllib2 objects opened with urlopen() do not have the seek() method that file objects have, so reading only some bytes from an opened URL is practically impossible.

An example: I tried to open a URL and check whether it is a gzip file, reading it plainly if IOError is raised (to do this I applied the #1675951 patch: https://sourceforge.net/tracker/index.php?func=detail&aid=1675951&group_id=5470&atid=305470 )

But after I try to open the file as gzip, if it's not a gzip file the current position in the urllib object is past the second byte, so read() returns the data from the 3rd to the last byte. You can't check the header of the file before storing it on disk. Well, so what is urlopen() for? If I must store the file from the URL on disk and reload it, I can use urlretrieve() instead...
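The workaround discussed below in the thread can be sketched as follows: buffer the non-seekable response in an in-memory file object, peek at the gzip magic bytes, then rewind. This is a minimal sketch, not the patch attached to the bug; it uses the modern io.BytesIO (in Python 2.5, where this report was filed, the equivalent was StringIO.StringIO), and `response` stands in for the object returned by urlopen():

```python
import gzip
import io

def read_possibly_gzipped(response):
    """Read everything from a non-seekable file-like object,
    transparently decompressing it if it is gzip data.

    Since `response` cannot seek, buffer its bytes first.
    """
    buffered = io.BytesIO(response.read())  # seekable copy of the body
    # gzip streams begin with the two magic bytes 0x1f 0x8b
    if buffered.read(2) == b"\x1f\x8b":
        buffered.seek(0)  # rewind so GzipFile sees the whole stream
        return gzip.GzipFile(fileobj=buffered).read()
    buffered.seek(0)  # not gzip: rewind and return the raw bytes
    return buffered.read()
```

For example, `read_possibly_gzipped(io.BytesIO(gzip.compress(b"hello")))` and `read_possibly_gzipped(io.BytesIO(b"hello"))` both return `b"hello"`. Note the trade-off raised later in the thread: the whole body is held in memory, which is why urllib does not do this for every client.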

----------------------------------------------------------------------

>Comment By: Lucas Malor (lucas_malor)
Date: 2007-03-20 09:59

Message:
Logged In: YES 
user_id=1403274
Originator: YES

> If you need to seek, you can wrap the file-like object in a
> StringIO (which is what urllib would have to do internally
> [...] )

I think it's really a bug, or at least non-pythonic behaviour.
I use the method you describe, but it must be done manually,
and I don't see why. Without this "trick" you can't handle
URL and file objects interchangeably, since they don't work
the same way. I don't think it would be too complicated for
the urllib class to use an internal StringIO object when
seek() or other file-like methods are needed.

> You can check the type of the response content before you try
> to uncompress it via the Content-Encoding header of the
> response

That's not a generic solution, though.

(thanks anyway for suggested solutions :) )

----------------------------------------------------------------------

Comment By: Zacherates (maenpaa)
Date: 2007-03-20 03:43

Message:
Logged In: YES 
user_id=1421845
Originator: NO

I'd contend that this is not a bug:
 * If you need to seek, you can wrap the file-like object in a StringIO
(which is what urllib would have to do internally, thus incurring the
StringIO overhead for all clients, even those that don't need the
functionality).
 * You can check the type of the response content before you try to
uncompress it via the Content-Encoding header of the response.  The
meta-data is there for a reason.

Check
http://www.diveintopython.org/http_web_services/gzip_compression.html for a
rather complete treatment of your use-case.
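The second suggestion above, checking the response metadata instead of sniffing bytes, can be sketched like this. The header names are standard HTTP, but the `headers` mapping here is a stand-in for the object returned by a real response's info() method:

```python
def is_gzip_response(headers):
    """Decide from response metadata whether the body is gzip-compressed.

    `headers` is any mapping of HTTP header names to values, standing in
    for a real response's header object. Servers may signal gzip either
    via Content-Encoding (on-the-wire compression) or by serving a gzip
    file outright with a gzip Content-Type.
    """
    encoding = headers.get("Content-Encoding", "").lower()
    ctype = headers.get("Content-Type", "").lower()
    return encoding == "gzip" or ctype in ("application/gzip",
                                           "application/x-gzip")
```

As lucas_malor notes, this is not fully generic: it trusts the server's metadata, which a misconfigured server may omit or get wrong.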

----------------------------------------------------------------------



More information about the Python-bugs-list mailing list