Decompressing a file retrieved by URL seems too complex
Thomas Jollans
thomas at jollybox.de
Thu Aug 12 17:40:27 EDT 2010
On Thursday 12 August 2010, it occurred to John Nagle to exclaim:
> (Repost with better indentation)
Good, good.
>
> def readurl(url) :
>     if url.endswith(".gz") :
The file name could be anything. You should be checking the response
Content-Type header -- that's what it's for.
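For illustration, a minimal sketch of that check (the helper name is mine, and the exact header values depend on the server):

```python
def is_gzipped(headers):
    # Hypothetical helper: trust what the server says,
    # not how the URL happens to be spelled.
    ctype = headers.get("Content-Type", "")
    encoding = headers.get("Content-Encoding", "")
    return ctype in ("application/gzip", "application/x-gzip") or encoding == "gzip"
```

After urlopen you'd call it as is_gzipped(nd.headers); the response's headers object supports .get() in both urllib2 and the newer urllib.request.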
>         nd = urllib2.urlopen(url,timeout=TIMEOUTSECS)
>         td1 = tempfile.TemporaryFile() # compressed file
You can keep the whole thing in memory by using StringIO.
>         td1.write(nd.read()) # fetch and copy file
You're reading the entire file into memory anyway ;-)
>         nd.close() # done with network
>         td2 = tempfile.TemporaryFile() # decompressed file
Okay, maybe there is something missing from GzipFile -- but still you could use
StringIO again, I expect.
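Something like this (shown in Python 3 spelling, with io.BytesIO in place of StringIO; the round-trip is only there to make the sketch self-contained):

```python
import gzip
import io

# Pretend this is the body just fetched from the network.
compressed = gzip.compress(b"hello from the network\n")

# Wrap the in-memory bytes in a file-like object -- no temp files needed.
buf = io.BytesIO(compressed)
with gzip.GzipFile(fileobj=buf, mode="rb") as gd:
    data = gd.read()

print(data)  # b'hello from the network\n'
```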
> Nor is the output descriptor from gzip general; it fails
> on "readline", but accepts "read".
>>> from gzip import GzipFile
>>> GzipFile.readline
<unbound method GzipFile.readline>
>>> GzipFile.readlines
<unbound method GzipFile.readlines>
>>> GzipFile.__iter__
<unbound method GzipFile.__iter__>
>>>
What exactly is it that's failing, and how?
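At least here, readline and iteration both work fine on a GzipFile (Python 3 shown; an in-memory round-trip stands in for the downloaded data):

```python
import gzip
import io

buf = io.BytesIO(gzip.compress(b"line one\nline two\nline three\n"))
gd = gzip.GzipFile(fileobj=buf, mode="rb")

print(gd.readline())          # b'line one\n'
print([line for line in gd])  # [b'line two\n', b'line three\n']
```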
>         td1.seek(0) # rewind
>         gd = gzip.GzipFile(fileobj=td1, mode="rb") # wrap unzip
>         td2.write(gd.read()) # decompress file
>         td1.close() # done with compressed copy
>         td2.seek(0) # rewind
>         return(td2) # return file object for compressed object
>     else :
>         return(urllib2.urlopen(url,timeout=TIMEOUTSECS))
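For what it's worth, with both changes the whole thing can stay in memory. A sketch in Python 3 spelling (urllib.request and io.BytesIO in place of urllib2 and StringIO; body_as_file and the TIMEOUTSECS value are mine):

```python
import gzip
import io
import urllib.request

TIMEOUTSECS = 30  # placeholder; use whatever your code defines

def body_as_file(body, headers):
    # Decompress in memory if the headers say gzip; no temp files.
    if (headers.get("Content-Encoding") == "gzip"
            or headers.get("Content-Type") in ("application/gzip",
                                               "application/x-gzip")):
        return gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb")
    return io.BytesIO(body)

def readurl(url):
    nd = urllib.request.urlopen(url, timeout=TIMEOUTSECS)
    try:
        return body_as_file(nd.read(), nd.headers)
    finally:
        nd.close()  # done with network
```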