urllib2, https and gzipped files

Barry barrynyc at gmail.com
Sat Sep 19 23:34:19 EDT 2009


I'm trying to use urllib2 to download some gzipped files from an https
server, but I cannot correctly open the file. It happens to be an mbox
file -- a mailing list archive to be exact.

As soon as I call open, the file starts to be unzipped. Content-Length
seems to be read as the length of the first post in the archive;
exactly that amount of text is downloaded, and then the transfer stops.
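
One way to narrow this down would be to compare what the server claims
with what actually arrives. The following is only a minimal sketch, not
the code that produced the problem: the helper names (looks_gzipped,
gunzip_bytes, inspect_download) are hypothetical, and the import
fallback to urllib.request is an assumption so that the same snippet
also runs on Python 3.

```python
import gzip
import io

try:
    from urllib2 import urlopen          # Python 2, as used in this post
except ImportError:
    from urllib.request import urlopen   # assumed Python 3 equivalent

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def gunzip_bytes(data):
    """Decompress a complete gzip stream held in memory."""
    return gzip.GzipFile(fileobj=io.BytesIO(data)).read()


def inspect_download(url):
    """Fetch url and report the server's headers next to what was read."""
    response = urlopen(url)
    body = response.read()
    headers = response.info()
    return {
        "content_length": headers.get("Content-Length"),
        "content_encoding": headers.get("Content-Encoding"),
        "bytes_received": len(body),
        "looks_gzipped": looks_gzipped(body),
    }
```

Comparing content_length with bytes_received, and checking
looks_gzipped, would show whether the body arrives still compressed or
already decoded. If it is still compressed, gunzip_bytes can be applied
to the full body once it has been read.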

I can download the file manually in a browser, but not in any other
way. I couldn't find a solution by searching the web, but I tested wget
and curl, and both of them fail in much the same way as my Python code.
curl behaves exactly the same: it gets the first few thousand bytes as
text and stops. wget tries a second time and downloads the remaining
number of bytes to match the actual compressed file size, but the
second part just looks like random bytes.

The same code works on other sites with the same archive; the
difference is that those are http connections, not https.

Any ideas?

Barry



More information about the Python-list mailing list