[Tutor] getting binary file from website with custom header

Steven D'Aprano steve at pearwood.info
Sat Jan 29 06:23:27 CET 2011


A few more comments...

Alex Hall wrote:
> Hello,
> I am continuing to work on that api wrapper... I never realized how
> little I know about urllib/urllib2! The idea of downloading from the
> api is pretty easy: give it a url and a password and it gives you the
> book. Here is a quote from the api documentation:
> In addition the MD5 hash of the end user password must be passed in
> the request via a "X-password" HTTP header.

You might like to mention where this API comes from.


> Here is what I am doing. I use hashlib.md5(password).hexdigest() to
> get the md5 of the password. "base" is just the base url, and
> "destination" is just a local path. If it matters, this is an https
> url.

It may matter. urllib has some problems with https.

What makes you think you should use the *hex* digest of the password, 
rather than some other format?
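
To make the question concrete, here is a small sketch of three common ways to encode the same MD5 digest. The sample password is made up, and I'm not claiming any of these is what the API wants: the quoted documentation only says "MD5 hash", so the right encoding is something to confirm with the API's own docs.

```python
import base64
import hashlib

password = "secret"  # made-up sample value, not a real credential
md5 = hashlib.md5(password.encode("utf-8"))

hex_digest = md5.hexdigest()               # 32 hexadecimal characters
raw_digest = md5.digest()                  # 16 raw bytes
b64_digest = base64.b64encode(raw_digest)  # base64 of the raw bytes

print(hex_digest)
print(repr(raw_digest))
print(b64_digest)
```

All three represent the same hash, but a server comparing strings will only accept the one it expects.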


>  user=urllib.quote(user) #user is an email address, so make it useable in a url
>   req=urllib2.Request(base+"download/for/"+user+"/content/"+str(id),
> None, {"X-password":password})
>   try:
>    book=urllib2.urlopen(req)
>    local=open(destination+str(id), "w") #name the file

You should open binary files in binary mode. This may not matter, depending 
on your OS (it does on Windows, where text mode translates line endings and 
can corrupt the data), but it never hurts to use "rb" and "wb" even when it 
doesn't matter.
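
As a rough illustration of the binary-mode advice, the sketch below writes some bytes with "wb" and reads them back with "rb". The byte string and the file name are invented stand-ins for the downloaded book, nothing more.

```python
import os
import tempfile

# Made-up bytes standing in for the downloaded blob.
blob = b"\x89PNG\r\n\x1a\n\x00\xff"

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "book.bin")

out = open(path, "wb")  # "wb": write the bytes exactly as given
try:
    out.write(blob)
finally:
    out.close()

infile = open(path, "rb")  # "rb": read them back untranslated
try:
    data = infile.read()
finally:
    infile.close()

print(data == blob)
```

The sample deliberately includes "\r\n" and "\n" bytes, since those are the ones text mode may rewrite on some platforms.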

>    local.write(book.read()) #save the blob to the local file
>    local.close()
>   except urllib2.HTTPError, e:
>    print "HTTP error "+str(e.code)
>   except urllib2.URLError, e:
>    print "URL error: "+e.reason


There is absolutely no point in catching an exception, only to print it. 
You should only catch exceptions if you intended to *do something* other 
than print the error message which would have been printed anyway.

In this case, there is good useful information in the HTTP exception, 
but not in the URL error. I recommend you change your code to:

try:
     # urlopen is the call that raises HTTPError, so it goes inside the try
     book = urllib2.urlopen(req)
except urllib2.HTTPError, e:
     print "HTTP error:",
     print e.code  # 401 = unauthorized, 403 = forbidden, 404 = not found, etc.
     print e.msg  # this may give you a clue why the request was rejected
     # uncomment the next line if you need more info
     # print e.hdrs
else:
     local = open(destination+str(id), "wb") #name the file
     try:
          local.write(book.read()) #save the blob to the local file
     finally:
          local.close()

If any other exception, including URLError, happens, Python will 
automatically print the traceback, including the exception.

But other than these quibbles, the code looks fine to me.


> I keep getting an error 403, which the api defines as a bad login
> attempt. I am sure my password is right, though, so while I
> investigate, I thought I would check that I am not only going about
> this http header thing right but also getting the binary object right.
> I am following an example I found pretty closely.

Under the HTTP standard, status 403 means "Forbidden": the server 
understood the request but refuses to carry it out. This *strongly* 
suggests that either your username or password is wrong.

Or perhaps there are restrictions on how many times you can connect in a 
day, and you've exceeded it. Or your account has been closed. Or the 
website doesn't like the tool you are using to connect (Python). Or 
you've tried downloading too many files too quickly, and the webserver 
has locked you out.

My suggestion is:

* Double check, *triple* check, that your username and password
   are correct.

* Write out the URL by hand (you can use Python for calculating
   the MD5 sum, I'm not that cruel *grins*).

* Try using another commandline tool. If you're on Linux, you can
   use curl or wget:

   wget --header="X-password:<PASSWORD>" <URL>

   with <PASSWORD> and <URL> replaced by the correct values.

   curl is similar:

   curl -H "X-password: <PASSWORD>" -o <FILE> <URL>

   where <FILE> is the local file to save the download to.

* If wget works, great, go back to trying it from Python! If
   not, inspect the error messages it prints. Try changing the
   user-agent. Try setting the referer [sic] to the website's
   home page.
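
To go with the checklist above, here is a minimal sketch that builds the request and prints the URL and headers it would send, without contacting any server. Every value in it (the base URL, the email address, the password, the book id, the user-agent string) is a placeholder I made up, and the try/except import is only there so the same lines run on Python 2 and Python 3.

```python
import hashlib

try:  # Python 2 names, as used in this thread
    from urllib import quote
    from urllib2 import Request
except ImportError:  # Python 3 moved them
    from urllib.parse import quote
    from urllib.request import Request

# All values below are placeholders, not the real API's.
base = "https://api.example.com/"
user = "alex@example.com"
password = "secret"
book_id = 12345

digest = hashlib.md5(password.encode("utf-8")).hexdigest()
url = base + "download/for/" + quote(user) + "/content/" + str(book_id)

req = Request(url, None, {"X-password": digest})
req.add_header("User-Agent", "Mozilla/5.0")  # try not looking like Python
req.add_header("Referer", base)              # the website's home page

# Nothing has been sent yet; just show what would be.
print(req.get_full_url())
for name, value in sorted(req.header_items()):
    print("%s: %s" % (name, value))
```

Comparing this output against the URL you wrote out by hand, and against what you pass to wget, is a quick way to spot a quoting or header mistake before blaming the password.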


-- 
Steven

