[Tutor] getting binary file from website with custom header

Sat Jan 29 07:05:08 CET 2011

On 1/29/11, Steven D'Aprano <steve at pearwood.info> wrote:
> A few more comments...
>
> Alex Hall wrote:
>> Hello,
>> I am continuing to work on that api wrapper... I never realized how
>> little I know about urllib/urllib2! The idea of downloading from the
>> api is pretty easy: give it a url and a password and it gives you the
>> book. Here is a quote from the api documentation:
>> In addition the MD5 hash of the end user password must be passed in
>> the request via a "X-password" HTTP header.
>
> You might like to mention where this API comes from.
Sorry. http://api.bookshare.org.
>
>
>> Here is what I am doing. I use hashlib.md5(password).hexdigest() to
>> get the md5 of the password. "base" is just the base url, and
>> "destination" is just a local path. If it matters, this is an https
>> url.
>
> It may matter. urllib has some problems with https.
Wonderful... Time to find another package?
>
> What makes you think you should use the *hex* digest of the password,
> rather than some other format?
Honestly, it seemed the logical choice, and the api docs do not say
anything except to md5Sum() the password. I have tried it with and
without the hexdigest() and nothing changed. I will look to see what
else hashlib provides.
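
For comparison, here is a minimal sketch of the forms hashlib can hand back for the same MD5 hash. It is written in Python 3 syntax (the thread uses Python 2), the password is a made-up placeholder, and UTF-8 is an assumed encoding. The quoted api docs do not say which form the server expects, so this only shows what the candidates look like.

```python
import base64
import hashlib

password = "example-password"  # hypothetical placeholder, not a real credential

# Python 3 hashes bytes, so the text is encoded first (UTF-8 is an assumption).
md5 = hashlib.md5(password.encode("utf-8"))

raw = md5.digest()                                 # raw bytes
hex_form = md5.hexdigest()                         # lowercase hex text
b64_form = base64.b64encode(raw).decode("ascii")   # base64 text

# Lengths of the three forms: 16 bytes, 32 hex characters, 24 base64 characters.
print(len(raw), len(hex_form), len(b64_form))
```

Only the hex and base64 forms are plain text, which makes them the likelier candidates for an HTTP header value.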
>
>
>>   # user is an email address, so make it usable in a url
>>   user=urllib.quote(user)
>>   req=urllib2.Request(base+"download/for/"+user+"/content/"+str(id),
>>                       None, {"X-password":password})
>>   try:
>>    book=urllib2.urlopen(req)
>>    local=open(destination+str(id), "w") #name the file
>
> You should open binary files in binary. This may not matter, depending
> on your OS, but it never hurts to use "rb" and "wb" even when it doesn't
> matter.
Great point!
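
A small sketch of why the mode matters, in Python 3 syntax. On Windows, text mode translates "\n" into "\r\n" on write, which corrupts binary data; "wb" and "rb" leave the bytes alone. The byte string and file name below are made up for illustration and are not a real book.

```python
import os
import tempfile

blob = b"PK\x03\x04\r\n\x1a\n"  # made-up bytes that include CR and LF

path = os.path.join(tempfile.mkdtemp(), "book.bin")  # hypothetical file name

with open(path, "wb") as local:   # "wb": bytes are written untouched
    local.write(blob)

with open(path, "rb") as saved:   # "rb": bytes are read back untouched
    round_trip = saved.read()

print(round_trip == blob)
```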
>
>>    local.write(book.read()) #save the blob to the local file
>>    local.close()
>>   except urllib2.HTTPError, e:
>>    print "HTTP error "+str(e.code)
>>   except urllib2.URLError, e:
>>    print "URL error: "+str(e.reason)
>
>
> There is absolutely no point in catching an exception, only to print it.
True. Currently, I am trying to get this to work. Once it does I will
improve my error-handling code. Still, I suppose the traceback would
help even more...
> You should only catch exceptions if you intend to *do something* other
> than print the error message which would have been printed anyway.
>
> In this case, there is good useful information in the HTTP exception,
> but not in the URL error. I recommend you change your code to:
>
> try:
>      # urlopen() is what raises HTTPError, so it must be inside the try
>      book = urllib2.urlopen(req)
> except urllib2.HTTPError, e:
>      print "HTTP error:",
>      print e.code  # 403 = forbidden, 404 = not found, etc.
>      print e.msg  # this may give you a clue why the request was rejected
>      # uncomment the next line if you need more info
>      # print e.hdrs
> else:
>      local = open(destination+str(id), "wb") #name the file
>      try:
>          local.write(book.read()) #save the blob to the local file
>      finally:
>          local.close()
Makes sense.
>
> If any other exception, including URLError, happens, Python will
> automatically print the traceback, including the exception.
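
The same idea can be sketched in Python 3, where urllib2 was split into urllib.request and urllib.error. This is an illustration rather than the code from the thread: fetch() and fake_opener() are hypothetical names, the URL is a placeholder, and the fake opener stands in for the network so the example runs offline. Only HTTPError is caught; anything else, including URLError, is left to produce a traceback.

```python
import io
import urllib.error
import urllib.request


def fetch(req, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the server sent an HTTP error."""
    try:
        response = opener(req)  # HTTPError is raised here, not by read()
    except urllib.error.HTTPError as e:
        print("HTTP error:", e.code, e.msg)
        return None
    try:
        return response.read()
    finally:
        response.close()


def fake_opener(req):
    # Stand-in for the network so the sketch runs offline: always answers 403.
    raise urllib.error.HTTPError(req, 403, "Forbidden", {}, io.BytesIO(b""))


result = fetch("https://api.example.invalid/book/1", opener=fake_opener)
print(result)
```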
>
> But other than these quibbles, the code looks fine to me.
>
>
>> I keep getting an error 403, which the api defines as a bad login
>> attempt. I am sure my password is right, though, so while I
>> investigate, I thought I would check that I am not only going about
>> this http header thing right but also getting the binary object right.
>> I am following an example I found pretty closely.
>
> The HTTP standard is that error 403 is request forbidden. This
> *strongly* suggests that either your username or password is wrong.
Could this be due to the wrong encoding, as you mentioned above? What
about that urllib.quote(user) for an email address?
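
To see what quoting does to an email address, here is a short sketch in Python 3, where the function lives in urllib.parse. The address is hypothetical. Whether the server wants the encoded or the raw form in the URL path is not stated in the thread, so this only shows the transformation.

```python
from urllib.parse import quote, unquote

user = "reader@example.com"  # hypothetical address

quoted = quote(user)
print(quoted)            # reader%40example.com
print(unquote(quoted))   # reader@example.com
```

Only the "@" changes; letters, digits and "." are left as they are.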
>
> Or perhaps there are restrictions on how many times you can connect in a
> day, and you've exceeded it. Or your account has been closed. Or the
> website doesn't like the tool you are using to connect (Python). Or
> you've tried downloading too many files too quickly, and the webserver
> has locked you out.
I will change the useragent. The api says that each api key is limited
to three requests per second, no hourly or daily limits.
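
A sketch of sending both headers with one request, in Python 3 syntax. The URL, the digest and the user-agent string are all placeholders invented for illustration. Building the Request object does not contact the server, so the stored headers can be inspected before anything is sent.

```python
import urllib.request

url = "https://api.example.invalid/download"  # hypothetical URL
headers = {
    "X-password": "0123456789abcdef0123456789abcdef",  # placeholder digest
    "User-Agent": "my-book-client/0.1",                 # hypothetical agent string
}

req = urllib.request.Request(url, None, headers)

# Request stores header names capitalized ("User-agent"), so look them up that way.
print(req.get_header("X-password"))
print(req.get_header("User-agent"))
```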
>
> My suggestion is:
>
> * Double check, *triple* check, that your username and password
>    are correct.
I am as sure as I can be about the plaintext, but the encoding of the md5
and the urllib.quote() may be causing problems.
>
> * Write out the URL by hand (you can use Python for calculating
>    the MD5 sum, I'm not that cruel *grins*).
The url should be right. I am now at an error 500 instead of 403,
which is rather strange. I know 500=internal server error, but as far
as I know the api is not down.
>
> * Try using another commandline tool. If you're on Linux, you can
>    use curl or wget:
>
>    wget --header="X-password:<PASSWORD>" <URL>
>
>    with <PASSWORD> and <URL> replaced by the correct values.
>
>    curl will probably be similar.
Windows...
>
> * If wget works, great, go back to trying it from Python! If
>    not, inspect the error messages it prints. Try changing the
>    user-agent. Try setting the referer [sic] to the website's
>    home page.
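
The checklist above can be prepared with a few lines of Python 3 that print the URL and a ready-to-paste wget command, so the request can be compared by hand. Everything here is assumed for illustration: the base URL, address, id and password are placeholders, and the hex digest is just one possible form of the MD5.

```python
import hashlib
from urllib.parse import quote

base = "https://api.example.invalid/"  # hypothetical; stands in for "base"
user = "reader@example.com"            # hypothetical address
book_id = 12345                        # hypothetical id
password = "example-password"          # hypothetical placeholder

# Hex form of the MD5 (an assumption; the api docs only say "MD5 hash").
digest = hashlib.md5(password.encode("utf-8")).hexdigest()

# Same path layout as the Request call quoted earlier in the thread.
url = base + "download/for/" + quote(user) + "/content/" + str(book_id)

print(url)
print('wget --header="X-password:%s" "%s"' % (digest, url))
```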
>
>
> --
> Steven


-- 
Have a great day,
Alex (msg sent from GMail website)
mehgcap at gmail.com; http://www.facebook.com/mehgcap