problem using urllib2: \n

Wed Sep 24 04:31:57 EDT 2003

jjl at pobox.com (John J. Lee) wrote in message news:<874qz3cqs2.fsf at pobox.com>...
> bmiras at yahoo.com writes:
> 
> > I've got a problem using urllib2 to get a web page.
> > I'm going through a proxy that uses user/password authentication,
> > and I'm trying to get a page that asks for HTTP authentication.
> > I'm using Python 2.3.
> > 
> > Here is an example of the code I use:
> > 
> > import urllib2
> > #Proxy handler
> > proxy_handler = urllib2.ProxyHandler({"http" :
> > "http://proxyuser:proxypassword@myproxy:8050"})
> > 
> > #Site auth handler
> > 
> > site_auth_handler = urllib2.HTTPBasicAuthHandler()
> > site_auth_handler.add_password("This Realm", "www.mysite.com",
> >                                "siteuser", "sitepassword")
> > 
> > 
> > opener = urllib2.build_opener( site_auth_handler,
> > urllib2.HTTPRedirectHandler, urllib2.HTTPHandler , proxy_handler)
> > urllib2.install_opener(opener)
> 
> Looks OK (but I don't use a proxy, nor basic auth very often...).
> 
> Just as a BTW: you don't need to pass HTTPHandler or
> HTTPRedirectHandler in there: build_opener adds them whether you ask
> for them or not.
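
As a rough illustration of the point about default handlers, here is a sketch using urllib.request, the Python 3 descendant of urllib2 (the thread itself is about Python 2.3, so the module name is a translation, and the helper handler_names is made up for this example). It builds one opener with no arguments and one with the two handlers passed explicitly, then looks at the handler class names. No network access is involved.

```python
import urllib.request


def handler_names(opener):
    # Hypothetical helper: class names of the handlers the opener holds.
    return [type(h).__name__ for h in opener.handlers]


# No handlers passed: the default ones are installed anyway.
default_names = handler_names(urllib.request.build_opener())

# Passing HTTPRedirectHandler and HTTPHandler explicitly does not
# add a second copy of either one.
explicit_names = handler_names(
    urllib.request.build_opener(urllib.request.HTTPRedirectHandler,
                                urllib.request.HTTPHandler))

print(sorted(default_names))
print(explicit_names.count("HTTPHandler"),
      explicit_names.count("HTTPRedirectHandler"))
```

Passing the handlers explicitly is therefore harmless, just redundant.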
> 
> 
> > req = urllib2.Request('http://www.mysite.com/protectedpage')
> > page = urllib2.urlopen(req)
> > 
> > I got a 401 error.
> 
> So presumably your proxy is happy, but the site is not.  Could you
> test that theory by urlopen()ing a URL that *doesn't* require any
> authentication?  Just:
> 
> # ...your code up to install_opener goes here...
> print urllib2.urlopen("http://www.python.org/").read()
> 
> 
It works fine with a URL that doesn't require authentication.

> > Analyzing the request using 'strace' I can see the following request
> > sent to the proxy:
> > 
> > GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> > www.mysite.com\r\nUser-agent:
> > Python-urllib/2.0a1\r\nProxy-authorization: Basic
> > XXX\n\r\nAuthorization: Basic
> > YYY\n\r\n\r\n
> 
> (You probably didn't want to post your usernames and passwords to a
> public newsgroup.  They're reversibly encoded, so anyone can decode
> them.  I've replaced them with XXX and YYY in the quote above.)
> 
> 
> > As you can see, there is an additional \n sent to the server just after
> > the Proxy-authorization and the Authorization fields. I think that in
> > this case the web server gets only this part:
> > 
> > GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> > www.mysite.com\r\nUser-agent:
> > Python-urllib/2.0a1\r\nProxy-authorization: Basic
> > XXX\n\r\n
> > 
> > and so sends me back a 401 error, since I'm not authenticated for the
> > site.
> 
> Hmm.  That \n does seem likely to be wrong, but I'm not certain.
> 
> The urllib2 code appears to duplicate the code for base64 encoding for
> proxy basic authorization (in ProxyBasicAuthHandler and ProxyHandler),
> and the code differs between the two classes :-(.  [It looks like PBAH
> responds to 407, and ProxyHandler always sends Proxy-Authorization if
> it's in the proxy's URL.]  And in fact, only one of them does a
> .strip() on the base64 encoded string (they also differ in quoting).
> However, the Authorization: header appears to be generated only in one
> place (AbstractBasicAuthHandler.retry_http_basic_auth), which *does*
> strip, but you've got a \n there, too.  So, I don't understand where
> that \n is coming from.  I'd try sticking some print statements in
> there to find out what's going on.
> 

I made a copy/paste mistake:
there is no additional \n after the Authorization field,
but there is an additional \n after Proxy-Authorization.

I used HTTPBasicAuthHandler, since you said the code is different,
and it worked fine!
I think the conclusion is that the strip() call is missing from the
ProxyHandler code. Should I report it as a bug?
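
To make the trailing-newline behaviour concrete, here is a minimal sketch. It assumes Python 3, where the old base64.encodestring() is spelled base64.encodebytes(); the credentials are the placeholder values from the code earlier in this thread.

```python
import base64

# Placeholder credentials, as used in the example code above.
credentials = b"proxyuser:proxypassword"

# encodebytes() (encodestring() in Python 2) appends a newline
# after the encoded data.
encoded = base64.encodebytes(credentials)
print(repr(encoded))

# strip() removes that newline, so the value is safe to put in a header.
header_value = b"Basic " + encoded.strip()
print(repr(header_value))

# b64encode() never appends a newline, so it needs no strip() at all.
same_without_strip = base64.b64encode(credentials)
```

For short credentials like these the only newline is the trailing one, so strip() and b64encode() give the same result.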

> 
> > I had a look at urllib2.py. I think that base64.encodestring adds
> > a \n at the end of the string. This happens in the method
> > 'proxy_open':
> > 
> >     def proxy_open(self, req, proxy, type):
> >         orig_type = req.get_type()
> >         type, r_type = splittype(proxy)
> >         host, XXX = splithost(r_type)
> >         if '@' in host:
> >             user_pass, host = host.split('@', 1)
> >             if ':' in user_pass:
> >                 user, password = user_pass.split(':', 1)
> >                 user_pass = base64.encodestring('%s:%s' % (unquote(user),
> >                                                           unquote(password)))
> >                 req.add_header('Proxy-authorization', 'Basic ' + user_pass)
> >         host = unquote(host)
> >         req.set_proxy(host, type)
> >    ...
> > 
> > I think it should be:
> > 
> > user_pass = base64.encodestring('%s:%s' % (unquote(user),
> >                                            unquote(password))).split()
> 
> You mean strip, not split?
> 

Yes, strip, sorry.
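
The difference matters beyond the typo: split() returns a list, so the later 'Basic ' + user_pass concatenation would fail. A small sketch, assuming Python 3 (base64.encodebytes() in place of encodestring()) and the placeholder credentials from earlier in the thread:

```python
import base64

encoded = base64.encodebytes(b"proxyuser:proxypassword").decode("ascii")

stripped = encoded.strip()  # a str without the trailing newline
pieces = encoded.split()    # a list of whitespace-separated chunks

# Concatenating with the stripped string works as intended.
header = "Basic " + stripped

# Concatenating with the list raises TypeError (str + list).
try:
    "Basic " + pieces
    split_failed = False
except TypeError:
    split_failed = True

print(header, split_failed)
```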

> Try debugging a bit, find out what's really going on.  Just copy
> urllib2.py to your current directory (so it'll override the installed
> standard library's copy), and stick some print statements in there.
> 
> 
> > Do you have any other clue?
> [...]
> 
> You could try sniffing what Mozilla sends, too.
> 
I went one better and sent the request by hand: telnet myproxy 8050

GET http://www.mysite.com/protectedpage HTTP/1.0
Host: www.mysite.com
User-agent: Python-urllib/2.0a1
Proxy-authorization: Basic XXX
Authorization: Basic YYY

And it worked fine.
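
The hand-typed request differs from the one urllib2 sent only in the stray \n after the Proxy-authorization value. The sketch below shows why that one byte could hide the Authorization header. It is an assumption about how a line-oriented proxy or server reads headers, not a description of any particular one; read_header_block is a made-up helper, and XXX/YYY stand in for the real encoded credentials.

```python
import io


def read_header_block(stream):
    """Hypothetical reader: collect lines until the first blank line."""
    lines = []
    for raw in stream:
        line = raw.rstrip(b"\r\n")
        if not line:
            break  # a blank line ends the header block
        lines.append(line.decode("ascii"))
    return lines


# The request as typed by hand: every header ends with \r\n.
good = (b"GET http://www.mysite.com/protectedpage HTTP/1.0\r\n"
        b"Host: www.mysite.com\r\n"
        b"Proxy-authorization: Basic XXX\r\n"
        b"Authorization: Basic YYY\r\n"
        b"\r\n")

# The same request with the extra \n left in by the unstripped value.
bad = good.replace(b"Basic XXX\r\n", b"Basic XXX\n\r\n")

good_lines = read_header_block(io.BytesIO(good))
bad_lines = read_header_block(io.BytesIO(bad))

print(good_lines)
print(bad_lines)
```

With the extra \n, the following \r\n looks like the blank line that ends the headers, so under this assumption the reader never sees the Authorization line, which would explain the 401.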

> If you get this working, please look at the doc patch here
> 
> http://www.python.org/sf/798244
> 
> 
> test it, and post a comment to say whether or not it's correct (and
> which examples you tried -- preferably all of them ;).
> 
> 
> John