problem using urllib2: \n
bmiras at yahoo.com
Wed Sep 24 04:31:57 EDT 2003
jjl at pobox.com (John J. Lee) wrote in message news:<874qz3cqs2.fsf at pobox.com>...
> bmiras at yahoo.com writes:
>
> > I've got a problem using urllib2 to get a web page.
> > I'm going through a proxy using user/password authentication
> > and I'm trying to get a page that asks for HTTP authentication.
> > I'm using Python 2.3.
> >
> > Here is an example of the piece of code I use:
> >
> > import urllib2
> >
> > # Proxy handler
> > proxy_handler = urllib2.ProxyHandler(
> >     {"http": "http://proxyuser:proxypassword@myproxy:8050"})
> >
> > # Site auth handler
> > site_auth_handler = urllib2.HTTPBasicAuthHandler()
> > site_auth_handler.add_password("This Realm", "www.mysite.com",
> >                                "siteuser", "sitepassword")
> >
> > opener = urllib2.build_opener(site_auth_handler,
> >                               urllib2.HTTPRedirectHandler,
> >                               urllib2.HTTPHandler, proxy_handler)
> > urllib2.install_opener(opener)
>
> Looks OK (but I don't use a proxy, nor basic auth very often...).
>
> Just as a BTW: you don't need to pass HTTPHandler or
> HTTPRedirectHandler in there: build_opener adds them whether you ask
> for them or not.
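
A minimal sketch of that point, written against Python 3's urllib.request (the successor to urllib2) so it can be run today. It assumes the placeholder proxy URL and credentials from the post and makes no network calls; it only lists which handlers end up in the opener.

```python
import urllib.request  # Python 3 name for what the thread calls urllib2

# Placeholder proxy URL and credentials taken from the post.
proxy_handler = urllib.request.ProxyHandler(
    {"http": "http://proxyuser:proxypassword@myproxy:8050"}
)
site_auth_handler = urllib.request.HTTPBasicAuthHandler()
site_auth_handler.add_password(
    "This Realm", "www.mysite.com", "siteuser", "sitepassword"
)

# Only the two custom handlers are passed in explicitly.
opener = urllib.request.build_opener(proxy_handler, site_auth_handler)

# build_opener() fills in its default handlers on its own.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

The printed list should include HTTPHandler and HTTPRedirectHandler even though neither was passed, and only one ProxyHandler, because a handler you supply replaces the default of the same class.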
>
>
> > req = urllib2.Request('http://www.mysite.com/protectedpage')
> > page = urllib2.urlopen(req)
> >
> > I got a 401 error.
>
> So presumably your proxy is happy, but the site is not. Could you
> test that theory by urlopen()ing a URL that *doesn't* require any
> authentication? Just:
>
> # ...your code up to install_opener goes here...
> print urllib2.urlopen("http://www.python.org/").read()
>
>
It works fine with a URL that doesn't require authentication.
> > Analyzing the request using 'strace' I can see the following request
> > sent to the proxy:
> >
> > GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> > www.mysite.com\r\nUser-agent:
> > Python-urllib/2.0a1\r\nProxy-authorization: Basic
> > XXX\n\r\nAuthorization: Basic
> > YYY\n\r\n\r\n
>
> (You probably didn't want to post your usernames and passwords to a
> public newsgroup. They're reversibly encoded, so anyone can decode
> them. I've replaced them with XXX and YYY in the quote above.)
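
To illustrate why "reversibly encoded" matters, here is a small sketch in Python 3 showing that a Basic credential is plain base64 and decodes in one call. The helper name decode_basic_credentials is made up for this example, and the credentials are the placeholders from the post, not real ones.

```python
import base64


def decode_basic_credentials(header_value):
    """Return (user, password) from a value such as 'Basic dXNlcjpwYXNz'.

    Hypothetical helper for illustration only.
    """
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    decoded = base64.b64decode(token).decode("latin-1")
    user, _, password = decoded.partition(":")
    return user, password


# Build a header value from the placeholder credentials, then reverse it.
token = base64.b64encode(b"siteuser:sitepassword").decode("ascii")
recovered = decode_basic_credentials("Basic " + token)
print(recovered)
```

Anyone who sees the header can do the same, which is why captured requests should be scrubbed before posting.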
>
>
> > As you can see, there is an additional \n sent to the server just after
> > the Proxy-authorization and the Authorization fields. I think that in
> > this case the web server gets only this part:
> >
> > GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> > www.mysite.com\r\nUser-agent:
> > Python-urllib/2.0a1\r\nProxy-authorization: Basic
> > XXX\n\r\n
> >
> > and so sends me back a 401 error, since I'm not authenticated for the
> > site.
>
> Hmm. That \n does seem likely to be wrong, but I'm not certain.
>
> The urllib2 code appears to duplicate the code for base64 encoding for
> proxy basic authorization (in ProxyBasicAuthHandler and ProxyHandler),
> and the code differs between the two classes :-(. [It looks like PBAH
> responds to 407, and ProxyHandler always sends Proxy-Authorization if
> it's in the proxy's URL.] And in fact, only one of them does a
> .strip() on the base64 encoded string (they also differ in quoting).
> However, the Authorization: header appears to be generated only in one
> place (AbstractBasicAuthHandler.retry_http_basic_auth), which *does*
> strip, but you've got a \n there, too. So, I don't understand where
> that \n is coming from. I'd try sticking some print statements in
> there to find out what's going on.
>
I made a copy/paste mistake: there is no additional \n after the
Authorization field, but there is an additional \n after
Proxy-Authorization.
I used HTTPBasicAuthHandler since you said the code is different,
and it worked fine!
So I think the conclusion is that the strip() call is missing from the
ProxyHandler code. Should I report it as a bug?
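
A minimal sketch of the suspected problem, assuming Python 3, where base64.encodebytes() is the current spelling of the old base64.encodestring(). The credentials are the placeholders from the post. It only builds header strings and sends nothing.

```python
import base64

credentials = b"proxyuser:proxypassword"  # placeholders from the post

# encodebytes() is meant for MIME bodies, so it ends its output with "\n".
raw = base64.encodebytes(credentials).decode("ascii")
clean = raw.strip()

# Without strip(), a bare "\n" lands inside the header, just before the CRLF.
broken_header = "Proxy-authorization: Basic " + raw + "\r\n"
fixed_header = "Proxy-authorization: Basic " + clean + "\r\n"

print(repr(broken_header))
print(repr(fixed_header))
```

Note that strip() only removes the trailing newline. For credentials longer than 57 bytes, encodebytes() also inserts newlines in the middle of the output, so base64.b64encode(), which never adds line breaks, is the safer call for building headers.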
>
> > I had a look at urllib2.py. I think that base64.encodestring adds
> > a \n at the end of the string. That's the case in the method
> > 'proxy_open':
> >
> > def proxy_open(self, req, proxy, type):
> >     orig_type = req.get_type()
> >     type, r_type = splittype(proxy)
> >     host, XXX = splithost(r_type)
> >     if '@' in host:
> >         user_pass, host = host.split('@', 1)
> >         if ':' in user_pass:
> >             user, password = user_pass.split(':', 1)
> >             user_pass = base64.encodestring('%s:%s' %
> >                                             (unquote(user),
> >                                              unquote(password)))
> >             req.add_header('Proxy-authorization', 'Basic ' +
> >                            user_pass)
> >     host = unquote(host)
> >     req.set_proxy(host, type)
> >     ...
> >
> > I think it should be:
> >
> > user_pass = base64.encodestring('%s:%s' % (unquote(user),
> > unquote(password))).split()
>
> You mean strip, not split?
>
Yes, strip, sorry.
> Try debugging a bit, find out what's really going on. Just copy
> urllib2.py to your current directory (so it'll override the installed
> standard library's copy), and stick some print statements in there.
>
>
> > have you any other clue?
> [...]
>
> You could try sniffing what Mozilla sends, too.
>
I did one better and typed the request by hand:
telnet myproxy 8050
GET http://www.mysite.com/protectedpage HTTP/1.0
Host: www.mysite.com
User-agent: Python-urllib/2.0a1
Proxy-authorization: Basic XXX
Authorization: Basic YYY
And it worked fine.
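
The hand-typed request can also be assembled in code to check its line endings. This is only a sketch in Python 3: build_proxy_request and basic_token are hypothetical helper names, the host and credentials are the placeholders from the post, and nothing is sent over the network.

```python
import base64


def basic_token(user, password):
    # b64encode() never appends a newline, unlike encodestring()/encodebytes().
    return base64.b64encode(f"{user}:{password}".encode("latin-1")).decode("ascii")


def build_proxy_request(url, host, proxy_auth, site_auth):
    """Build the raw bytes of the request shown in the telnet session."""
    lines = [
        f"GET {url} HTTP/1.0",
        f"Host: {host}",
        "User-agent: Python-urllib/2.0a1",
        f"Proxy-authorization: Basic {basic_token(*proxy_auth)}",
        f"Authorization: Basic {basic_token(*site_auth)}",
    ]
    # Every line ends with CRLF; one empty line closes the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")


request = build_proxy_request(
    "http://www.mysite.com/protectedpage",
    "www.mysite.com",
    ("proxyuser", "proxypassword"),
    ("siteuser", "sitepassword"),
)
print(request.decode("latin-1"))
```

In this form both credential headers sit before the single blank line that ends the headers, which matches the request that worked over telnet.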
> If you get this working, please look at the doc patch here
>
> http://www.python.org/sf/798244
>
>
> test it, and post a comment to say whether or not it's correct (and
> which examples you tried -- preferably all of them ;).
>
>
> John