problem using urllib2: \n

Tue Sep 23 14:18:21 EDT 2003

bmiras at yahoo.com writes:

> I've got a problem using urllib2 to get a web page.
> I'm going through a proxy using user/password authentication
> and I'm trying to get a page that asks for HTTP authentication.
> I'm using Python 2.3.
> 
> Here is an example of the code I use:
> 
> import urllib2
> #Proxy handler
> proxy_handler = urllib2.ProxyHandler({"http" :
> "http://proxyuser:proxypassword@myproxy:8050"})
> 
> #Site auth handler
> 
> site_auth_handler = urllib2.HTTPBasicAuthHandler();
> site_auth_handler.add_password( "This Realm", "www.mysite.com",
> "siteuser", "sitepassword" );
> 
> 
> opener = urllib2.build_opener( site_auth_handler,
> urllib2.HTTPRedirectHandler, urllib2.HTTPHandler , proxy_handler)
> urllib2.install_opener(opener)

Looks OK (but I don't use a proxy, nor basic auth very often...).

Just as a BTW: you don't need to pass HTTPHandler or
HTTPRedirectHandler in there: build_opener adds them whether you ask
for them or not.
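
A quick way to check that claim, as a minimal sketch only: it uses the
Python 3 spelling (urllib.request in place of urllib2), which is an
assumption on my part about what you have available, and it reuses the
placeholder proxy and site credentials from your post. Nothing is
fetched; it just lists the handlers the opener ends up with.

```python
import urllib.request

# Placeholder proxy URL and credentials, copied from the post above.
proxy_handler = urllib.request.ProxyHandler(
    {"http": "http://proxyuser:proxypassword@myproxy:8050"})

site_auth_handler = urllib.request.HTTPBasicAuthHandler()
site_auth_handler.add_password("This Realm", "www.mysite.com",
                               "siteuser", "sitepassword")

# Only the two non-default handlers are passed in.
opener = urllib.request.build_opener(proxy_handler, site_auth_handler)

# The default HTTP and redirect handlers should show up anyway.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

If HTTPHandler and HTTPRedirectHandler appear in the printed list, you
can drop them from your build_opener call.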

> req = urllib2.Request('http://www.mysite.com/protectedpage')
> page = urllib2.urlopen(req)
> 
> I got a 401 error.

So presumably your proxy is happy, but the site is not.  Could you
test that theory by urlopen()ing a URL that *doesn't* require any
authentication?  Just:

# ...your code up to install_opener goes here...
print urllib2.urlopen("http://www.python.org/").read()

> Analyzing the request using 'strace' I can see the following request
> sent to the proxy:
> 
> GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> www.mysite.com\r\nUser-agent:
> Python-urllib/2.0a1\r\nProxy-authorization: Basic
> XXX\n\r\nAuthorization: Basic
> YYY\n\r\n\r\n

(You probably didn't want to post your usernames and passwords to a
public newsgroup.  They're reversibly encoded, so anyone can decode
them.  I've replaced them with XXX and YYY in the quote above.)
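
To show what "reversibly encoded" means, here is a small sketch in
Python 3 (an assumption; the thread itself is about Python 2.3). It
builds a Basic header value from the placeholder credentials in the
post, not from the real ones, and then recovers them with a plain
base64 decode.

```python
import base64

# Build a header value the way Basic auth does: base64 of "user:password".
# "siteuser" / "sitepassword" are the placeholders from the post.
header_value = "Basic " + base64.b64encode(b"siteuser:sitepassword").decode("ascii")

# Anyone who sees the header can undo the encoding; no key is involved.
scheme, token = header_value.split(" ", 1)
user, password = base64.b64decode(token).decode("ascii").split(":", 1)
print(user, password)
```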

> As you can see, there is an additional \n sent to the server just after
> the Proxy-authorization and the Authorization fields. I think that in
> this case the web server gets only this part:
> 
> GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> www.mysite.com\r\nUser-agent:
> Python-urllib/2.0a1\r\nProxy-authorization: Basic
> XXX\n\r\n
> 
> and so sends me back a 401 error, since I'm not authenticated for the
> site.

Hmm.  That \n does seem likely to be wrong, but I'm not certain.
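
For what it's worth, here is a sketch of how your theory would play out
with a lenient, line-oriented parser. The parse_header_block function
is hypothetical, written only for this illustration; I don't know how
your proxy or the site's server actually splits lines, and a stricter
parser may behave differently. The request text is the one from your
strace output, with XXX and YYY standing in for the credentials.

```python
def parse_header_block(raw):
    """Hypothetical lenient parser: every '\\n' ends a line, and the
    first blank line ends the header block."""
    headers = {}
    for line in raw.split("\n")[1:]:      # skip the request line
        line = line.rstrip("\r")
        if not line:                      # blank line: headers are over
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

raw = ("GET http://www.mysite.com/protectedpage HTTP/1.0\r\n"
       "Host: www.mysite.com\r\n"
       "User-agent: Python-urllib/2.0a1\r\n"
       "Proxy-authorization: Basic XXX\n\r\n"
       "Authorization: Basic YYY\n\r\n\r\n")

seen = parse_header_block(raw)
print(sorted(seen))
```

Under that assumption the "\n\r\n" after the proxy credentials reads as
an empty line, so the Authorization header never makes it into the
parsed set.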

The urllib2 code appears to duplicate the code for base64 encoding for
proxy basic authorization (in ProxyBasicAuthHandler and ProxyHandler),
and the code differs between the two classes :-(.  [It looks like PBAH
responds to 407, and ProxyHandler always sends Proxy-Authorization if
it's in the proxy's URL.]  And in fact, only one of them does a
.strip() on the base64 encoded string (they also differ in quoting).
However, the Authorization: header appears to be generated only in one
place (AbstractBasicAuthHandler.retry_http_basic_auth), which *does*
strip, but you've got a \n there, too.  So, I don't understand where
that \n is coming from.  I'd try sticking some print statements in
there to find out what's going on.

> I had a look at urllib2.py. I think that base64.encodestring adds
> a \n at the end of the string. That is the case in the method
> 'proxy_open':
> 
>     def proxy_open(self, req, proxy, type):
>         orig_type = req.get_type()
>         type, r_type = splittype(proxy)
>         host, XXX = splithost(r_type)
>         if '@' in host:
>             user_pass, host = host.split('@', 1)
>             if ':' in user_pass:
>                 user, password = user_pass.split(':', 1)
>                 user_pass = base64.encodestring('%s:%s' % (unquote(user),
>                                                           unquote(password)))
>                 req.add_header('Proxy-authorization', 'Basic ' + user_pass)
>         host = unquote(host)
>         req.set_proxy(host, type)
>    ...
> 
> I think it should be:
> 
> user_pass = base64.encodestring('%s:%s' % (unquote(user),
>                                            unquote(password))).split()

You mean strip, not split?
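
The difference matters, as this sketch suggests. It assumes Python 3,
where base64.encodestring is spelled base64.encodebytes and works on
bytes; the credentials are the placeholders from your post.

```python
import base64

raw = b"proxyuser:proxypassword"

# encodebytes (the old encodestring) appends a newline to its output.
encoded = base64.encodebytes(raw)
print(repr(encoded))

# strip() removes the trailing newline and still returns a bytes string.
stripped = encoded.strip()
header = b"Basic " + stripped

# split() returns a list, which cannot be concatenated onto the prefix.
pieces = encoded.split()
split_error = None
try:
    b"Basic " + pieces
except TypeError as exc:
    split_error = exc
print(type(pieces).__name__, split_error is not None)
```

So .split() would trade the stray newline for a TypeError when the
header is built, while .strip() gives the value you want.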

Try debugging a bit, find out what's really going on.  Just copy
urllib2.py to your current directory (so it'll override the installed
standard library's copy), and stick some print statements in there.

> have you any other clue?
[...]

You could try sniffing what Mozilla sends, too.

If you get this working, please look at the doc patch here

http://www.python.org/sf/798244

test it, and post a comment to say whether or not it's correct (and
which examples you tried -- preferably all of them ;).

John