urllib2 - 403 that _should_ not occur.
Philip Semanchuk
philip at semanchuk.com
Sun Jan 11 22:25:38 EST 2009
On Jan 11, 2009, at 10:05 PM, James Mills wrote:
> On Mon, Jan 12, 2009 at 12:58 PM, Philip Semanchuk <philip at semanchuk.com
> > wrote:
>>
>> On Jan 11, 2009, at 8:59 PM, James Mills wrote:
>>
>>> Hey all,
>>>
>>> The following fails for me:
>>>
>>>>>> from urllib2 import urlopen
>>>>>> f = urlopen("http://groups.google.com/group/chromium-announce/feed/rss_v2_0_msgs.xml")
>>>
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in <module>
>>> File "/usr/lib/python2.6/urllib2.py", line 124, in urlopen
>>> return _opener.open(url, data, timeout)
>>> File "/usr/lib/python2.6/urllib2.py", line 389, in open
>>> response = meth(req, response)
>>> File "/usr/lib/python2.6/urllib2.py", line 502, in http_response
>>> 'http', request, response, code, msg, hdrs)
>>> File "/usr/lib/python2.6/urllib2.py", line 427, in error
>>> return self._call_chain(*args)
>>> File "/usr/lib/python2.6/urllib2.py", line 361, in _call_chain
>>> result = func(*args)
>>> File "/usr/lib/python2.6/urllib2.py", line 510, in http_error_default
>>> raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
>>> urllib2.HTTPError: HTTP Error 403: Forbidden
>>>>>>
>>>
>>> However, that _same_ url works perfectly fine on the
>>> same machine (and same network) using any of:
>>> * curl
>>> * wget
>>> * elinks
>>> * firefox
>>>
>>> Any helpful ideas?
>>
>> The remote server doesn't like your user agent?
>>
>> It'd be easier to help if you post a working sample.
>
> That was a working sample!
Oooops, I guess it is my brain that's not working, then! Sorry about
that.
I tried your sample and got the 403. This works for me:
>>> import urllib2
>>> user_agent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
>>> url = "http://groups.google.com/group/chromium-announce/feed/rss_v2_0_msgs.xml"
>>> req = urllib2.Request(url, None, { 'User-Agent' : user_agent})
>>> f = urllib2.urlopen(req)
>>> s=f.read()
>>> f.close()
>>> print s
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
<channel>
<title>Chromium-Announce Google Group</title>
<link>http://groups.google.com/group/chromium-announce</link>
<description>This list is intended for important product
announcements that affect the majority of
etc.
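For what it's worth, the User-Agent trick can be sanity-checked without any
network traffic at all, since the header is attached to the Request object
before open() ever runs. A minimal sketch (written against Python 3's
urllib.request, where urllib2's Request ended up, so the names differ slightly
from the 2.6 session above):

```python
import urllib.request  # urllib2's Request/urlopen live here in Python 3

user_agent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
url = "http://groups.google.com/group/chromium-announce/feed/rss_v2_0_msgs.xml"

# Passing a headers dict to Request stores the UA on the request itself;
# no connection is opened until urlopen(req) is actually called.
req = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent})

# Inspect what would be sent, offline (Request capitalizes header names
# internally, so the stored key comes back as "User-agent"):
print(req.get_full_url())
print(req.header_items())
```

So you can confirm exactly which UA string the server will see before burning
a request on it.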
> Why Google would deny access to services by
> unknown User Agents is beyond me - especially
> since in most cases User Agents strings are not
> strict.
Some sites ban UAs that look like bots. I know there's a Java-based
bot with a distinct UA that was really badly behaved when visiting my
server: it ignored robots.txt, fetched pages as quickly as it could, etc.
That was worthy of banning. FWIW, when I try the code above with a UA
of "funny fish" it still works OK, so it looks like the
groups.google.com server has it out for UAs with "Python" in them, not
just unknown ones.
I'm sure that if you changed wget's UA string to something Pythonic it
would start to fail too.
Cheers
Philip