urllib2 - 403 that _should_ not occur.
Philip Semanchuk
philip at semanchuk.com
Sun Jan 11 22:25:38 EST 2009
On Jan 11, 2009, at 10:05 PM, James Mills wrote:
> On Mon, Jan 12, 2009 at 12:58 PM, Philip Semanchuk <philip at semanchuk.com
> > wrote:
>>
>> On Jan 11, 2009, at 8:59 PM, James Mills wrote:
>>
>>> Hey all,
>>>
>>> The following fails for me:
>>>
>>>>>> from urllib2 import urlopen
>>>>>> f = urlopen("http://groups.google.com/group/chromium-announce/feed/rss_v2_0_msgs.xml")
>>>
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in <module>
>>> File "/usr/lib/python2.6/urllib2.py", line 124, in urlopen
>>> return _opener.open(url, data, timeout)
>>> File "/usr/lib/python2.6/urllib2.py", line 389, in open
>>> response = meth(req, response)
>>> File "/usr/lib/python2.6/urllib2.py", line 502, in http_response
>>> 'http', request, response, code, msg, hdrs)
>>> File "/usr/lib/python2.6/urllib2.py", line 427, in error
>>> return self._call_chain(*args)
>>> File "/usr/lib/python2.6/urllib2.py", line 361, in _call_chain
>>> result = func(*args)
>>> File "/usr/lib/python2.6/urllib2.py", line 510, in http_error_default
>>> raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
>>> urllib2.HTTPError: HTTP Error 403: Forbidden
>>>>>>
>>>
>>> However, that _same_ url works perfectly fine on the
>>> same machine (and same network) using any of:
>>> * curl
>>> * wget
>>> * elinks
>>> * firefox
>>>
>>> Any helpful ideas?
>>
>> The remote server doesn't like your user agent?
>>
>> It'd be easier to help if you post a working sample.
>
> That was a working sample!
Oooops, I guess it is my brain that's not working, then! Sorry about
that.
I tried your sample and got the 403. This works for me:
>>> import urllib2
>>> user_agent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
>>> url = "http://groups.google.com/group/chromium-announce/feed/rss_v2_0_msgs.xml"
>>> req = urllib2.Request(url, None, { 'User-Agent' : user_agent})
>>> f = urllib2.urlopen(req)
>>> s=f.read()
>>> f.close()
>>> print s
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
<channel>
<title>Chromium-Announce Google Group</title>
<link>http://groups.google.com/group/chromium-announce</link>
<description>This list is intended for important product
announcements that affect the majority of
etc.
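For what it's worth, the User-Agent trick can be sanity-checked without any
network traffic at all, since the header is attached to the Request object
before open() ever runs. A minimal sketch (written against Python 3's
urllib.request, where urllib2's Request ended up, so the names differ slightly
from the 2.6 session above):

```python
import urllib.request  # urllib2's Request/urlopen live here in Python 3

user_agent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
url = "http://groups.google.com/group/chromium-announce/feed/rss_v2_0_msgs.xml"

# Passing a headers dict to Request stores the UA on the request itself;
# no connection is opened until urlopen(req) is actually called.
req = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent})

# Inspect what would be sent, offline (Request capitalizes header names
# internally, so the stored key comes back as "User-agent"):
print(req.get_full_url())
print(req.header_items())
```

So you can confirm exactly which UA string the server will see before burning
a request on it.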
> Why Google would deny access to services by
> unknown User Agents is beyond me - especially
> since in most cases User Agents strings are not
> strict.
Some sites ban UAs that look like bots. I know there's a Java-based
bot with a distinct UA that was really badly behaved when visiting my
server: it ignored robots.txt, fetched pages as quickly as it could, etc.
That was worthy of banning. FWIW, when I try the code above with a UA
of "funny fish" it still works OK, so it looks like the
groups.google.com server has it out for UAs with "Python" in them, not
just unknown ones.
I'm sure that if you changed wget's UA string to something Pythonic it
would start to fail too.
Cheers
Philip