urlopen returns forbidden
clp2 at rebertia.com
Mon Feb 28 07:19:18 CET 2011
On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw <monkey at joemoney.net> wrote:
> I have a working urlopen routine which opens
> a url, parses it for <a> tags and prints out
> the links in the page. On some sites, wikipedia for
> instance, I get an HTTP 403 (Forbidden) error.
> What is the difference in accessing the site through a web browser
> and opening/reading the URL with python urllib2.urlopen?
The User-Agent header (http://en.wikipedia.org/wiki/User_agent ).
"By default, the URLopener class sends a User-Agent header of
urllib/VVV, where VVV is the urllib version number."
Some sites block obvious non-search-engine bots based on their HTTP
User-Agent header value.
You can override the urllib default by sending your own User-Agent
header with the request.
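A minimal sketch of setting a custom User-Agent, written for Python 3's
urllib.request (in Python 2, urllib2.Request accepts the same headers
argument). The USER_AGENT string, the example URL, and the helper names
build_request and fetch are assumptions made up for illustration, not
anything from the original post:

```python
import urllib.request

# Hypothetical values for illustration; substitute your own.
URL = "http://en.wikipedia.org/wiki/User_agent"
USER_AGENT = "MyLinkLister/0.1 (contact: you@example.com)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Open the URL with the custom header and return the body as bytes."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()


if __name__ == "__main__":
    # Performs a real network request when run as a script.
    html = fetch(URL)
    print(len(html))
```

A descriptive User-Agent that identifies your script is generally a
better choice than impersonating a browser, since it lets the site
operator tell your traffic apart from anonymous bots.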
Sidenote: Wikipedia has a proper API for programmatic browsing, which is
likely why it's blocking your program.
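As a rough sketch of what using that API could look like for listing a
page's links: the endpoint and the query parameters below follow the
MediaWiki Action API as I understand it, so check the API documentation
before relying on them. The helper name build_links_query is
hypothetical, and the code only builds the request URL:

```python
import urllib.parse

# Assumed endpoint of the MediaWiki Action API for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_links_query(title):
    """Build a URL asking the API for the wiki links on one page."""
    params = {
        "action": "query",   # run a query
        "prop": "links",     # ask for the links the page contains
        "titles": title,     # the page to inspect
        "pllimit": "max",    # as many links per response as allowed
        "format": "json",    # machine-readable output
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


if __name__ == "__main__":
    print(build_links_query("User agent"))
```

The resulting URL can be opened with urlopen like any other; a custom
User-Agent header is still advisable for API requests.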