urllib behaves strangely
John J. Lee
jjlee at reportlab.com
Tue Jun 13 00:53:56 CEST 2006
Duncan Booth <duncan.booth at invalid.invalid> writes:
> Gabriel Zachmann wrote:
> > Here is a very simple Python script utilizing urllib:
> > import urllib
> > url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> > print url
> > print
> > file = urllib.urlopen( url )
> > However, when i execute it, i get an html error ("access denied").
> > On the one hand, the funny thing though is that i can view the page
> > fine in my browser, and i can download it fine using curl.
> > On the other hand, it must have something to do with the URL because
> > urllib works fine with any other URL i have tried ...
> It looks like wikipedia checks the User-Agent header and refuses to send
> pages to browsers it doesn't like. Try:
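[The snippet that followed "Try:" is not preserved in this copy. As a rough sketch of the idea only, not Duncan's original code: the request can carry an explicit User-Agent header. This uses Python 3's urllib.request; in the Python 2 of this thread, urllib2.Request takes the same headers argument. The names build_request, fetch and USER_AGENT, and the header value itself, are hypothetical and chosen for illustration.]

```python
import urllib.request

URL = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"

# Hypothetical identifying string (an assumption, not from the thread);
# substitute your own script name and contact details.
USER_AGENT = "example-fetch-script/0.1 (contact: you@example.invalid)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Fetch url and return the response body as bytes (performs network I/O)."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

[Calling fetch(URL) would perform the actual download. Whether the server accepts a given User-Agent string is up to the site, so this may or may not avoid the "access denied" response.]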
If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies