Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?
filip.salomonsson at gmail.com
Tue Oct 2 16:10:02 CEST 2007
On 02/10/2007, John Nagle <nagle at animats.com> wrote:
> But there's something in there now that robotparser doesn't like.
> Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats that as a complete ban — equivalent to a
robots.txt containing "User-agent: *" followed by "Disallow: /".
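You can see the parsing side of this locally, without hitting Wikipedia at all, by feeding rules to robotparser yourself (the crawler name and paths below are just illustrative):

```python
import urllib.robotparser

# Simulate a fetched robots.txt by parsing rules directly;
# parse() accepts an iterable of lines.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /w/",
])

# Paths under /w/ are blocked for every agent, others are allowed.
print(rp.can_fetch("MyCrawler/1.0", "https://en.wikipedia.org/w/index.php"))
print(rp.can_fetch("MyCrawler/1.0", "https://en.wikipedia.org/wiki/Python"))
```

If you fetch robots.txt yourself with a custom User-Agent header (rather than calling rp.read(), which uses the default urllib agent and gets the 403), you can pass the response lines to parse() the same way.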
It may also be worth mentioning that if you're planning to crawl
a lot of Wikipedia pages, you'd be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).