Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?
nagle at animats.com
Tue Oct 2 17:11:28 CEST 2007
Filip Salomonsson wrote:
> On 02/10/2007, John Nagle <nagle at animats.com> wrote:
>> But there's something in there now that robotparser doesn't like.
>> Any ideas?
> Wikipedia denies _all_ access for the standard urllib user agent, and
> when the robotparser gets a 401 or 403 response when trying to fetch
> robots.txt, it is equivalent to "Disallow: *".
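The behavior Filip describes can be seen without touching the network by feeding rules to the parser directly. In the modern `urllib.robotparser` (Python 3), a 401/403 on the robots.txt fetch sets an internal disallow-all flag, which is equivalent to a blanket `Disallow`. The sketch below uses `parse()` with illustrative rules (the `/w/` path is only an example, not Wikipedia's actual current file), and a made-up agent name `MyBot`:

```python
import urllib.robotparser

# Illustrative rules of the kind a site might serve; not Wikipedia's
# real robots.txt.
rules = """\
User-agent: *
Disallow: /w/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyBot", "https://en.wikipedia.org/w/index.php"))  # False
print(rp.can_fetch("MyBot", "https://en.wikipedia.org/wiki/Python"))  # True
```

When the parser fetches robots.txt itself via `read()` and the server answers 401 or 403, every subsequent `can_fetch()` call returns False, which is the blanket-disallow behavior described above.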
    That explains it. It's an undocumented feature of "robotparser",
as is the 'errcode' variable. The documentation of "robotparser" is
silent on error handling (can it raise an exception?) and should be
updated to cover it.
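On the exception question: in recent CPython versions of `urllib.robotparser`, `read()` swallows HTTP errors (401/403 become disallow-all, other 4xx become allow-all) but lets lower-level failures such as DNS errors propagate as `urllib.error.URLError`. A small sketch, using the reserved `.invalid` TLD which is guaranteed never to resolve:

```python
import urllib.error
import urllib.robotparser

# ".invalid" is RFC-reserved and never resolves, so read() fails at
# the network layer rather than with an HTTP status code.
rp = urllib.robotparser.RobotFileParser("http://nonexistent.invalid/robots.txt")
try:
    rp.read()
except urllib.error.URLError as exc:
    print("read() raised:", exc.reason)
```

So callers that crawl arbitrary sites do need a try/except around `read()`; this is exactly the kind of thing the documentation leaves unstated.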
> It could also be worth mentioning that if you were planning on
> crawling a lot of Wikipedia pages, you may be better off downloading
> the whole thing instead: <http://download.wikimedia.org/>
> (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
> wiki markup to HTML).
    This is for SiteTruth, the site rating system (see "sitetruth.com"),
and we never look at more than 21 pages per site. We're looking for
the name and address of the business behind the web site, and if we
can't find it after checking the 20 most obvious places, it's either
not there or not "prominently disclosed".