Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?
Nikita the Spider
NikitaTheSpider at gmail.com
Thu Oct 4 18:14:40 CEST 2007
In article <ActMi.30614$eY.11375 at newssvr13.news.prodigy.net>,
John Nagle <nagle at animats.com> wrote:
> Filip Salomonsson wrote:
> > On 02/10/2007, John Nagle <nagle at animats.com> wrote:
> >> But there's something in there now that robotparser doesn't like.
> >> Any ideas?
> > Wikipedia denies _all_ access for the standard urllib user agent, and
> > when the robotparser gets a 401 or 403 response when trying to fetch
> > robots.txt, it is equivalent to "Disallow: *".
> > http://infix.se/2006/05/17/robotparser
> That explains it. It's an undocumented feature of "robotparser",
> as is the 'errcode' variable. The documentation of "robotparser" is
> silent on error handling (can it raise an exception?) and should be
Robotparser is probably following the never-approved RFC for robots.txt,
which is the closest thing there is to a standard. It says, "On server
response indicating access restrictions (HTTP Status Code 401 or 403) a
robot should regard access to the site completely restricted."
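For what it's worth, here's a minimal sketch of the behaviour in question
(Python 2.x's robotparser module; the 'errcode' attribute it exposes is
undocumented, and this assumes Wikipedia is still returning 403 to the
default urllib user agent):

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://en.wikipedia.org/robots.txt")
    rp.read()

    # Wikipedia rejects the default urllib user agent, so the fetch of
    # robots.txt itself comes back 401/403. robotparser then flags the
    # whole site as off-limits, so every can_fetch() call returns False,
    # exactly as if robots.txt said "User-agent: *" / "Disallow: /".
    print getattr(rp, "errcode", None)   # undocumented; 403 in this case
    print rp.can_fetch("MyBot", "http://en.wikipedia.org/wiki/Main_Page")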
If you're interested, I have a replacement for the robotparser module
that works a little better (IMHO) and which you might also find better
documented. I'm using it in production code:
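If all you need in the meantime is a quick workaround with the standard
library, you can fetch robots.txt yourself with a friendlier User-Agent
and hand the text to the parser (the agent string below is just a
made-up example; put your own bot name and contact info there):

    import urllib2
    import robotparser

    req = urllib2.Request(
        "http://en.wikipedia.org/robots.txt",
        headers={"User-Agent": "MyBot/1.0 (+http://example.com/bot)"})
    lines = urllib2.urlopen(req).read().splitlines()

    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    # can_fetch() now reflects the actual rules in robots.txt rather
    # than a blanket "disallow everything" caused by the 403.
    print rp.can_fetch("MyBot", "http://en.wikipedia.org/wiki/Main_Page")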
Whole-site HTML validation, link checking and more