Problem with Python's "robots.txt" file parser in module robotparser
nagle at animats.com
Wed Jul 11 18:57:56 CEST 2007
Python's "robots.txt" file parser may be misinterpreting a
special case. Given a robots.txt file like this:
the Python library class "robotparser.RobotFileParser" considers all pages of the
site to be disallowed. Apparently "Disallow: //" is being interpreted as
"Disallow: /", so even the home page of the site is locked out. This may be incorrect.
This is the robots.txt file for "http://ibm.com".
Some IBM operating systems recognize filenames starting with "//"
as a special case, like a network root, so IBM may be trying to
handle some problem like that.
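
A minimal way to check the behavior described above might look like the sketch below. It makes several assumptions: it uses the Python 3 module name "urllib.robotparser" (the 2007-era module was "robotparser"), the two-line robots.txt is a hypothetical stand-in that keeps only the "Disallow: //" rule mentioned here, the host "www.example.com" is a placeholder, and "home_page_allowed" is a helper invented for illustration. The result may differ between Python versions, so no expected output is shown.

```python
# Python 3 module name; the 2007-era Python 2 module was "robotparser".
from urllib.robotparser import RobotFileParser

# Hypothetical minimal file: only the "Disallow: //" rule comes from the post.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: //",
]


def home_page_allowed(lines, agent="*", url="http://www.example.com/"):
    """Parse robots.txt lines and report whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(lines)  # parse from memory, no network access needed
    return parser.can_fetch(agent, url)


home_allowed = home_page_allowed(ROBOTS_LINES)
print("home page allowed:", home_allowed)
```

If this prints False, the parser is treating "Disallow: //" as covering the whole site, which is the behavior reported here.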
The spec for "robots.txt", at
says "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved." That suggests that "//" should only disallow
paths beginning with "//".
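
Read literally, the quoted rule is a plain string-prefix test. The sketch below illustrates that reading only; it is not how "robotparser" is implemented, and the function name "literally_disallowed" and the sample paths are made up for the example.

```python
def literally_disallowed(path, disallow_values):
    """Return True if `path` starts with any Disallow value.

    Empty values are skipped, since an empty Disallow field
    means that nothing is disallowed.
    """
    return any(path.startswith(value) for value in disallow_values if value)


rules = ["//"]  # the single rule discussed in this post

print(literally_disallowed("/", rules))               # False: home page allowed
print(literally_disallowed("/index.html", rules))     # False
print(literally_disallowed("//server/share", rules))  # True: starts with "//"
```

Under this interpretation the home page "/" stays reachable, and only paths that really begin with "//" are blocked.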