Problem with Python's "robots.txt" file parser in module robotparser

John Nagle nagle at animats.com
Wed Jul 11 18:57:56 CEST 2007


   Python's "robots.txt" file parser may be misinterpreting a
special case.  Given a robots.txt file like this:

	User-agent: *
	Disallow: //
	Disallow: /account/registration
	Disallow: /account/mypro
	Disallow: /account/myint
	...

the Python library "robotparser.RobotFileParser()" considers all pages of the
site to be disallowed.  Apparently "Disallow: //" is being interpreted as
"Disallow: /", so even the home page of the site is locked out.  This looks incorrect.
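The behavior is easy to reproduce with a short script.  (In Python 3 the class
moved to "urllib.robotparser"; the call to parser.modified() marks the rules
as read, which newer versions require before can_fetch() will consult them.)

```python
# Reproduce the report: feed the rules above to the stdlib parser and
# see which paths it considers fetchable.  In Python 2 this class lived
# in the top-level "robotparser" module; in Python 3 it is
# urllib.robotparser.RobotFileParser with the same interface.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
parser.modified()  # newer versions answer False from can_fetch()
                   # until the rules are marked as read

# Per the robotstxt.org spec, only paths beginning with "//" should be
# blocked by the first rule, so "/" ought to be fetchable.
for path in ("/", "//net/share", "/account/registration", "/products"):
    print(path, parser.can_fetch("*", "http://ibm.com" + path))
```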

This is the robots.txt file for "http://ibm.com".
Some IBM operating systems treat filenames beginning with "//" as a
special case, roughly a network root, so the rule may be an attempt
to block paths of that form.

The spec for "robots.txt", at

http://www.robotstxt.org/wc/norobots.html

says "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved."  That suggests that "//" should only disallow
paths beginning with "//".
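Read literally, the spec's matching is plain string-prefix comparison with no
path normalization.  A minimal sketch of that reading (the function name and
rule list are illustrative, not from any library):

```python
def disallowed(path: str, rules: list[str]) -> bool:
    """Spec-style check: a path is disallowed iff it starts with the
    verbatim value of some Disallow rule (empty rules match nothing)."""
    return any(rule and path.startswith(rule) for rule in rules)

rules = ["//", "/account/registration", "/account/mypro", "/account/myint"]
print(disallowed("/", rules))            # "/" does not begin with "//"
print(disallowed("//net/share", rules))  # does begin with "//"
print(disallowed("/account/mypromo", rules))
```

Under this reading the home page "/" stays fetchable, since it does not begin
with "//" or any of the "/account/..." prefixes.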

				John Nagle
				SiteTruth


