[Python-bugs-list] [ python-Bugs-523041 ] Robotparser incorrectly applies regex
noreply@sourceforge.net
Wed, 06 Mar 2002 04:09:56 -0800
Bugs item #523041, was opened at 2002-02-26 17:14
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=523041&group_id=5470
Category: Python Library
Group: None
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Costas Malamas (cmalamas)
Assigned to: Nobody/Anonymous (nobody)
Summary: Robotparser incorrectly applies regex
Initial Comment:
Robotparser uses re to evaluate the Allow/Disallow
directives, but nowhere in the RFC is it specified that
these directives can be regular expressions. As a
result, directives such as the following are
misinterpreted:
User-Agent: *
Disallow: /.
The directive (which is actually syntactically
incorrect according to the RFC) denies access to the
root directory, but not the entire site; it should
pass robotparser but it fails (e.g.
http://www.pbs.org/robots.txt)
From the draft RFC
(http://www.robotstxt.org/wc/norobots.html):
"The value of this field specifies a partial URL that
is not to be visited. This can be a full path, or a
partial path; any URL that starts with this value will
not be retrieved. For example, Disallow: /help
disallows both /help.html"
Also the final RFC excludes * as valid in the path
directive (http://www.robotstxt.org/wc/norobots-rfc.html).
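To illustrate the difference described above, here is a minimal sketch in Python 3 (the report itself predates it). It assumes the directive value is handed to re.match unchanged; the helper names regex_applies and prefix_applies are hypothetical and are not part of robotparser.

```python
import re

def regex_applies(rule_path, url_path):
    # Assumption: the directive value is used unchanged as a regex pattern.
    return re.match(rule_path, url_path) is not None

def prefix_applies(rule_path, url_path):
    # Draft-RFC reading: a rule covers any URL path that starts with its value.
    return url_path.startswith(rule_path)

rule = "/."
for path in ("/", "/index.html", "/.hidden/file"):
    # Print the path, then the regex verdict, then the prefix verdict.
    print(path, regex_applies(rule, path), prefix_applies(rule, path))
```

Read as a regex, "/." matches a slash followed by any character, so it covers "/index.html"; read as a literal prefix, it covers only paths that begin with "/.".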
Suggested fix (also fixes bug #522898):
robotparser.RuleLine.applies_to becomes:
    def applies_to(self, filename):
        if not self.path:
            self.allowance = 1
        return self.path=="*" or self.path.find(filename) == 0
----------------------------------------------------------------------
>Comment By: Costas Malamas (cmalamas)
Date: 2002-03-06 12:09
Message:
Logged In: YES
user_id=71233
calvin is right; the patch was incorrect. A better one
(and more tested by now):
    def applies_to(self, filename):
        if not self.path:
            self.allowance = 1
        return self.path=="*" or urllib.quote(filename).startswith(self.path)
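A rough sketch of what quoting the filename before the prefix test changes, written for Python 3, where urllib.parse.quote is the counterpart of the urllib.quote call above. The standalone applies_to function and the sample paths are hypothetical, and the empty-path branch is left out for brevity.

```python
from urllib.parse import quote

def applies_to(rule_path, filename):
    # Percent-encode the requested path, then test it against the rule prefix.
    return rule_path == "*" or quote(filename).startswith(rule_path)

# A rule written in percent-encoded form matches an unencoded request path.
print(applies_to("/my%20docs", "/my docs/report.html"))
# Without quoting, the same comparison fails.
print("/my docs/report.html".startswith("/my%20docs"))
```

One caveat with this approach: a filename that is already percent-encoded gets encoded a second time ("%20" becomes "%2520"), so callers would need to pass unencoded paths.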
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-02-28 15:25
Message:
Logged In: YES
user_id=21627
This has been fixed in robotparser.py 1.11.
----------------------------------------------------------------------
Comment By: Bastian Kleineidam (calvin)
Date: 2002-02-27 14:11
Message:
Logged In: YES
user_id=9205
Patch is not good:
>>> print RuleLine("/tmp", 0).applies_to("/")
1
>>>
This would apply the filename "/" to rule "Disallow: /tmp".
I think it should be:
return self.path=="*" or filename.startswith(self.path)
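The two variants can be compared side by side in a small Python 3 sketch. RuleLineSketch is a hypothetical stand-in, not the robotparser class, and it omits the empty-path handling from the original patch.

```python
class RuleLineSketch:
    """Hypothetical stand-in for robotparser.RuleLine (not the library class)."""

    def __init__(self, path, allowance):
        self.path = path
        self.allowance = allowance

    def applies_to_find(self, filename):
        # First suggested patch: looks for the filename inside the rule path.
        return self.path == "*" or self.path.find(filename) == 0

    def applies_to_prefix(self, filename):
        # Correction proposed here: the filename must start with the rule path.
        return self.path == "*" or filename.startswith(self.path)

rule = RuleLineSketch("/tmp", 0)
print(rule.applies_to_find("/"))              # rule wrongly applies to "/"
print(rule.applies_to_prefix("/"))            # rule does not apply to "/"
print(rule.applies_to_prefix("/tmp/x.html"))  # rule applies below /tmp
```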
----------------------------------------------------------------------