[Python-bugs-list] [ python-Bugs-690214 ] robotparser only applies first applicable rule

SourceForge.net noreply@sourceforge.net
Mon, 03 Mar 2003 12:22:22 -0800


Bugs item #690214, was opened at 2003-02-20 13:55
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=690214&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Pilgrim (f8dy)
>Assigned to: Skip Montanaro (montanaro)
Summary: robotparser only applies first applicable rule

Initial Comment:
robotparser.py::RobotFileParser::can_fetch 
currently returns the result of the first applicable rule.  It 
should loop through all rules looking for anything that 
disallows access.  For example, if your first rule applies 
to 'wget' and 'python' and disallows access to /dir1/, and 
your second rule is a 'python' rule that disallows access 
to /dir2/, robotparser will falsely claim that python is 
allowed to access /dir2/.

Patch against current source attached.
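
The scenario above can be sketched in a few lines. This is only an
illustrative model, not the code in robotparser.py and not the attached
patch: the RULES table and both function names are hypothetical, and
path matching is simplified to a plain prefix test.

```python
# Hypothetical model of the two rule sets described in the report.
# Each entry: (set of user-agent names, list of disallowed path prefixes).
RULES = [
    ({"wget", "python"}, ["/dir1/"]),
    ({"python"}, ["/dir2/"]),
]


def first_match_can_fetch(agent, path, rules=RULES):
    """Consult only the first rule set that names the agent."""
    for agents, disallowed in rules:
        if agent in agents:
            return not any(path.startswith(p) for p in disallowed)
    return True  # no rule set names this agent


def any_match_can_fetch(agent, path, rules=RULES):
    """Consult every rule set that names the agent; any Disallow wins."""
    for agents, disallowed in rules:
        if agent in agents and any(path.startswith(p) for p in disallowed):
            return False
    return True
```

Under these assumptions, first_match_can_fetch("python", "/dir2/page")
allows the fetch because only the first rule set is consulted, while
any_match_can_fetch("python", "/dir2/page") refuses it.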

----------------------------------------------------------------------

Comment By: Bastian Kleineidam (calvin)
Date: 2003-03-03 06:46

Message:
Logged In: YES 
user_id=9205

Mark, if you dive into
http://www.robotstxt.org/wc/norobots-rfc.txt you'll note
that the first matching user-agent line as well as the first
matching allow or disallow line must be obeyed by the robot
(see 3.2.1 and 3.2.2).
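
As a rough sketch of the first-match behaviour for allow and disallow
lines within one record (RECORD and record_allows are hypothetical names
for illustration, and matching is reduced to a simple prefix test rather
than the draft's full path comparison):

```python
# Hypothetical record for a single user-agent: ordered (is_allow, prefix) lines.
RECORD = [
    (True, "/dir/public/"),   # Allow: /dir/public/
    (False, "/dir/"),         # Disallow: /dir/
]


def record_allows(path, lines=RECORD):
    """Return the verdict of the first line whose prefix matches the path."""
    for is_allow, prefix in lines:
        if path.startswith(prefix):
            return is_allow
    return True  # no line matched, so access is allowed by default
```

Because the first matching line decides, the order of the lines matters:
with the two lines swapped, the Disallow for /dir/ would also cover
/dir/public/.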

Now, I am not opposed to disobeying the above RFC, but there
are other arguments against your patch:
a) it breaks current implementations of robots.txt
(potentially disallowing access to sites)
b) your problem is easily solved by moving Disallow and/or
User-Agent lines to the top

Therefore my vote is -1 for this patch.

Cheers, Bastian


----------------------------------------------------------------------
