[Patches] [Patch #102229] a better robotparser.py module

noreply@sourceforge.net noreply@sourceforge.net
Thu, 04 Jan 2001 18:31:23 -0800


Patch #102229 has been updated. 

Project: python
Category: Modules
Status: Open
Submitted by: calvin
Assigned to: montanaro
Summary: a better robotparser.py module

Follow-Ups:

Date: 2001-Jan-04 18:31
By: gvanrossum

Comment:
Skip, back to you.  Please work with the author on an acceptable version. 
You can check it in once you two agree.

-------------------------------------------------------

Date: 2001-Jan-04 17:43
By: montanaro

Comment:
I fixed the robots.txt file, but I think you should parse
files without the requisite blank lines (be lenient in what
you accept and strict in what you generate).  The
user-agent line can serve as an implicit separator between
one record and the next.
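
Concretely, the lenient rule could look something like this (a rough
sketch of the idea, not the actual patch; the (agents, rules) record
shape is only illustrative):

    def parse_lenient(lines):
        entries = []
        agents, rules = [], []
        for line in lines:
            line = line.split('#')[0].strip()   # drop comments and whitespace
            if not line or ':' not in line:
                continue                        # blank lines no longer required
            field, value = line.split(':', 1)
            field, value = field.strip().lower(), value.strip()
            if field == 'user-agent':
                if rules:                       # rules seen, so a new record starts
                    entries.append((agents, rules))
                    agents, rules = [], []
                agents.append(value)
            elif field in ('allow', 'disallow'):
                rules.append((field, value))
        if agents or rules:
            entries.append((agents, rules))
        return entries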

Skip

-------------------------------------------------------

Date: 2001-Jan-04 15:51
By: calvin

Comment:
Changes:
- the debug variable is now declared global in the test function (see the
  sketch below)
- redirection now works
- accidentally printed "Allow" when I meant "Disallow"; this has been fixed
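
The scoping point behind the first item is easy to miss; a minimal
illustration (stand-in names, not the module's real test code):

    debug = 0

    def _report(msg):
        if debug:                 # reads the module-level variable
            print(msg)

    def _test():
        global debug              # without this line, "debug = 1" would only
        debug = 1                 # bind a local, and _report() would stay silent
        _report('debugging output is now visible')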

It parses the Musi-Cal robots.txt file correctly, but the robots.txt file
itself has syntax errors: each user-agent: line must be preceded by one or
more empty lines.
-------------------------------------------------------

Date: 2001-Jan-04 13:05
By: montanaro

Comment:
I apologize for taking so long to take a look at this.
I was reminded of it when I saw the switch from me to Guido.

I spent a little time fiddling with this module today.  I'm
not satisfied that it works as advertised.  Here are a
number of problems I found:

  * in the test function, the debug variable is not 
    declared global, so setting it to 1 has no effect

  * it never seemed to properly handle redirections, so it
    never got from

    http://www.musi-cal.com/robots.txt

    to

    http://musi-cal.mojam.com/robots.txt

  * once I worked around the redirection problem it seemed
    to parse the Musi-Cal robots.txt file incorrectly.

I replaced httplib with urllib in the read method and
got erroneous results.  If you look at the above robots.txt
file you'll see that a bunch of email address harvesters
are explicitly forbidden (not that they pay attention to 
robots.txt!).  The following should print 0, but prints 1:

    print rp.can_fetch('ExtractorPro',     
                       'http://musi-cal.mojam.com/')
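
For reference, the check can be reproduced end to end roughly like this
(assuming the patched module is importable as robotparser2 and keeps the
old set_url/read/can_fetch interface):

    import robotparser2

    rp = robotparser2.RobotFileParser()
    rp.set_url('http://www.musi-cal.com/robots.txt')
    rp.read()    # has to follow the redirect to musi-cal.mojam.com to work

    # ExtractorPro is disallowed in that robots.txt, so this should print 0
    print(rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/'))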

This is (at least in part) due to the fact that the
redirection never works.  In the version I modified to
use urllib, it displays incorrect permissions for things like
ExtractorPro:

  User-agent: ExtractorPro
  Allow: /

Note that the lines in the robots.txt file for ExtractorPro
are actually

  User-agent: ExtractorPro
  Disallow: /

Skip

-------------------------------------------------------

Date: 2000-Nov-02 09:40
By: calvin

Comment:
I have written a new RobotParser module 'robotparser2.py'.

This module

o is backward compatible with the old one

o matches user agents correctly (this is buggy in robotparser.py)

o strips comments correctly (this is buggy in robotparser.py)

o uses httplib instead of urllib.urlopen() to catch HTTP connection
  errors correctly (this is buggy in robotparser.py); see the sketch below

o implements not only the draft at
  http://info.webcrawler.com/mak/projects/robots/norobots.html
  but also the new one at
  http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html
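
Roughly, the httplib-based fetch amounts to something like the sketch
below (an illustration under stated assumptions, not the patch itself;
the 401/403/404 policy is only an example, and note that plain httplib
does not follow 301/302 redirects by itself):

    import httplib, urlparse

    def fetch_robots(url):
        scheme, netloc, path = urlparse.urlparse(url)[:3]
        h = httplib.HTTP(netloc)            # old-style interface (Python 2.x)
        h.putrequest('GET', path or '/robots.txt')
        h.putheader('Host', netloc)
        h.endheaders()
        errcode, errmsg, headers = h.getreply()
        if errcode == 200:
            return h.getfile().read()       # feed these lines to the parser
        elif errcode in (401, 403):
            return None                     # e.g. treat as "disallow all"
        else:
            return ''                       # e.g. 404: no file, allow all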


Bastian Kleineidam

-------------------------------------------------------

Date: 2000-Nov-02 11:14
By: gvanrossum

Comment:
Skip, can you comment on this?  
-------------------------------------------------------

For more info, visit:

http://sourceforge.net/patch/?func=detailpatch&patch_id=102229&group_id=5470