[Patches] [Patch #102229] a better robotparser.py module
noreply@sourceforge.net
Fri, 19 Jan 2001 14:57:07 -0800
Patch #102229 has been updated.
Project: python
Category: Modules
Status: Open
Submitted by: calvin
Assigned to: montanaro
Summary: a better robotparser.py module
Follow-Ups:
Date: 2001-Jan-19 14:57
By: gvanrossum
Comment:
Skip, if this is ok with you, can you check it in? (Unless you don't want
to check it in because you still feel your module is better -- in that
case we should probably drop it or reassign it...)
-------------------------------------------------------
Date: 2001-Jan-06 04:31
By: calvin
Comment:
Ok, some new changes:
- allow parsing of user-agent: lines without a preceding blank line
- two licenses available: Python 2.0 license or GPL
- add docstrings for the classes
Bastian
-------------------------------------------------------
Date: 2001-Jan-04 18:31
By: gvanrossum
Comment:
Skip, back to you. Please work with the author on an acceptable version.
You can check it in once you two agree.
-------------------------------------------------------
Date: 2001-Jan-04 17:43
By: montanaro
Comment:
I fixed the robots.txt file, but I think you should parse
files without the requisite blank lines (be lenient in what
you accept and strict in what you generate). The
user-agent line can serve as an implicit separator between
one record and the next.
Skip
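A minimal sketch of the lenient splitting described above (an illustration,
not the patch itself; the helper name split_records is made up): a
User-agent line that follows Allow/Disallow rules opens a new record even
when the blank-line separator required by the draft is missing.

    def split_records(lines):
        # Group robots.txt lines into records.  A User-agent line that
        # follows Allow/Disallow rules starts a new record even when the
        # blank-line separator required by the draft is missing.
        records, current = [], []
        seen_rule = 0
        for line in lines:
            line = line.split('#', 1)[0].strip()   # drop comments
            if not line:
                if current:
                    records.append(current)
                current, seen_rule = [], 0
                continue
            field = line.split(':', 1)[0].strip().lower()
            if field == 'user-agent' and seen_rule and current:
                records.append(current)
                current, seen_rule = [], 0
            current.append(line)
            if field in ('allow', 'disallow'):
                seen_rule = 1
        if current:
            records.append(current)
        return records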
-------------------------------------------------------
Date: 2001-Jan-04 15:51
By: calvin
Comment:
Changes:
- global debug variable in the test function
- redirection now works
- accidentally printed "Allow" when I meant "Disallow". This has been
fixed.
It parses the Musi-Cal robots.txt file correctly, but that robots.txt file
has syntax errors: each user-agent: line must be preceded by one or more
blank lines.
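For reference, the strict form in the draft separates records with blank
lines, for example (an illustrative file, not the actual Musi-Cal one):

    User-agent: ExtractorPro
    Disallow: /

    User-agent: *
    Disallow: /cgi-bin/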
-------------------------------------------------------
Date: 2001-Jan-04 13:05
By: montanaro
Comment:
I apologize for taking so long to take a look at this.
I was reminded of it when I saw the switch from me to Guido.
I spent a little time fiddling with this module today. I'm
not satisfied that it works as advertised. Here are a
number of problems I found:
* in the test function, the debug variable is not
declared global, so setting it to 1 has no effect
* it never seemed to properly handle redirections, so it
never got from
http://www.musi-cal.com/robots.txt
to
http://musi-cal.mojam.com/robots.txt
* once I worked around the redirection problem it seemed
to parse the Musi-Cal robots.txt file incorrectly.
I replaced httplib with urllib in the read method and
got erroneous results. If you look at the above robots.txt
file you'll see that a bunch of email address harvesters
are explicitly forbidden (not that they pay attention to
robots.txt!). The following should print 0, but prints 1 (see the
sketch after this list):
    print rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')
This is (at least in part) due to the fact that the
redirection never works. In the version I modified to
use urllib, it displays incorrect permissions for things like
ExtractorPro:
User-agent: ExtractorPro
Allow: /
Note that the lines in the robots.txt file for ExtractorPro
are actually
User-agent: ExtractorPro
Disallow: /
Skip
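A rough sketch of the kind of check described above (my illustration, not
Skip's actual test code; it assumes the list-of-lines parse() interface
both modules share): fetch the file through urllib so the redirect from
www.musi-cal.com to musi-cal.mojam.com is followed, then the ExtractorPro
request should come back forbidden.

    import urllib
    import robotparser    # or the proposed robotparser2, same interface

    debug = 0

    def check():
        # The test function must declare the flag global; otherwise
        # assigning to it only rebinds a local name and has no effect.
        global debug
        debug = 1

        # urllib.urlopen() follows the 3xx redirect from www.musi-cal.com
        # to musi-cal.mojam.com; a bare httplib request does not.
        f = urllib.urlopen('http://www.musi-cal.com/robots.txt')
        lines = f.readlines()
        f.close()

        rp = robotparser.RobotFileParser()
        rp.parse(lines)
        # The file disallows ExtractorPro, so a correct parser prints 0;
        # the buggy behaviour described above prints 1.
        print rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')

    check()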
-------------------------------------------------------
Date: 2000-Nov-02 09:40
By: calvin
Comment:
I have written a new RobotParser module 'robotparser2.py'. This module
o is backward compatible with the old one (usage is sketched below)
o matches user agents correctly (this is buggy in robotparser.py)
o strips comments correctly (this is buggy in robotparser.py)
o uses httplib instead of urllib.urlopen() to catch HTTP connect
errors correctly (this is buggy in robotparser.py)
o implements not only the draft at
http://info.webcrawler.com/mak/projects/robots/norobots.html
but also the new one at
http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html
Bastian Kleineidam
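For illustration, the backward-compatible interface looks like this (the
crawler name and the page URL below are made up):

    import robotparser    # robotparser2.py keeps this interface

    rp = robotparser.RobotFileParser()
    rp.set_url('http://www.musi-cal.com/robots.txt')
    rp.read()    # robotparser2 uses httplib here, so connect errors are caught
    if rp.can_fetch('MyCrawler', 'http://www.musi-cal.com/cgi-bin/search'):
        print 'fetching is allowed'
    else:
        print 'forbidden by robots.txt'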
-------------------------------------------------------
Date: 2000-Nov-02 11:14
By: gvanrossum
Comment:
Skip, can you comment on this?
-------------------------------------------------------
For more info, visit:
http://sourceforge.net/patch/?func=detailpatch&patch_id=102229&group_id=5470