[issue16099] robotparser doesn't support request rate and crawl delay parameters

Nikolay Bogoychev report at bugs.python.org
Tue Dec 10 01:22:52 CET 2013


Nikolay Bogoychev added the comment:

Thank you for the review!
I have addressed your comments and released v2 of the patch.
Highlights:
 No longer crashes when given a malformed crawl-delay parameter or a malformed robots.txt.
 Returns None when the parameter is missing or its syntax is invalid (see the sketch below).
 Simplified several functions.
 Extended the tests.
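
For illustration, here is a minimal sketch of the intended behaviour, assuming the crawl_delay()/request_rate() method names from the attached patch (the sample robots.txt below is made up):

from urllib import robotparser

# Deliberately malformed Crawl-delay value; the patched parser should not crash.
robots_txt = """\
User-agent: *
Crawl-delay: not-a-number
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("*"))    # None -- the value could not be parsed
print(rp.request_rate("*"))   # None -- no Request-rate line is present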

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst
File Doc/library/urllib.robotparser.rst (right):

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
Doc/library/urllib.robotparser.rst:56: .. method:: crawl_delay(useragent)
On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is crawl_delay used for search engines? Google recommends you to set crawl speed
> via Google Webmaster Tools instead.
> 
> See https://support.google.com/webmasters/answer/48620?hl=en.
 
The Crawl-delay and Request-rate parameters are aimed at the custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools only controls Google's own crawler, and web admins typically set different rates for Google/Yahoo/Bing than for all other user agents.
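
As an illustration, a hedged sketch of how a custom crawler could consult such per-agent directives (method names as proposed in the patch; the robots.txt content below is invented):

from urllib import robotparser

# Hypothetical robots.txt that throttles unknown crawlers harder than Googlebot.
robots_txt = """\
User-agent: Googlebot
Crawl-delay: 1

User-agent: *
Crawl-delay: 10
Request-rate: 1/30
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("Googlebot"))   # 1  -- the Google-specific entry
print(rp.crawl_delay("MyCrawler"))   # 10 -- falls back to the '*' entry
print(rp.request_rate("MyCrawler"))  # roughly: 1 request per 30 seconds

A polite custom crawler would then sleep for crawl_delay() seconds between requests to that host.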

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py
File Lib/urllib/robotparser.py (right):

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
Lib/urllib/robotparser.py:168: for entry in self.entries:
On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is there a better way to calculate this? (perhaps O(1)?)

I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, because every previously implemented function uses the

for entry in self.entries:
    if entry.applies_to(useragent):
        ...

pattern. I don't think this matters much here: these two functions only need to be called once per domain, and a robots.txt file seldom contains more than three entries. For that reason I have simply followed the design laid out by the original developer.
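
For reference, a minimal sketch of how the new accessor follows that existing pattern (the delay attribute on Entry is assumed here for illustration and may not match the patch exactly):

def crawl_delay(self, useragent):
    if not self.mtime():
        return None                      # robots.txt has not been read yet
    for entry in self.entries:           # same linear scan as can_fetch()
        if entry.applies_to(useragent):
            return entry.delay           # None if the directive was missing or invalid
    return None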

Thanks

Nick

----------
Added file: http://bugs.python.org/file33071/robotparser_v2.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue16099>
_______________________________________