[issue13281] robotparser.RobotFileParser ignores rules preceded by a blank line

Petri Lehtinen report at bugs.python.org
Sat Oct 29 12:11:21 CEST 2011


Petri Lehtinen <petri at digip.org> added the comment:

> Because of the line break, clicking that link gives "Server error 404".

I don't see a line break, but the comma after the link seems to break it. Sorry.

> The way I read the grammar, 'records' (which start with an agent
> line) cannot have blank lines and must be separated by blank lines.

Ah, true. But it seems to me that blank lines elsewhere don't break the parsing. If other robots.txt parser implementations allow arbitrary blank lines, we could add a strict=False parameter to make our parser non-strict. This would be a new feature, of course.

Does the parser currently handle blank lines between full records (agentline(s) + ruleline(s)) correctly?
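To illustrate the question, here's a minimal check of the spec-compliant case (a blank line separating two complete records) against the stdlib parser; the robots.txt content, bot names, and URLs are made up for the example:

```python
from urllib import robotparser

# Two complete records (agent line + rule lines), separated by a blank
# line, as the grammar requires.
ROBOTS_TXT = """\
User-agent: FooBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# FooBot matches the first record; everyone else falls back to "*".
print(rp.can_fetch("FooBot", "http://example.com/private/page"))  # False
print(rp.can_fetch("FooBot", "http://example.com/public"))        # True
print(rp.can_fetch("OtherBot", "http://example.com/tmp/x"))       # False
```

The open question is what happens when the blank line falls *inside* a record, e.g. between the User-agent line and its Disallow lines; per the grammar that record is malformed, and that is the case the issue title describes.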

> I also do not see "Crawl-delay" and "Sitemap" (from whitehouse.gov) in the grammar referenced above. So I wonder if de facto practice has evolved.

The spec says:

   Lines with Fields not explicitly specified by this specification
   may occur in the /robots.txt, allowing for future extension of the
   format.

So these seem to be nonstandard extensions.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13281>
_______________________________________
