[ python-Bugs-813986 ] robotparser interactively prompts for username and password

Thu Oct 6 18:19:48 CEST 2005

Bugs item #813986, was opened at 2003-09-28 15:06
Message generated for change (Settings changed) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=813986&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
>Priority: 6
Submitted By: Erik Demaine (edemaine)
>Assigned to: Martin v. Löwis (loewis)
Summary: robotparser interactively prompts for username and password

Initial Comment:
This is a rare occurrence, but if a /robots.txt file is
password-protected on an HTTP server, robotparser
interactively prompts (via raw_input) for a username
and password, because that is urllib's default
behavior.  One example of such a URL, at least at the
time of this writing, is

http://www.cosc.canterbury.ac.nz/robots.txt

Given that robotparser and robots.txt are all about
*robots* (not interactive users), I don't think this
interactive behavior is terribly appropriate.  Attached
is a simple patch to robotparser.py to fix this
behavior, forcing urllib to return the 401 error that
it ought to.
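
For illustration only (this is not the attached patch): a minimal
Python 3 sketch of a non-interactive fetch. It relies on
urllib.request.urlopen raising HTTPError for a 401 response instead of
prompting. The helper name fetch_robots and its opener parameter are
hypothetical and exist only so the behavior can be exercised without a
network connection.

```python
import urllib.error
import urllib.request


def fetch_robots(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes) for url without ever prompting.

    fetch_robots and its opener parameter are hypothetical names used
    for illustration; opener defaults to urllib.request.urlopen.
    """
    try:
        with opener(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # A password-protected robots.txt surfaces here as err.code == 401;
        # the caller sees the error code and no user input is requested.
        return err.code, b""
```

A caller would then branch on the returned status code rather than on
anything typed at a terminal.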

Another issue is whether a 401 (Authorization Required)
URL means that everything should be allowed or
everything should be disallowed.  I'm not sure what's
"right".  Reading the spec, it says 'This file must be
accessible via HTTP on the local URL "/robots.txt"'
which I would read to mean it should be accessible
without username/password.  On the other hand, the
current robotparser.py code says "if self.errcode ==
401 or self.errcode == 403: self.disallow_all = 1"
which has the opposite effect.  I'll leave deciding
which is most appropriate to the powers that be.
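
The two readings of a 401/403 response can be put side by side in a
small sketch. The function name robots_policy, its
restricted_means_disallow flag, and the returned strings are all
hypothetical; the default mirrors the robotparser.py check quoted above,
and the handling of other status codes is an assumption made only to
keep the example complete.

```python
def robots_policy(status, restricted_means_disallow=True):
    """Map the HTTP status of a /robots.txt fetch to a crawl policy.

    All names here are hypothetical; this is a sketch of the decision
    under discussion, not the robotparser implementation.
    """
    if status == 200:
        # The file was served normally, so its rules should be parsed.
        return "parse"
    if status in (401, 403):
        # The open question: does a protected robots.txt forbid
        # everything (the quoted code) or restrict nothing?
        return "disallow_all" if restricted_means_disallow else "allow_all"
    if 400 <= status < 500:
        # Assumption: a missing file places no restrictions on robots.
        return "allow_all"
    # Assumption: any other status is left undecided in this sketch.
    return "undecided"
```

With the default flag, robots_policy(401) yields "disallow_all";
passing restricted_means_disallow=False gives the opposite reading.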

----------------------------------------------------------------------

Comment By: Wummel (calvin)
Date: 2003-09-29 15:24

Message:
Logged In: YES 
user_id=9205

http://www.robotstxt.org/wc/norobots-rfc.html specifies that
401 and 403 return codes mean access to the whole site is
restricted (i.e. disallow_all = 1).

As for the password prompt, the patch looks good to me. In the
long term, robotparser.py should switch to urllib2.py
anyway, and it should handle Transfer-Encoding: gzip.

----------------------------------------------------------------------
