[ python-Feature Requests-1437699 ] allow unicode arguments for robotparser.can_fetch

SourceForge.net noreply at sourceforge.net
Thu Apr 6 17:34:54 CEST 2006


Feature Requests item #1437699 was opened at 2006-02-23 16:07
Message generated for change (Comment added) made by osvenskan
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1437699&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: osvenskan (osvenskan)
Assigned to: Skip Montanaro (montanaro)
Summary: allow unicode arguments for robotparser.can_fetch

Initial Comment:
One-line summary: If the robotparser module encounters
a robots.txt file that contains non-ASCII characters
AND I pass a Unicode user agent string to can_fetch(),
that function crashes with a TypeError under Python
2.4. Under Python 2.3, the error is a UnicodeDecodeError. 

More detail:
When one calls can_fetch(MyUserAgent, url), the
robotparser module compares the UserAgent to each user
agent described in the robots.txt file. If
isinstance(MyUserAgent, str) == True, then the
comparison does not raise an error regardless of the
contents of robots.txt. However, if
isinstance(MyUserAgent, unicode) == True, then Python
implicitly tries to convert the contents of the
robots.txt file to Unicode before comparing it to
MyUserAgent. By default, Python assumes a US-ASCII
encoding when converting, so if the contents of
robots.txt aren't ASCII, the conversion fails. In other
words, this works:
MyRobotParser.can_fetch('foobot', url)
but this fails:
MyRobotParser.can_fetch(u'foobot', url)
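
To make the failure concrete, here is a rough reproduction
of the underlying problem at a Python 2 prompt, independent
of robotparser's exact code path (the agent name below is a
made-up example): mixing a byte string that contains a
non-ASCII byte with a unicode string forces an implicit
decode using the default 'ascii' codec.

>>> agent_from_robots_txt = 'H\xe4m\xe4h\xe4kki'  # bytes as read from robots.txt
>>> agent_from_robots_txt.lower() in u'foobot'
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)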

I recreated this with Python 2.4.1 on FreeBSD 6 and
Python 2.3 under Darwin/OS X. I'll attach examples from
both. The URLs that I use in the attachments are from
my Web site and will remain live. They reference
robots.txt files which contain an umlaut-ed 'a' (0xe4
in iso-8859-1). They're served up using a special
.htaccess file that adds a Content-Type header which
correctly identifies the encoding used for each file.
Here's the contents of the .htaccess file:

AddCharset iso-8859-1 .iso8859-1
AddCharset utf-8 .utf8
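
With those directives in place, the server should send a
response header for the iso-8859-1 copy along these lines
(a hypothetical illustration, not captured from the site):

Content-Type: text/plain; charset=iso-8859-1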


A suggested solution:
AFAICT, the construction of robots.txt is still defined
by "a consensus on 30 June 1994 on the robots mailing
list" [http://www.robotstxt.org/wc/norobots.html] and a
1996 draft proposal
[http://www.robotstxt.org/wc/norobots-rfc.html] that
has never evolved into a formal standard. Neither of
these mentions character sets or encodings, which is no
surprise considering that they date back to the days
when the Internet was poor but happy, when we considered
even ASCII a luxury and were grateful to have it.
("ASCII? We used to dream of having ASCII. We only had
one bit, and it was a zero. We lived in a shoebox in
the middle of the road..." etc.) A backwards-compatible
yet forward-looking solution would be to have the
robotparser module respect the Content-Type header sent
with robots.txt. If no such header is present (or it
names no charset), robotparser should decode the file as
iso-8859-1 per section 3.7.1 of the HTTP 1.1 spec
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1)
which says, 'When no explicit charset parameter is
provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of
"ISO-8859-1" when received via HTTP. Data in character
sets other than "ISO-8859-1" or its subsets MUST be
labeled with an appropriate charset value.' Section
3.6.1 of the HTTP 1.0 spec says the same. Since
ISO-8859-1 is a superset of US-ASCII, robots.txt files
that are pure ASCII won't be affected by the change.
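
To make the suggestion concrete, here is a rough sketch of
the decoding step (hypothetical code, not the actual
robotparser implementation; the read_robots_txt name and
the use of urllib2 are choices made just for illustration):

import urllib2

def read_robots_txt(url):
    # Fetch robots.txt and decode it using the charset parameter
    # from the Content-Type header, falling back to iso-8859-1 per
    # section 3.7.1 of the HTTP 1.1 spec when no charset is sent.
    f = urllib2.urlopen(url)
    raw = f.read()
    charset = f.headers.getparam('charset') or 'iso-8859-1'
    return raw.decode(charset)

Because iso-8859-1 assigns a character to every possible
byte, the fallback decode can't fail, and pure-ASCII files
come through unchanged.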






----------------------------------------------------------------------

>Comment By: osvenskan (osvenskan)
Date: 2006-04-06 11:34

Message:
Logged In: YES 
user_id=1119995

I've also discovered that robotparser can get confused by
files with BOMs (byte order marks). At minimum it should
ignore BOMs; at best, it should use them as clues to the
file's encoding. It does neither, and instead treats the BOM
as character data. That's especially problematic when the
robots.txt file consists of this:
[BOM]User-agent: *
Disallow: / 

In that case, robotparser fails to recognize the string
"User-agent", so the disallow rule is ignored, which in turn
means it treats the file as empty and all robots are
permitted everywhere, which is the exact opposite of what the
author intended. If the first line is a comment, then
robotparser doesn't get confused regardless of whether or
not there's a BOM.

I created a sample robots.txt file exactly like the one
above; it contains a utf-8 BOM. The example below uses this
file, which is on my Web site.

>>> import robotparser
>>> rp=robotparser.RobotFileParser()
>>> rp.set_url("http://semanchuk.com/philip/boneyard/robots/robots.txt.bom")
>>> rp.read()
>>> rp.can_fetch("foobot", "/")  # should return False
True
>>> 
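
For what it's worth, a BOM-aware read step could look
something like this sketch (hypothetical code, not the
current robotparser behaviour; the decode_robots name is
made up for illustration):

import codecs

def decode_robots(raw):
    # Use a leading BOM as an encoding hint and strip it, so that
    # the first "User-agent" line is recognized; with no BOM, fall
    # back to the HTTP default of iso-8859-1.
    for bom, encoding in ((codecs.BOM_UTF8, 'utf-8'),
                          (codecs.BOM_UTF16_LE, 'utf-16-le'),
                          (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode('iso-8859-1')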

My robot parser module doesn't suffer from the BOM bug
(although it doesn't use BOMs to decode the file, either,
which it really ought to). As I said before, you're welcome
to steal code from it or copy it wholesale (it is GPL).
Also, I'll be happy to open a different bug report if you
feel like this should be a separate issue.



----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-20 14:33

Message:
Logged In: YES 
user_id=38388

Reassigning to Skip: I don't use robotparser.

Skip, perhaps you can have a look? (Didn't you write the
robotparser?)


----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-03-18 05:17

Message:
Logged In: YES 
user_id=849994

Turning into a Feature Request.

----------------------------------------------------------------------

Comment By: osvenskan (osvenskan)
Date: 2006-03-07 11:32

Message:
Logged In: YES 
user_id=1119995

Thanks for looking at this. I have some followup comments. 

The list at robotstxt.org is many years stale (note that
Google's bot is present only as Backrub, which was still a
server at Stanford at the time:
http://www.robotstxt.org/wc/active/html/backrub.html) but
nevertheless AFAICT it is the most current bot list on the
Web. If you look carefully, the list *does* contain a
non-ASCII entry (#76 -- easy to miss in that long list). That
Finnish bot is gone but it has left a legacy in the form of
many robots.txt files that were created by automated tools
based on the robotstxt.org list. Google helps us here:
http://www.google.com/search?q=allintext%3AH%C3%A4m%C3%A4h%C3%A4kki+disallow+filetype%3Atxt

And by Googling for some common non-ASCII words and letters
I can find more like this one (look at the end of the
alphabetical list):
http://paranormal.se/robots.txt

Robots.txt files that contain non-ASCII are few and far
between, it seems, but they're out there.
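
For reference, a non-ASCII record in such a file looks
something like this (a hypothetical excerpt modeled on the
robotstxt.org listing, not copied from a live site):

User-agent: Hämähäkki
Disallow: /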

Which leads me to a nitpicky (but important!) point about
Unicode. As you point out, the spec doesn't mention Unicode;
it says nothing at all on the topic of encodings. My
argument is that the spec's silence on encodings doesn't
let us off the hook, because the HTTP 1.0/1.1 specs are
very clear that iso-8859-1, not US-ASCII, is the default
for text content delivered via HTTP. By my interpretation,
this means that the robots.txt examples provided above
comply with the published specs; therefore, code that
fails to interpret them does not.
no obvious need for robotparser to support full-blown
Unicode, just iso-8859-1. 

You might be interested in a replacement for this module
that I've implemented. It does everything that robotparser
does and also handles non-ASCII plus a few other things. It
is GPL; you're welcome to copy it in part or lock, stock and
barrel. So far I've only tested it "in the lab" but I've
done fairly extensive unit testing and I'll soon be testing
it on real-world data. The code and docs are here:
http://semanchuk.com/philip/boneyard/rerp/

Comments & feedback would be most welcome.



----------------------------------------------------------------------

Comment By: Terry J. Reedy (tjreedy)
Date: 2006-03-05 22:01

Message:
Logged In: YES 
user_id=593130

To me, this is not a bug report but at best an RFE.  The 
reported behavior is what I would expect.  I read both the
module doc and the referenced web page and further links.
The doc does not mention Unicode as allowed, and the 300
registered UserAgents at
http://www.robotstxt.org/wc/active/html/index.html
all have ASCII names.

So I recommend closing this as a bug report but will give 
ML a chance to respond.  If switched instead to Feature 
Request, I would think it would need some 'in the wild' 
evidence of need.

----------------------------------------------------------------------
