[ python-Bugs-1712522 ] urllib.quote throws exception on Unicode URL
SourceForge.net
noreply at sourceforge.net
Wed Jun 13 17:36:37 CEST 2007
Bugs item #1712522, was opened at 2007-05-04 06:11
Message generated for change (Comment added) made by varmaa
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712522&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.quote throws exception on Unicode URL
Initial Comment:
The code in urllib.quote fails on Unicode input, when
called by robotparser with a Unicode URL.
Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 415, in run
    pagetree = self.httpfetch() # fetch page
  File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
    if not self.owner().checkrobotaccess(self.requestedurl) : # if access disallowed by robots.txt file
  File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
    return(self.robotcheck.can_fetch(config.kuseragent, url)) # return can fetch
  File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
  File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe2'
That bit of code needs some attention.
- It still assumes character codes stop at 255, which hasn't been true in Python for a while now.
- The initialization may not be thread-safe; a table is being initialized on first use.
"robotparser" was trying to check whether a URL containing a Unicode character was allowed. Note the "KeyError: u'\xe2'".
----------------------------------------------------------------------
Comment By: Atul Varma (varmaa)
Date: 2007-06-13 15:36
Message:
Logged In: YES
user_id=863202
Originator: NO
It should be noted that the unicode aspect of this bug is actually a
recognized flaw with a nontrivial solution. See this thread from the
Python-dev list, dated from July 2006:
http://mail.python.org/pipermail/python-dev/2006-July/067248.html
It was essentially agreed upon in this thread that the "obvious"
solution--simply converting to UTF-8 as per RFC 3986--doesn't actually cover
all cases, and that passing a unicode string in to urllib.quote() indeed
has ambiguous results. For more information, see Mike Brown's comment on
the aforementioned thread:
http://mail.python.org/pipermail/python-dev/2006-July/067335.html
It was generally agreed in the thread that the proper solution was to have
urllib.quote() *only* deal with standard Python string data, and to raise a
TypeError if a unicode string is passed in, implying that any conversion
needs to be done by higher-level code, because implicit conversion within
urllib.quote() is too ambiguous.
However, it seems the TypeError fix was never made to the Python SVN
repository; perhaps this is because it would have broken legacy code that
catches KeyErrors, as John Nagle mentioned? Or perhaps it was
simply because no one ever got around to it. Unfortunately, I'm not in a
position to say for sure, but I hope my explanation helps.
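
For illustration only, the approach described in this comment (quote accepts
byte data only, and the caller chooses the encoding) might be sketched as
below. This is not the patch discussed on python-dev. It is written in
Python 3 terms, where "standard string data" corresponds to bytes and a
unicode string to str, and it assumes urllib.parse.quote_from_bytes from the
Python 3 standard library. The names strict_quote and quote_text are
hypothetical.

```python
from urllib.parse import quote_from_bytes


def strict_quote(data, safe="/"):
    """Percent-encode bytes only; reject text so the caller must pick an encoding.

    Hypothetical helper, not part of urllib.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("strict_quote() expects bytes, got %s" % type(data).__name__)
    return quote_from_bytes(bytes(data), safe=safe)


def quote_text(text, encoding):
    """Higher-level code: the encoding is stated explicitly, never guessed."""
    return strict_quote(text.encode(encoding))


if __name__ == "__main__":
    path = "/caf\xe2"  # contains U+00E2, the character from the traceback
    print(quote_text(path, "utf-8"))    # /caf%C3%A2
    print(quote_text(path, "latin-1"))  # /caf%E2
    try:
        strict_quote(path)  # text passed directly is refused
    except TypeError as exc:
        print("rejected:", exc)
```

The two different results for the same text show why an implicit conversion
inside the quoting function would be ambiguous.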
----------------------------------------------------------------------
Comment By: John Nagle (nagle)
Date: 2007-06-06 16:49
Message:
Logged In: YES
user_id=5571
Originator: YES
As a workaround, you can surround calls to "can_fetch" with a try block
and catch KeyError exceptions. That's what I'm doing.
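
A minimal sketch of that workaround might look like the following. The
helper name can_fetch_or_default and its "default" parameter are
hypothetical; the comment above does not say what should happen once the
KeyError has been caught, so returning a caller-chosen default is an
assumption. FailingChecker is only a stand-in that fails the way the
traceback shows, so the sketch runs without a real robots.txt.

```python
def can_fetch_or_default(checker, useragent, url, default=False):
    """Call checker.can_fetch(), returning `default` if it raises KeyError.

    `checker` is any object with a can_fetch(useragent, url) method, such as
    a robotparser.RobotFileParser instance. The fallback policy is an
    assumption, not something stated in the report.
    """
    try:
        return checker.can_fetch(useragent, url)
    except KeyError:
        # The failure reported in this bug: a non-ASCII character in the URL.
        return default


class FailingChecker(object):
    """Hypothetical stand-in that always fails like the reported traceback."""

    def can_fetch(self, useragent, url):
        raise KeyError(u"\xe2")


if __name__ == "__main__":
    checker = FailingChecker()
    # Treat "could not check" as "not allowed" (the conservative choice).
    print(can_fetch_or_default(checker, "examplebot", u"http://example.com/\xe2"))
```

Whether an unparseable URL should count as allowed or disallowed is a policy
decision for the calling code.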
----------------------------------------------------------------------
Comment By: Collin Winter (collinwinter)
Date: 2007-06-05 23:39
Message:
Logged In: YES
user_id=1344176
Originator: NO
Could you possibly provide a patch to fix this?
----------------------------------------------------------------------