[issue2244] urllib and urllib2 decode userinfo multiple times

Carl Meyer report at bugs.python.org
Thu Mar 6 17:10:10 CET 2008


New submission from Carl Meyer:

Both urllib and urllib2 call urllib.unquote() multiple times on data in
the userinfo section of an FTP URL.  One call occurs at the end of the
urllib.splituser() function.  In urllib, the other call appears in
URLOpener.open_ftp().  In urllib2, the other two occur in
FTPHandler.ftp_open() and Request.get_host().

The effect of this is that if the userinfo section of an FTP URL needs
to contain a literal % sign followed by two hexadecimal digits, the %
sign must be double-encoded as %2525 (for urllib) or triple-encoded as
%252525 (for urllib2) in order for the URL to be accessed.
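
To illustrate the effect, here is a minimal sketch.  It assumes Python
3's urllib.parse.quote/unquote as stand-ins for the Python 2.5
urllib.quote/unquote discussed here; decode_n and the sample password
are hypothetical names made up for the example.

```python
# Sketch only: urllib.parse is the Python 3 home of quote/unquote; the
# report itself concerns urllib.unquote() in Python 2.5.
from urllib.parse import quote, unquote


def decode_n(text, times):
    """Hypothetical helper: apply unquote() to text the given number of times."""
    for _ in range(times):
        text = unquote(text)
    return text


# A password containing a literal % followed by two hex digits.
password = "a%41b"

# Encoded once, as RFC 3986 intends: only the % becomes %25.
encoded_once = quote(password, safe="")

# One decode recovers the password; a second decode corrupts it,
# because the recovered "%41" is itself read as an escape for "A".
decoded_once = decode_n(encoded_once, 1)
decoded_twice = decode_n(encoded_once, 2)

# The workaround: encode as many times as the library will decode.
encoded_twice = quote(encoded_once, safe="")
survives_two_decodes = decode_n(encoded_twice, 2)
```

With a library that decodes twice, only the doubly encoded form yields
the intended password, which is why the % has to be written as %2525.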

The proper behavior would be to unquote a given data segment exactly
once.  RFC 3986, "URI: Generic Syntax"
(http://gbiv.com/protocols/uri/rfc/rfc3986.html), addresses this very
issue in section 2.4 (When to Encode or Decode): "Implementations must
not percent-encode or decode the same string more than once, as decoding
an already decoded string might lead to misinterpreting a percent data
octet as the beginning of a percent-encoding, or vice versa in the case
of percent-encoding an already percent-encoded string."

The solution would be to standardize where in urllib and urllib2 the
unquoting happens, and then make sure it happens nowhere else.  I'm not
familiar enough with the libraries to know where it should be removed
without possibly breaking other behavior.  It seems that just removing
the map/unquote call in urllib.splituser() would fix the problem in
urllib.  I would guess the call in urllib2 Request.get_host() should
also be removed, as the RFC referenced above says clearly that only
individual data segments of the URL should be decoded, not larger
portions that might contain delimiters (: and @).
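
A rough sketch of the "split first, then decode each segment once"
approach, again assuming Python 3's urllib.parse.unquote in place of
urllib.unquote.  The function names split_userinfo and
decode_then_split are hypothetical and are not part of urllib or
urllib2; they only contrast the two orderings.

```python
# Sketch only: not the urllib/urllib2 implementation, and not the
# attached patch.  urllib.parse.unquote stands in for urllib.unquote.
from urllib.parse import unquote


def split_userinfo(netloc):
    """Split 'user:password@host' on its delimiters, then unquote each
    data segment exactly once.  Returns (user, password, host)."""
    userinfo, at_sign, host = netloc.rpartition("@")
    if not at_sign:
        return None, None, netloc
    user, colon, password = userinfo.partition(":")
    return unquote(user), (unquote(password) if colon else None), host


def decode_then_split(netloc):
    """The problematic ordering: unquote the whole string first, so
    encoded delimiters become indistinguishable from real ones."""
    decoded = unquote(netloc)
    userinfo, at_sign, host = decoded.rpartition("@")
    user, colon, password = userinfo.partition(":")
    return user, (password if colon else None), host


# The user name contains an encoded ':' and the password an encoded '%'.
netloc = "a%3Ab:p%2541@ftp.example.org"

good = split_userinfo(netloc)
bad = decode_then_split(netloc)
```

Splitting before decoding keeps the encoded ':' inside the user name,
while decoding the whole string first moves the user/password boundary.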

I've attached a patchset for these suggested changes.  Very superficial
testing suggests that the patch doesn't break anything obvious, but I
make no guarantees.

----------
components: Library (Lib)
files: urllib-issue.patch
keywords: patch
messages: 63324
nosy: carljm
severity: normal
status: open
title: urllib and urllib2 decode userinfo multiple times
type: behavior
versions: Python 2.5
Added file: http://bugs.python.org/file9621/urllib-issue.patch

__________________________________
Tracker <report at bugs.python.org>
<http://bugs.python.org/issue2244>
__________________________________

