[issue3300] urllib.quote and unquote - Unicode issues
report at bugs.python.org
Sat Jul 12 19:05:57 CEST 2008
Matt Giuca <matt.giuca at gmail.com> added the comment:
So today I grepped for "urllib" in the entire library in an effort to
track down every dependency on quote and unquote to see exactly how my
patch breaks other code. I've now investigated every module in the
library which uses quote, unquote or urlencode, and my findings are
documented below in detail.
So far I have found no code "breakage" except for the original
email.utils issue I fixed in patch 2. Of course that doesn't mean the
behaviour hasn't changed. Nearly all modules in the report below have
changed their behaviour: they used to deal with Latin-1-encoded URLs
and now deal with UTF-8-encoded URLs. As discussed at length above, I
see this as a positive change, since nearly everybody encodes URLs in
UTF-8, and of course UTF-8 can represent all characters.
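To make the Latin-1 versus UTF-8 distinction concrete, here is a minimal sketch using the "encoding" argument to quote. It is written against the interface as it exists in current Python 3 rather than against the patch itself, and the sample string is an arbitrary choice for illustration.

```python
from urllib.parse import quote

text = "caf\u00e9"  # "café", a sample string chosen for illustration

# UTF-8 is the default: U+00E9 becomes the two octets C3 A9.
utf8_url = quote(text)

# Latin-1 must be requested explicitly: U+00E9 becomes the single octet E9.
latin1_url = quote(text, encoding="latin-1")

print(utf8_url)    # caf%C3%A9
print(latin1_url)  # caf%E9
```

The same character therefore produces two different URLs, which is why a module's behaviour changes even where nothing visibly breaks.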
I also point out that the http.server module (unpatched) is internally
broken when dealing with filenames with characters outside range(0,256);
my patch fixes it.
I'm attaching patch 5, which adds a bunch of new test cases to various
modules which demonstrate those modules correctly handling UTF-8-encoded
URLs. It also fixes a bug in email.utils which I introduced in patch 2.
Note that I haven't yet fully investigated urllib.request.
Aside from that, the only remaining matter is whether or not it's better
to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that
question doesn't need debate.
So basically I think if there's support for it, this patch is just about
ready to be accepted. I'm hoping it can be included in the 3.0b2 release.
I'd be glad to hear any feedback about this proposal.
Not Yet Investigated
urllib.request
By far the biggest user of quote and unquote.
username, password, hostname and paths are now all converted
to/from UTF-8 percent-encodings.
Other concerns are:
* Data in the form application/x-www-form-urlencoded
* FTP access
I think this needs to be tested further.
Looks fine, not tested
Just used to decode URI auth string (user:pass). This will change
to UTF-8, but is probably OK.
Just uses it in the HTTP handler to encode a dictionary. Probably
preferable to use UTF-8 to encode an arbitrary string.
Calls to urllib look broken. Not tested.
Tested manually, fine
Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.
All uses are for translating between actual file-system paths and
URLs. This works fine for UTF-8 URLs. Note that since it uses
quote to create URLs in a dir listing, and unquote to handle
them, it breaks when unquote is not the inverse of quote.
Consider the following simple script:
s = http.server.HTTPServer(('',8000),
This will "kind of" work in the unpatched version, using
Latin-1 URLs, but filenames with characters above 255 will
break (give a 404 error).
The patch fixes this.
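The round-trip property described above can be checked directly. The following is a rough sketch against current Python 3, not the server code itself; the filename is hypothetical and was chosen to mix a Latin-1 character with characters outside range(0, 256).

```python
from urllib.parse import quote, unquote

# Hypothetical filename: one Latin-1 character plus two characters
# that Latin-1 cannot represent.
filename = "caf\u00e9 \u65e5\u672c.txt"

href = quote(filename)      # what a directory listing would emit as a link
requested = unquote(href)   # what the handler would recover from the request

print(href)
print(requested == filename)  # True when unquote inverts quote
```

With UTF-8 used in both directions, the recovered name matches the original, so the file lookup has a chance of succeeding.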
No test cases. Manually tested, URLs properly match when
percent-encoded in UTF-8.
No test cases available. Manually tested, fine if URLs are
Test cases either exist or added, fine
I wrote a large wad of test cases for all the new functionality.
Added test cases expecting UTF-8.
I changed a test case to expect UTF-8.
I changed this file to behave as it used to, to satisfy its
existing test cases.
Added test cases for UTF-8-encoded query strings.
urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).
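A short sketch of how the two optional arguments to unquote behave, assuming the interface as it exists in current Python 3. The default for "errors" shown here is the current library's; whether this patch used the same default is not stated above.

```python
from urllib.parse import unquote

# The octets C3 A9 are "é" in UTF-8 but two separate characters in Latin-1.
as_utf8 = unquote("%C3%A9")
as_latin1 = unquote("%C3%A9", encoding="latin-1")

# A lone E9 octet is not valid UTF-8; "errors" selects the handling.
replaced = unquote("%E9")  # current default is errors="replace"
try:
    unquote("%E9", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(as_utf8, as_latin1, repr(replaced), strict_failed)
```

A caller that still needs to read Latin-1 URLs can pass encoding="latin-1" explicitly.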
urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also, characters at or above 128 are no longer allowed to be "safe".
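The effect of "encoding", "errors" and "safe" on quote can be sketched as follows, again assuming the interface in current Python 3. The sample strings are illustrative only.

```python
from urllib.parse import quote

text = "\u65e5\u672c"  # two characters that Latin-1 cannot represent

utf8_quoted = quote(text)  # the UTF-8 default can encode any character

# With a narrower encoding, unencodable characters raise by default.
try:
    quote(text, encoding="latin-1")
    raised = False
except UnicodeEncodeError:
    raised = True

# errors="replace" substitutes "?", which is then percent-encoded.
lossy = quote(text, encoding="latin-1", errors="replace")

# "safe" lists ASCII characters to leave unescaped ("/" by default).
kept = quote("a/b c", safe="/")
escaped = quote("a/b c", safe="")

print(utf8_quoted, raised, lossy, kept, escaped)
```

Choosing a narrower encoding makes quote lossy or error-prone for some input, which is one argument for the UTF-8 default.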
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.
Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
Added file: http://bugs.python.org/file10888/parse.py.patch5
Python tracker <report at bugs.python.org>