[issue3300] urllib.quote and unquote - Unicode issues

Sat Jul 12 19:05:57 CEST 2008

Matt Giuca <matt.giuca at gmail.com> added the comment:

So today I grepped for "urllib" in the entire library in an effort to
track down every dependency on quote and unquote to see exactly how my
patch breaks other code. I've now investigated every module in the
library which uses quote, unquote or urlencode, and my findings are
documented below in detail.

So far I have found no code "breakage" except for the original
email.utils issue I fixed in patch 2. Of course that doesn't mean the
behaviour hasn't changed. Nearly all modules in the report below have
changed behaviour: they used to deal with Latin-1-encoded URLs and now
deal with UTF-8-encoded URLs. As discussed at length above, I see this
as a positive change, since nearly everybody encodes URLs in UTF-8, and
of course UTF-8 allows for all characters.

I also point out that the http.server module (unpatched) is internally
broken when dealing with filenames with characters outside range(0,256);
my patch fixes it.

I'm attaching patch 5, which adds a bunch of new test cases to various
modules which demonstrate those modules correctly handling UTF-8-encoded
URLs. It also fixes a bug in email.utils which I introduced in patch 2.

Note that I haven't yet fully investigated urllib.request.

Aside from that, the only remaining matter is whether or not it's better
to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that
question doesn't need debate.

So basically I think if there's support for it, this patch is just about
ready to be accepted. I'm hoping it can be included in the 3.0b2 release
next week.

I'd be glad to hear any feedback about this proposal.

Not Yet Investigated
--------------------

./urllib/request.py
    By far the biggest user of quote and unquote.
    username, password, hostname and paths are now all converted
    to/from UTF-8 percent-encodings.
    Other concerns are:
        * Data in the form application/x-www-form-urlencoded
        * FTP access
    I think this needs to be tested further.

Looks fine, not tested
----------------------

./xmlrpc/client.py
    Just used to decode URI auth string (user:pass). This will change
    to UTF-8, but is probably OK.
./logging/handlers.py
    Just uses it in the HTTP handler to encode a dictionary. Probably
    preferable to use UTF-8 to encode an arbitrary string.
./macurl2path.py
    Calls to urllib look broken. Not tested.

Tested manually, fine
---------------------

./wsgiref/simple_server.py
    Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.
./http/server.py
    All uses are for translating between actual file-system paths and
    URLs. This works fine for UTF-8 URLs. Note that since it uses
    quote to create URLs in a dir listing, and unquote to handle
    them, it breaks when unquote is not the inverse of quote.

    Consider the following simple script:

    import http.server
    s = http.server.HTTPServer(('',8000),
            http.server.SimpleHTTPRequestHandler)
    s.serve_forever()

    This will "kind of" work in the unpatched version, using
    Latin-1 URLs, but filenames with characters outside range(0,256)
    will break (give a 404 error).
    The patch fixes this.
./urllib/robotparser.py
    No test cases. Manually tested, URLs properly match when
    percent-encoded in UTF-8.
./nturl2path.py
    No test cases available. Manually tested, fine if URLs are
    UTF-8 encoded.
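
For illustration, the round-trip requirement noted under
./http/server.py can be sketched as follows. This is only a minimal
sketch against urllib.parse as it exists in current Python 3; the file
name is made up, and decoding as Latin-1 merely approximates the
unpatched mismatch (quote emitting UTF-8 octets, unquote reading them as
Latin-1):

```python
from urllib.parse import quote, unquote

# Hypothetical file name with characters both inside and outside Latin-1.
name = "caf\u00e9 \u4e2d\u6587.txt"

# Patched behaviour: both directions use UTF-8, so unquote inverts quote.
url_path = quote(name)
round_tripped = unquote(url_path)

# Approximation of the unpatched mismatch: UTF-8 octets read as Latin-1.
mismatched = unquote(url_path, encoding="latin-1")
```

Here round_tripped equals name, while mismatched does not, which is why
a server looking up the decoded name would report a 404.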

Test cases either exist or added, fine
--------------------------------------

./test/test_urllib.py
    I wrote a large wad of test cases for all the new functionality.
./wsgiref/util.py
    Added test cases expecting UTF-8.
./http/cookiejar.py
    I changed a test case to expect UTF-8.
./email/utils.py
    I changed this file to behave as it used to, to satisfy its
    existing test cases.
./cgi.py
    Added test cases for UTF-8-encoded query strings.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).
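
As a rough illustration of the new unquote arguments (a sketch using
urllib.parse as shipped in current Python 3, which may differ in detail
from patch 5; the variable names are mine):

```python
from urllib.parse import unquote

# Default: percent-encoded octets are decoded as UTF-8.
utf8_default = unquote("%C3%A9")

# Explicit encoding: the single octet 0xE9 is decoded as ISO-8859-1.
latin1_explicit = unquote("%E9", encoding="latin-1")

# A lone 0xE9 is not valid UTF-8; with errors="strict" decoding raises.
try:
    unquote("%E9", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Both utf8_default and latin1_explicit come out as the same one-character
string (U+00E9), reached from different octet sequences.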

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also non-ASCII characters (128 and above) are no longer
allowed to be "safe".
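
A comparable sketch for quote, again assuming urllib.parse as shipped in
current Python 3 rather than the exact state of patch 5 (the variable
names are illustrative):

```python
from urllib.parse import quote

# Default: the character is encoded to UTF-8 octets, then percent-encoded.
utf8_default = quote("\u00e9")

# Explicit encoding: the same character becomes one Latin-1 octet.
latin1_explicit = quote("\u00e9", encoding="latin-1")

# U+0100 has no Latin-1 encoding, so the errors handler decides the result;
# "replace" substitutes "?", which is itself percent-encoded.
replaced = quote("\u0100", encoding="latin-1", errors="replace")
```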

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
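
One way to see why passing encoding="latin-1" preserves the old
behaviour for the email module: every code point in range(256) maps to
exactly one octet, so quote and unquote stay inverses over that range.
This is a sketch only; it calls urllib.parse directly and is not the
email.utils code itself:

```python
from urllib.parse import quote, unquote

# Every code point in range(256) maps to exactly one Latin-1 octet.
all_latin1 = "".join(chr(i) for i in range(256))

encoded = quote(all_latin1, encoding="latin-1")
decoded = unquote(encoded, encoding="latin-1")
```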

Added file: http://bugs.python.org/file10888/parse.py.patch5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3300>
_______________________________________