[Python-Dev] urllib.quote and unquote - Unicode issues

Thu Jul 31 08:36:30 CEST 2008

Matt Giuca writes:

 > OK, for all the people who say URI encoding does not encode characters: yes
 > it does. This is not an encoding for binary data, it's an encoding for
 > character data, but it's unspecified how the strings map to octets before
 > being percent-encoded.

In other words, it's an encoding for binary data, since the octet
sequences that might be encountered are completely unrestricted.  I
have to side with Bill on this.  URIs are sequences of characters, but
the character set used must contain the ASCII repertoire as a subset,
and the URI delimiters in it must be mapped to the corresponding ASCII
codes.  The rest of the set must be represented as sequences of octets
(which need not even be constant; you could gzip them first for all
URI-encoding cares).

URI-encoding itself is a purely mechanical process which transforms
reserved octets (not used as delimiters) to percent codes.
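To make "purely mechanical" concrete, here is a minimal sketch of such a
transform.  It assumes the simplest policy (escape every octet outside
the RFC 3986 unreserved set); the name percent_encode is hypothetical,
not an existing API.

```python
# Unreserved characters per RFC 3986 Section 2.3.  Iterating over a
# bytes object yields ints, so this is a set of octet values.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def percent_encode(octets):
    """Percent-encode every octet outside the unreserved set.

    Hypothetical helper: it knows nothing about characters or
    encodings, only about octet values.
    """
    return "".join(
        chr(o) if o in UNRESERVED else "%{:02X}".format(o)
        for o in octets
    )

print(percent_encode(b"a b/c"))     # a%20b%2Fc
print(percent_encode(b"\xff\x00"))  # %FF%00
```

Note that the second call is not text in any encoding, and the
transform does not care.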

 > From RFC 3986, section 1.2.1
 > <http://tools.ietf.org/html/rfc3986#section-1.2.1>:

 > > Percent-encoded octets (Section 2.1) may be used within a URI to represent
 > > characters outside the range of the US-ASCII coded character set if this
 > > representation is allowed by the scheme or by the protocol element in which
 > > the URI is referenced.  Such a definition should specify the character
 > > encoding used to map those characters to octets prior to being
 > > percent-encoded for the URI.

This is kinda perverted, but suppose you have bytes which are actually a
Japanese string represented in packed EUC-JP.  AFAICS the paragraph above
does *not* say you can't transcode to UTF-8 before percent-encoding, and
in fact you might be required to by the definition of the scheme.
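As a rough illustration of that case, using Python 3's
urllib.parse.quote_from_bytes: the same three-character Japanese
string gives two different percent-encoded forms depending on whether
the EUC-JP octets are quoted directly or transcoded to UTF-8 first.
The sample string and the variable names are mine.

```python
from urllib.parse import quote_from_bytes

# "nihongo" (the word for the Japanese language), written with escapes
# so this message stays ASCII.
text = "\u65e5\u672c\u8a9e"

# The bytes as they might arrive: packed EUC-JP.
euc_jp_bytes = text.encode("euc_jp")

# Transcode to UTF-8 before percent-encoding, as a scheme might require.
utf8_bytes = euc_jp_bytes.decode("euc_jp").encode("utf-8")

quoted_euc = quote_from_bytes(euc_jp_bytes, safe="")
quoted_utf8 = quote_from_bytes(utf8_bytes, safe="")

print(quoted_euc)
print(quoted_utf8)  # %E6%97%A5%E6%9C%AC%E8%AA%9E
```

Both results are legal percent-encodings; only the scheme definition
can say which one the other end expects.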

 > So the string->string proposal is actually correct behaviour.

Ye-e-es, but.  What the RFC clearly envisions is not that the
percent-encoder will be handed an unencoded string that looks like a
URI, but rather a sequence of octets representing one component
(scheme, authority, path, query, etc) of a URI.

In other words, a string->string URI encoder should only be called by
a URI builder, and never with a precomposed URI-like string.

Something like

from urllib.parse import quote

def URIBuilder (strings):
    """Return a URI built from a list of strings.
    The first string *must* be the scheme.
    If the URI follows the generic URI syntax of RFC 3986, the
    remaining components should be given in the order authority, path,
    fragment, query part [, query part ...].  Pass "" for an absent
    component."""

    def uriencode (s):
        """URI encode a string per RFC 3986 Section 3."""
        # We all know what this does.  Stand-in: escape everything
        # outside the unreserved set, treating the text as UTF-8.
        return quote(s, safe="")

    if strings[0] == "http":
        # HTTP scheme, delimiters and authority
        uri = "http://" + uriencode(strings[1]) + "/"
        # path, if present
        if strings[2]:
            uri = uri + uriencode(strings[2])
        # query, if present
        if strings[4]:
            uri = uri + "?" + uriencode(strings[4])
        # further query parameters, if present
        for s in strings[5:]:
            uri = uri + ";" + uriencode(s)
        # fragment, if present
        if strings[3]:
            uri = uri + "#" + uriencode(strings[3])
    elif strings[0] == "mailto":
        uri = "mailto:" + uriencode(strings[1])
    # etc etc
    else:
        raise ValueError("unsupported scheme: " + strings[0])

    return uri

I think you'd have a much easier time enforcing this pedantically
correct usage with a bytes->bytes encoder.

Of course, it's un-Pythonic to enforce pedantry, and we pedants can
use a string->string encoder correctly.
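For comparison, a bytes->bytes encoder that does enforce the pedantry
might look roughly like this.  It is a sketch only: quote_component is
a hypothetical name, and it leans on Python 3's
urllib.parse.quote_from_bytes for the actual escaping.

```python
from urllib.parse import quote_from_bytes

def quote_component(component):
    """Percent-encode one URI component, bytes in and bytes out.

    Hypothetical helper.  Refusing str forces the caller to decide
    which octets represent the text before quoting happens.
    """
    if not isinstance(component, (bytes, bytearray)):
        raise TypeError("quote_component() needs bytes, not %s"
                        % type(component).__name__)
    # safe="" escapes "/" as well, since this is a single component.
    return quote_from_bytes(bytes(component), safe="").encode("ascii")

print(quote_component("caf\u00e9".encode("utf-8")))  # b'caf%C3%A9'
```

Handing it a str, precomposed URI or not, raises TypeError, so the
choice of encoding can't be skipped by accident.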

 > You really want me to remove the encoding= named argument? And hard-code
 > UTF-8 into these functions?

A quoting function that accepts bytes *must* have an encoding
argument.  There's no point to passing the quoter bytes unless the
text is represented in a non-Unicode encoding.