[Python-Dev] urllib.quote and unquote - Unicode issues
Stephen J. Turnbull
turnbull at sk.tsukuba.ac.jp
Thu Jul 31 08:36:30 CEST 2008
Matt Giuca writes:
> OK, for all the people who say URI encoding does not encode characters: yes
> it does. This is not an encoding for binary data, it's an encoding for
> character data, but it's unspecified how the strings map to octets before
> being percent-encoded.
In other words, it's an encoding for binary data, since the octet
sequences that might be encountered are completely unrestricted. I
have to side with Bill on this. URIs are sequences of characters, but
the character set used must contain the ASCII repertoire as a subset,
and the URI delimiters in it must be mapped to the corresponding ASCII
codes. The rest of the set must be represented as sequences of octets
(which need not even be constant; you could gzip them first for all
URI-encoding cares).
URI-encoding itself is a purely mechanical process which transforms
reserved octets (not used as delimiters) to percent codes.
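To make "purely mechanical" concrete, here is a minimal sketch of a
bytes->bytes percent-encoder. The name percent_encode is hypothetical, and
for simplicity it encodes everything outside the unreserved set of RFC 3986
section 2.3, on the assumption that the caller adds the delimiters afterwards.

```python
# Unreserved characters per RFC 3986 section 2.3, as ASCII octets.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(octets):
    """Percent-encode every octet that is not an unreserved character.

    Hypothetical helper: it never looks at what the octets mean, so any
    character encoding (or none at all) may have produced them.
    """
    if not isinstance(octets, (bytes, bytearray)):
        raise TypeError("percent_encode() wants octets, not text")
    out = bytearray()
    for octet in octets:
        if octet in UNRESERVED:
            out.append(octet)            # passes through unchanged
        else:
            out += b"%%%02X" % octet     # e.g. 0x2F becomes b"%2F"
    return bytes(out)


print(percent_encode(b"a b/c"))          # b'a%20b%2Fc'
```

Nothing in it depends on how the octets were produced, which is the
sense in which the process is an encoding for binary data.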
> From RFC 3986, section 1.2.1
> <http://tools.ietf.org/html/rfc3986#section-1.2.1>:
> > Percent-encoded octets (Section 2.1) may be used within a URI to represent
> > characters outside the range of the US-ASCII coded character set if this
> > representation is allowed by the scheme or by the protocol element in which
> > the URI is referenced. Such a definition should specify the character
> > encoding used to map those characters to octets prior to being
> > percent-encoded for the URI.
This is kinda perverted, but suppose you have bytes which are actually a
Japanese string represented in packed EUC-JP. AFAICS the paragraph above
does *not* say you can't transcode to UTF-8 before percent-encoding, and
in fact you might be required to by the definition of the scheme.
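A sketch of that case, assuming a scheme that wants UTF-8 on the wire. The
helper name transcode_and_quote and the sample text are mine, not anything
in the proposal; quote_from_bytes is the bytes quoter in Python 3's
urllib.parse.

```python
from urllib.parse import quote_from_bytes


def transcode_and_quote(octets, source_encoding, target_encoding="utf-8"):
    """Decode octets from source_encoding, re-encode, then percent-encode.

    Hypothetical helper; the target encoding is whatever the scheme
    definition requires (UTF-8 is only an assumed default here).
    """
    text = octets.decode(source_encoding)
    return quote_from_bytes(text.encode(target_encoding), safe="")


euc_octets = "日本".encode("euc_jp")

# Percent-encoding the EUC-JP octets as they are:
print(quote_from_bytes(euc_octets, safe=""))      # %C6%FC%CB%DC
# Transcoding to UTF-8 first gives a different URI component:
print(transcode_and_quote(euc_octets, "euc_jp"))  # %E6%97%A5%E6%9C%AC
```

Both results are legal percent-encodings of the same text; only the scheme
definition can say which one is wanted.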
> So the string->string proposal is actually correct behaviour.
Ye-e-es, but. What the RFC clearly envisions is not that the
percent-encoder will be handed an unencoded string that looks like a
URI, but rather a sequence of octets representing one component
(scheme, authority, path, query, etc) of a URI.
In other words, a string->string URI encoder should only be called by
a URI builder, and never with a precomposed URI-like string.
Something like
    from urllib.parse import quote

    def URIBuilder(strings):
        """Return a URI built from a list of strings.

        The first string *must* be the scheme.
        If the URI follows the generic URI syntax of RFC 3986, the
        remaining components should be given in the order authority, path,
        fragment, query part [, query part ...]."""

        def uriencode(s):
            """URI encode a string per RFC 3986 Section 3."""
            # We all know what this does.  quote() with no safe characters
            # is only a stand-in here.
            return quote(s, safe="")

        if strings[0] == "http":
            # HTTP scheme, delimiters and authority
            uri = "http://" + uriencode(strings[1]) + "/"
            # path, if present
            if strings[2]:
                uri = uri + uriencode(strings[2])
            # query, if present
            if strings[4]:
                uri = uri + "?" + uriencode(strings[4])
                # further query parameters, if present
                for s in strings[5:]:
                    uri = uri + ";" + uriencode(s)
            # fragment, if present
            if strings[3]:
                uri = uri + "#" + uriencode(strings[3])
        elif strings[0] == "mailto":
            uri = "mailto:" + uriencode(strings[1])
        # etc etc
        return uri
I think you'd have a much easier time enforcing this pedantically
correct usage with a bytes->bytes encoder.
Of course, it's un-Pythonic to enforce pedantry, and we pedants can
use a string->string encoder correctly.
> You really want me to remove the encoding= named argument? And hard-code
> UTF-8 into these functions?
A quoting function that accepts bytes *must* have an encoding
argument. There's no point to passing the quoter bytes unless the
text is represented in a non-Unicode encoding.
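For comparison, a sketch of what an encoding argument buys a string->string
quoter. It uses urllib.parse.quote as found in current Python 3, which is
an assumption on my part and not necessarily the code under discussion in
this thread; the sample text is arbitrary.

```python
from urllib.parse import quote, quote_from_bytes

text = "café"

# The same text gives different percent codes under different encodings.
print(quote(text, safe="", encoding="utf-8"))    # caf%C3%A9
print(quote(text, safe="", encoding="latin-1"))  # caf%E9

# Once the caller holds octets, the quoter has nothing left to decide.
print(quote_from_bytes(text.encode("latin-1"), safe=""))  # caf%E9
```

Hard-coding UTF-8 would make the second result unreachable from a string,
which is why the argument matters for text that has to go out in a legacy
encoding.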