[Python-Dev] urllib.quote and unquote - Unicode issues

Matt Giuca matt.giuca at gmail.com
Sun Jul 13 01:15:18 CEST 2008


Thanks for all the replies, and making me feel welcome :)

>
> If what you are saying is true, then it can probably go in as a bug
> fix (unless someone else knows something about Latin-1 on the Net that
> makes this not true).
>

Well from what I've seen, the only time Latin-1 naturally appears on the net
is when you have a web page in Latin-1 (either explicit or inferred; and
note that a browser like Firefox will infer Latin-1 if it sees only ASCII
characters) with a form in it. Submitting the form, the browser will use
Latin-1 to percent-encode the query string.

So if you write a web app and you don't have any non-ASCII characters or
mention the charset, chances are you'll get Latin-1. But I would argue
you're leaving things to chance and you deserve to get funny behaviour. If
you do any of the following:

   - Use a non-ASCII character, encoded as UTF-8 on the page.
   - Send a Content-Type: xxxx; charset=utf-8.
   - In HTML, set a <meta http-equiv="Content-Type: xxxx; charset=utf-8" />.
   - In the form itself, set <form accept-encoding="utf-8">.

then the browser will encode the form data as UTF-8. And most "proper" web
pages should get themselves explicitly served as UTF-8.

That I can't say I can necessarily due; have my own bug reports to
> work through this weekend. =)


OK well I'm busy for the next few days; after that I can do a patch trade
with someone. (That is if I am allowed to do reviews; not sure since I don't
have developer privileges).


On Sun, Jul 13, 2008 at 5:58 AM, Mark Hammond <mhammond at skippinet.com.au>
wrote:

> > My first post to the list. In fact, first time Python hacker,
> > long-time Python user though. (Melbourne, Australia).
>
> Cool - where exactly?  I'm in Wantirna (although not at this very moment -
> I'm in Lithuania, but home again in a couple of days)


Cool :) Balwyn.


> * Please take Martin with a grain of salt ( \I would say "ignore him", but
> that is too strong ;)


Lol, he is a hard man to please, but he's given some good feedback.


On Sun, Jul 13, 2008 at 7:07 AM, Bill Janssen <janssen at parc.com> wrote:

>
> The standard here is RFC 3986, from Jan 2005, which says,
>
>  ``When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set [UCS],
> the data should first be encoded as octets according to the UTF-8
> character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be
> percent-encoded.''


Ah yes, I was originally hung up on the idea that "URLs had to be encoded in
UTF-8", till Martin pointed out that it only says "new URI scheme" there.
It's perfectly valid to have non-UTF-8-encoded URIs. However in practice
they're almost always UTF-8. So I think introducing the new encoding
argument and having it default to "utf-8" is quite reasonable.

I'd say, treat the incoming data as either Unicode (if it's a Unicode
> string), or some unknown superset of ASCII (which includes both
> Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown
> encoding), and apply the appropriate transformation.
>

Ah there may be some confusion here. We're only dealing with str->str
transformations (which in Python 3 means Unicode strings). You can't put a
bytes in or get a bytes out of either of these functions. I suggested a
"quote_raw" and "unquote_raw" function which would let you do this.

The issue is with the percent-encoded characters in the URI string, which
must be interpreted as bytes, not code points. How then do you convert these
into a Unicode string? (Python 2 did not have this problem, since you simply
output a byte string without caring about the encoding).

On Sun, Jul 13, 2008 at 9:10 AM, "Martin v. Löwis" <martin at v.loewis.de>
wrote:

> > Very nice, I had this somewhere on my todo list to work on. I'm very much
> > in favour, especially since it synchronizes us with the RFCs (for all I
> > remember reading about it last time).
>
> I still think that it doesn't. The RFCs haven't changed, and can't
> change for compatibility reasons. The encoding of non-ASCII characters
> in URLs remains as underspecified as it always was.


Correct. But my patch brings us in-line with that unspecification. The
unpatched version forces you to use Latin-1. My patch lets you specify the
encoding to use.


> Now, with IRIs, the situation is different, but I don't think the patch
> claims to implement IRIs (and if so, it perhaps shouldn't change URL
> processing in doing so).


True. I don't claim to have implemented IRIs or even know enough about them
to do that. I'll read up on these things in the next few days.

However, this is a URI library, not IRI. From what I've seen, it's
percent-encoded URIs coming in from the browser, not IRIs. We just need to
make sure with this patch that IRIs don't become less-supported than they
were before; don't need to explicitly support them.

Cheers,
Matt Giuca
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20080713/6a007f88/attachment-0001.htm>


More information about the Python-Dev mailing list