Re: [Python-Dev] urllib.quote and unquote - Unicode issues

On Wed, Jul 30, 2008 at 12:49 PM, Bill Janssen <janssen@parc.com> wrote:
unquote() -- takes string, produces bytes or string
If optional "encoding" parameter is specified, decodes bytes with that encoding and returns string. Otherwise, returns bytes.
The default of returning bytes will break almost all uses. Most code will use the unquoted result as a text string, not as bytes -- e.g. a server has to unquote the values it receives from a form (whether POST or GET), but almost always the unquoted values are text, e.g. someone's name or address, or a draft email message.
I actually do know a lot about the uses of this function...
But: OK, OK, I yield. Though I still think this is a bad idea, I'll shut up if we can also add "unquote_as_bytes" which returns a byte sequence instead of a string. I'll just change my code to use that.
(Aside: I dislike functions that have a different return type based on the value of a parameter.)
Fair enough.
I think this is as close to consensus as we can get on this issue. Can whoever wrote the patch adjust the patch to this outcome? (I think the only change is to remove the encoding arguments and make separate functions for bytes.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

I think this is as close to consensus as we can get on this issue. Can whoever wrote the patch adjust the patch to this outcome? (I think the only change is to remove the encoding arguments and make separate functions for bytes.)
This is 2.7/3.1 only, right? I'm looking at the bales of code I've got that say something like

    v = urllib.quote_plus(x.encode("UTF-8", "strict"))

then later on

    x = unicode(urllib.unquote_plus(v), "UTF-8", "strict")

Bill
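(For comparison - a sketch of what that idiom would reduce to under the patched py3k urllib.parse, assuming UTF-8 defaults:

    v = urllib.parse.quote_plus(x)    # str -> str; UTF-8 applied internally
    x = urllib.parse.unquote_plus(v)  # str -> str; UTF-8 decoded internally

The explicit encode/decode steps disappear.)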

Con: URI encoding does not encode characters.
OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. From RFC 3986, section 1.2.1 (http://tools.ietf.org/html/rfc3986#section-1.2.1):

Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.
So the string->string proposal is actually correct behaviour. I'm all in favour of a bytes->string version as well, just not with the names "quote" and "unquote". I'll prepare a new patch shortly which has bytes->string and string->bytes versions of the functions as well. (quote will accept either type, while unquote will output a str; there will be a new function unquote_to_bytes which outputs a bytes - is everyone happy with that?)

Guido says:
Actually, we'd need to look at the various other APIs in Py3k before we can decide whether these should be considered taking or returning bytes or text. It looks like all other APIs in the Py3k version of urllib treat URLs as text.
Yes, as I said in the bug tracker, I've groveled over the entire stdlib to see how my patch affects the behaviour of dependent code. Aside from a few minor bits which assumed octets (and did their own encoding/decoding) (which I fixed), all the code assumes strings and is very happy to go on assuming this, as long as the URIs are encoded with UTF-8, which they almost certainly are.

Guido says:
I think the only change is to remove the encoding arguments and ...
You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions? It seems like we may as well have the optional encoding argument, as it does no harm and could be of significant benefit. I'll post a patch with the unquote_to_bytes function, but leave the encoding arguments in until this point is clarified. Matt
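(For reference, the signatures under discussion look roughly like this - a sketch of the patch as described in this thread, not the final API; the defaults are the ones mentioned here:

    def quote(s, safe='/', encoding='utf-8', errors='strict'):
        ...  # str or bytes in, str out

    def unquote(s, encoding='utf-8', errors='replace'):
        ...  # str in, str out

    def unquote_to_bytes(s):
        ...  # str in, bytes out
)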

On Wed, Jul 30, 2008 at 8:49 PM, Matt Giuca <matt.giuca@gmail.com> wrote:
Con: URI encoding does not encode characters.
OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. From RFC 3986, section 1.2.1:
Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.
So the string->string proposal is actually correct behaviour. I'm all in favour of a bytes->string version as well, just not with the names "quote" and "unquote".
I'll prepare a new patch shortly which has bytes->string and string->bytes versions of the functions as well. (quote will accept either type, while unquote will output a str, there will be a new function unquote_to_bytes which outputs a bytes - is everyone happy with that?)
I'd rather have two pairs of functions, so that those who want to give the readers of their code a clue can do so. I'm not opposed to having redundant functions that accept either string or bytes though, unless others prefer not to.
Guido says:
Actually, we'd need to look at the various other APIs in Py3k before we can decide whether these should be considered taking or returning bytes or text. It looks like all other APIs in the Py3k version of urllib treat URLs as text.
Yes, as I said in the bug tracker, I've groveled over the entire stdlib to see how my patch affects the behaviour of dependent code. Aside from a few minor bits which assumed octets (and did their own encoding/decoding) (which I fixed), all the code assumes strings and is very happy to go on assuming this, as long as the URIs are encoded with UTF-8, which they almost certainly are.
Sorry, I have yet to look at the tracker (only so many minutes in a day...).
Guido says:
I think the only change is to remove the encoding arguments and ...
You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions? It seems like we may as well have the optional encoding argument, as it does no harm and could be of significant benefit. I'll post a patch with the unquote_to_bytes function, but leave the encoding arguments in until this point is clarified.
I don't mind an encoding argument, as long as it isn't used to change the return type (as Bill was proposing).

--Guido van Rossum (home page: http://www.python.org/~guido/)

Alright, I've uploaded the new patch which adds the two requested bytes-oriented functions, as well as accompanying docs and tests.

http://bugs.python.org/issue3300
http://bugs.python.org/file11009/parse.py.patch6

I'd rather have two pairs of functions, so that those who want to give the readers of their code a clue can do so. I'm not opposed to having redundant functions that accept either string or bytes though, unless others prefer not to.
Yes, I was in a similar mindset. So the way I've implemented it, quote accepts either a bytes or a str. Then there's a new function quote_from_bytes, which is defined precisely like this:

    quote_from_bytes = quote

So either name can be used on either input type, with the idea being that you should use quote on a str, and quote_from_bytes on a bytes. Is this a good idea or should it be rewritten so each function permits only one input type?

Sorry, I have yet to look at the tracker (only so many minutes in a day...).

Ah, I didn't mean offense. Just that one could read the sordid details of my investigation on the tracker if one so desired ;)

I don't mind an encoding argument, as long as it isn't used to change the return type (as Bill was proposing).

Yeah, my unquote always outputs a str, and unquote_to_bytes always outputs a bytes. Matt
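(Summarising the proposed behaviour - a hypothetical session, assuming UTF-8 defaults:

    quote('café')                  # str in   -> 'caf%C3%A9'
    quote(b'caf\xc3\xa9')          # bytes in -> 'caf%C3%A9'
    unquote('caf%C3%A9')           # -> 'café'
    unquote_to_bytes('caf%C3%A9')  # -> b'caf\xc3\xa9'
)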

quote_from_bytes = quote
So either name can be used on either input type, with the idea being that you should use quote on a str, and quote_from_bytes on a bytes. Is this a good idea or should it be rewritten so each function permits only one input type?
So you can use quote_from_bytes on strings? I assumed Guido meant it was okay to have quote accept string/byte input and have a function that was redundant but limited in what it accepted (i.e. quote_from_bytes accepts only bytes). I suppose your implementation doesn't break anything... it just strikes me as "odd".

So you can use quote_from_bytes on strings?
Yes, currently.
I assumed Guido meant it was okay to have quote accept string/byte input and have a function that was redundant but limited in what it accepted (i.e. quote_from_bytes accepts only bytes).
I suppose your implementation doesn't break anything... it just strikes me as "odd"
Yeah. I get exactly what you mean. Worse is it takes an encoding/replace argument. I'm in two minds about whether it should allow this or not. On one hand, it kind of goes with the Python philosophy of not artificially restricting the allowed types. And it avoids redundancy in the code. But I'd be quite happy to let quote_from_bytes restrict its input to just bytes, to avoid confusion. It would basically be a slightly-modified version of quote:

    def quote_from_bytes(s, safe='/'):
        if isinstance(safe, str):
            safe = safe.encode('ascii', 'ignore')
        cachekey = (safe, always_safe)
        if not isinstance(s, bytes) or isinstance(s, bytearray):
            raise TypeError("quote_from_bytes() expected a bytes")
        try:
            quoter = _safe_quoters[cachekey]
        except KeyError:
            quoter = Quoter(safe)
            _safe_quoters[cachekey] = quoter
        res = map(quoter, s)
        return ''.join(res)

(Passes test suite). I think I'm happier with this option. But the "if not isinstance(s, bytes) or isinstance(s, bytearray)" is not very nice. (The only difference to quote besides the missing arguments is the two lines beginning "if not isinstance". Maybe we can generalise the rest of the function).
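(Under this restrictive version, the intended behaviour would be - a sketch:

    quote_from_bytes(b'a b')  # -> 'a%20b'
    quote_from_bytes('a b')   # -> TypeError: quote_from_bytes() expected a bytes
)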

Has anyone had time to look at the patch for this issue? It got a lot of support about a week ago, but nobody has replied since then, and the patch still hasn't been assigned to anybody or given a priority. I hope I've complied with all the patch submission procedures. Please let me know if there is anything I can do to speed this along. Also I'd be interested in hearing anyone's opinion on the "quote_from_bytes" issue as raised in the previous email. I posted a suggested implementation of a more restrictive quote_from_bytes in that email, but I haven't included it in the patch yet. Matt Giuca

After the most recent flurry of discussion I've lost track of what's the right thing to do. I also believe it was said it should wait until 2.7/3.0, so there's no hurry (in fact there's no way to check it in -- we don't have branches for those versions yet).

On Tue, Aug 5, 2008 at 5:47 AM, Matt Giuca <matt.giuca@gmail.com> wrote:
Has anyone had time to look at the patch for this issue? It got a lot of support about a week ago, but nobody has replied since then, and the patch still hasn't been assigned to anybody or given a priority.
I hope I've complied with all the patch submission procedures. Please let me know if there is anything I can do to speed this along.
Also I'd be interested in hearing anyone's opinion on the "quote_from_bytes" issue as raised in the previous email. I posted a suggested implementation of a more restrictive quote_from_bytes in that email, but I haven't included it in the patch yet.
--Guido van Rossum (home page: http://www.python.org/~guido/)

After the most recent flurry of discussion I've lost track of what's the right thing to do. I also believe it was said it should wait until 2.7/3.0, so there's no hurry (in fact there's no way to check it in -- we don't have branches for those versions yet).
I assume you mean 2.7/3.1. I've always been concerned with the suggestion that this wait till 3.1. I figure this patch is going to change the documented behaviour of these functions, so it might be unacceptable to change it after 3.0 is released. It seems logical that this patch be part of the "incompatible-for-the-sake-of-fixing-things" set of changes in 3.0.

The current behaviour is broken. Any code which uses quote to produce a URL, then unquotes the same URL later will simply break for characters outside the Latin-1 range. This is evident in the SimpleHTTPServer class as I said above (which presents users with URLs for the files in a directory using quote, then gives 404 when they click on them, because unquote can't handle it). And it will break any user's code which also assumes unquote is the inverse of quote.

We could hack a fix into SimpleHTTPServer and expect other users to do the same (along the lines of .encode('utf-8').decode('latin-1')), but then those hacks will break when we apply the patch in 3.1 because they abuse Unicode strings, and we'll have to have another debate about how to be backwards compatible with them. (The patched version is largely compatible with the 2.x version, but the unpatched version isn't compatible with either the 2.x version or the patched version).

Surely the sane option is to get this UTF-8 patch into version 3.0 so we don't have to support this bug into the future? I'm far less concerned about the decision with regards to unquote_to_bytes/quote_from_bytes, as those are new features which can wait. Matt Giuca
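(For the record, the hack referred to above works by re-encoding the text so that each UTF-8 octet becomes a Latin-1 code point the old functions can handle - an illustrative sketch, not a recommendation:

    # pre-patch workaround: smuggle UTF-8 octets through a Latin-1 str
    pseudo = 'café'.encode('utf-8').decode('latin-1')  # 'caf\xc3\xa9'
    url = quote(pseudo)                                # 'caf%C3%A9'

This is exactly the kind of abuse of Unicode strings that would break once quote/unquote do the UTF-8 conversion themselves.)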

* Matt Giuca wrote:
I'm far less concerned about the decision with regards to unquote_to_bytes/quote_from_bytes, as those are new features which can wait.
Forgive me, but those are the *old* features, which must be there.
This whole discussion circles too much, I think. Maybe it should be pepped? nd

This whole discussion circles too much, I think. Maybe it should be pepped?
The issue isn't circular. It's been patched and tested, then a whole lot of people agreed, including Guido. Then you and Bill wanted the bytes functionality back. So I wrote that in there too, and Bill at least said that was sufficient.

On Thu, Jul 31, 2008, Bill Janssen wrote:
But: OK, OK, I yield. Though I still think this is a bad idea, I'll shut up if we can also add "unquote_as_bytes" which returns a byte sequence instead of a string. I'll just change my code to use that.
We've reached, to quote Guido, "as close to consensus as we can get on this issue". There is a bug in Python. I've proposed a working fix, and nobody else has. Guido okayed it. I made all the changes the community suggested. What more needs to be discussed here?

* Matt Giuca wrote:
This whole discussion circles too much, I think. Maybe it should be pepped?
The issue isn't circular. It's been patched and tested, then a whole lot of people agreed including Guido. Then you and Bill wanted the bytes functionality back. So I wrote that in there too, and Bill at least said that was sufficient.
On Thu, Jul 31, 2008, Bill Janssen wrote:
But: OK, OK, I yield. Though I still think this is a bad idea, I'll shut up if we can also add "unquote_as_bytes" which returns a byte sequence instead of a string. I'll just change my code to use that.
We've reached, to quote Guido, "as close to consensus as we can get on this issue".
There are a lot of quotes around. Including "After the most recent flurry of discussion I've lost track of what's the right thing to do." But I don't talk for other people.
There is a bug in Python. I've proposed a working fix, and nobody else has.
Well, you proposed a patch ;-) It may fix things, but it will break a lot. While this was denied over and over again, it's still gonna happen, because the axioms still don't account for reality.
I made all the changes the community suggested.
I don't think so.
What more needs to be discussed here?
Huh? You feel the discussion is over? Then why are there still open questions? I admit, a lot of discussion is triggered by the assessments you're stating in your posts. Don't take it as a personal offense, it's a simple observation. A lot of statements were made, and nobody even bothered to substantiate them. A PEP could fix that. But it's a lost issue now. Nobody comes up with an alternative (for various reasons, I suppose). So go ahead, EOD from my side. nd

André Malo wrote:
* Matt Giuca wrote:
We've reached, to quote Guido, "as close to consensus as we can get on this issue".
There are a lot of quotes around. Including "After the most recent flurry of discussion I've lost track of what's the right thing to do." But I don't talk for other people.
Let's not play the who-can-find-the-best-quote-to-make-their-point game. Yes, there was a lot of discussion, and yes, it would be difficult to follow if it wasn't something you were paying much attention to. I believe (as someone who didn't even participate in the discussion) that it was clear that:

* quote/unquote should "just work" for strings in 3.x. Meaning that quote should be str->str and unquote str->str, because *most* uses of quote/unquote are naive. There is a quite clear proclamation from the BDFL that quote/unquote should not be bound by pedantic readings of the RFCs.

* an alternative set of functions that support bytes->str and str->bytes should be added to support other use-cases that are less naive. Matt added unquote_to_bytes/quote_from_bytes to fill this gap. Bill agreed it was sufficient to satisfy his use-cases (stipulating that they are a necessary addition if any change should be made at all).
There is a bug in Python. I've proposed a working fix, and nobody else has.
Well, you proposed a patch ;-) It may fix things, but it will break a lot. While this was denied over and over again, it's still gonna happen, because the axioms still don't account for reality.
I've not read anyone other than Bill come forward saying they had a lot of code that uses quote/unquote that will be broken. Matt has gone through the stdlib and found most uses consistent with the "quote/unquote users are naive" statement. I would suggest that the onus is on you to substantiate this claim that "it will break a lot."
I made all the changes the community suggested.
I don't think so.
This short reply is useless. What are those changes? If your problem is that *your* suggestions were dropped, then I remind you that they *were discussed*. And, Matt correctly pointed out that setting the defaults to encoding in latin-1 and decoding in utf-8 would be a nightmare. It would be much more sane to pick one encoding for both. However, which encoding it should be is an arguable point. The apparent consensus was that most people didn't care as long as they could override it.
What more needs to be discussed here?
Huh? You feel the discussion is over?
Can we please avoid discussions about discussion? Arguing about arguing does not benefit this discussion. If you have a problem with his proposed patch, then please elaborate on that rather than /just/ complain that it is unsatisfactory in some way. Do you agree there is a bug? Do you agree it needs to be solved? And, what about the proposed solution is unsatisfactory?

-Scott

--
Scott Dial
scott@scottdial.com
scodial@cs.indiana.edu

I suggest we continue this discussion, if at all, on the bug-tracker, where there's code, and more participants. http://bugs.python.org/issue3300 I've now posted my idea of how quote/unquote should work in py3K, there. Bill

There are a lot of quotes around. Including "After the most recent flurry of discussion I've lost track of what's the right thing to do." But I don't talk for other people.
OK .. let me compose myself a little. Sorry I went ahead and assumed this was closed. It's just frustrating to me that I've now spent a month trying to push this through, and while it seems everybody has an opinion, nobody seems to have bothered trying my code. (I've even implemented your suggestions and posted my feedback, and nobody replied to that). Nobody's been assigned to look at it and it hasn't been given a priority, even though we all agree it's a bug (though we disagree on how to fix it).
There is a bug in Python. I've proposed a working fix, and nobody else has.
Well, you proposed a patch ;-) It may fix things, but it will break a lot. While this was denied over and over again, it's still gonna happen, because the axioms still don't account for reality.
Well all you're getting from me is "it works". And all I'm getting from you is "it might not". Please .. I've been asking for weeks now for someone to review the patch. I've already spent hours (like ... days worth of hours) testing this patch against the whole library. I've written reams of reports on the tracker to try and convince people it works. There isn't any more *I* can do. If you think it's going to break code, go ahead and try it out. The claims I am making are based on my experience working with a) Python 2, b) Python 3 as it stands, c) Python 3 with my patch, and d) Python 3 with quote/unquote using bytes. In my experience, (c) is the only version of Python 3 which works properly.
I made all the changes the community suggested.
I don't think so.
?
What more needs to be discussed here?
Huh? You feel the discussion is over? Then why are there still open questions? I admit, a lot of discussion is triggered by the assessments you're stating in your posts. Don't take it as a personal offense, it's a simple observation. A lot of statements were made, and nobody even bothered to substantiate them.
If you read the bug tracker (http://bugs.python.org/issue3300) all the way to the beginning, you'll see I use a good many examples, and I also went through the entire standard library (http://bugs.python.org/msg69591) to try and substantiate my claims. (Admittedly, I didn't finish investigating the request module, but that shouldn't prevent the patch from being reviewed). As I've said all along, yes, it will break code, but then *all solutions possible* will break code, including leaving it in. Mine *seems* to break the least existing code. If there is ever a time to break code, Python 3.0 is it.
A PEP could fix that.
I could write a PEP. But as you've read above, I'm concerned this won't get into Python 3.0, and then we'll be locked into the existing functionality and it'll never get accepted; hence I'd rather this be resolved as quickly as possible. If you think it's worth writing a PEP, let's do it. Apologies again for my antagonistic reply earlier. Not trying to step on toes, just get stuff done. Matt

Nobody's been assigned to look at it and it hasn't been given a priority, even though we all agree it's a bug (though we disagree on how to fix it).
This I can explain (I think).

Nobody is assigned to look: we usually don't do assignments of bugs or patches, except when there is a specific maintainer for the code in question. urllib has no maintainer.

It hasn't been given priority: There are currently 606 patches in the tracker, many fixing bugs of some sort. It's not clear (to me, at least) why this should be given priority over all the other things such as interpreter crashes.

We all agree it's a bug: no, I don't. I think it's a missing feature, at best, but I'm staying out of the discussion. As-is, urllib only supports ASCII in URLs, and that is fine for most purposes. URLs are just not made for non-ASCII characters. Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.

As it stands, a committer would have
- to agree it's an important problem
- to agree the patch is correct
- to judge it is not a new feature, as we are in beta already
- to implicitly accept maintenance of that change, and take all the blame that it might produce in the coming years

Regards, Martin

Martin v. Löwis <martin@v.loewis.de> writes:
URLs are just not made for non-ASCII characters.
Perhaps they are not, but every non-English wiki (just to take a simple, generic example) potentially contains non-ASCII URLs, e.g.

http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
http://wiki.python.org/moin/J%C3%BCrgenHermann

(notice the UTF-8 encoding in both)
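(With a UTF-8 default, both decode as expected - a sketch assuming the patched unquote:

    unquote('%C3%89l%C3%A9phant')  # -> 'Éléphant'
    unquote('J%C3%BCrgenHermann')  # -> 'JürgenHermann'
)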
Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.
I think most people would expect something which works with the current World Wide Web rather than a rigorous implementation of a specific RFC. Implementing RFCs is fine but it does not magically eliminate all problems, especially when the RFCs themselves are not in sync with real-world usage. Regards Antoine.

Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.
I think most people would expect something which works with the current World Wide Web rather than a rigorous implementation of a specific RFC. Implementing RFCs is fine but it does not magically eliminate all problems, especially when the RFCs themselves are not in sync with real-world usage.
Why do you think implementing IRIs would not work with the current World Wide Web? IRIs are *made* for the current World Wide Web. Applications desiring to work in the current World Wide Web would then, of course, need to use the IRI library. Regards, Martin

On 2008-08-06 18:55, Antoine Pitrou wrote:
Martin v. Löwis <martin@v.loewis.de> writes:
URLs are just not made for non-ASCII characters.
Perhaps they are not, but every non-English wiki (just to take a simple, generic example) potentially contains non-ASCII URLs. e.g. http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant http://wiki.python.org/moin/J%C3%BCrgenHermann (notice the utf-8 encoding in both)
Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.
I think most people would expect something which works with the current World Wide Web rather than a rigorous implementation of a specific RFC. Implementing RFCs is fine but it does not magically eliminate all problems, especially when the RFCs themselves are not in sync with real-world usage.
+1. Practicality beats purity...

The web is moving towards UTF-8 as the standard Unicode encoding, so it's probably wise to follow that approach for quote(). http://en.wikipedia.org/wiki/Percent-encoding

The other way around will also have to deal with old-style URLs which typically still use the Latin-1 encoding which was the basis for HTML: http://www.w3schools.com/TAGS/ref_urlencode.asp

So unquote() should probably try to decode using UTF-8 first and then fall back to Latin-1 if that doesn't work.

Whether the result of quote()/unquote() should be bytes or Unicode is a different story and probably also depends on what the application does with the result. I don't think there's a good general answer for that one, except maybe just going for one possible combination and documenting that.

-- Marc-Andre Lemburg, eGenix.com
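(A minimal sketch of that fallback idea, assuming the unquote_to_bytes primitive from Matt's patch; since Latin-1 can decode any byte sequence, the fallback always succeeds:

    def unquote_with_fallback(s):
        # percent-decode to raw octets first
        raw = unquote_to_bytes(s)
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # old-style URL: reinterpret the octets as Latin-1
            return raw.decode('latin-1')
)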

On Wed, Aug 6, 2008 at 9:09 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Nobody's been assigned to look at it and it hasn't been given a priority, even though we all agree it's a bug (though we disagree on how to fix it).
This I can explain (I think). Nobody is assigned to look: we usually don't do assignments of bugs or patches, except when there is a specific maintainer for the code in question. urllib has no maintainer.
I'm somehow strangely attracted to this issue, and have posted a bit of a code review.
It hasn't been given priority: There are currently 606 patches in the tracker, many fixing bugs of some sort. It's not clear (to me, at least) why this should be given priority over all the other things such as interpreter crashes.
Well, it's an API change, and those are impossible after beta3 is released (or at least have to wait for 3.1).
We all agree it's a bug: no, I don't. I think it's a missing feature, at best, but I'm staying out of the discussion. As-is, urllib only supports ASCII in URLs, and that is fine for most purposes. URLs are just not made for non-ASCII characters. Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.
The wikipedia use of urlencoded UTF-8 (and other examples) suggest that whether we like it or not the trend is towards this. I'd like to support such usage rather than fight it (standards or no standards).
As it stands, a committer would have
- to agree it's an important problem
- to agree the patch is correct
- to judge it is not a new feature, as we are in beta already
- to implicitly accept maintenance of that change, and take all the blame that it might produce in the coming years
It could be a small enough new feature. Also note that for urls that only use ASCII there is no behavior change. (And certainly this *is* an important use case, e.g. to encode occurrences of + or & in query strings).

--Guido van Rossum (home page: http://www.python.org/~guido/)
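(E.g. the ASCII-only case Guido mentions - quote_plus must escape the metacharacters even when no non-ASCII text is involved. A sketch:

    quote_plus('a&b + c')  # -> 'a%26b+%2B+c'
)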

Wow .. a lot of replies today!

On Thu, Aug 7, 2008 at 2:09 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
It hasn't been given priority: There are currently 606 patches in the tracker, many fixing bugs of some sort. It's not clear (to me, at least) why this should be given priority over all the other things such as interpreter crashes.
Sorry ... when I said "it hasn't been given priority" I mean "it hasn't been given *a* priority" - as in, nobody's assigned a priority to it, whatever that priority should rightfully be.
We all agree it's a bug: no, I don't. I think it's a missing feature, at best, but I'm staying out of the discussion. As-is, urllib only supports ASCII in URLs, and that is fine for most purposes.
Seriously, Mr. L%C3%B6wis, that's a tremendously na%C3%AFve statement.
URLs are just not made for non-ASCII characters. Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.
Python 3.0 fully supports Unicode. URIs support *encoding* of arbitrary characters (as of more recent revisions). The difference is that URIs *may only consist* of ASCII characters (even though they can encode Unicode characters), while IRIs may also consist of Unicode characters. It's our responsibility to implement URIs here ... IRIs are a separate issue. Having said this, I'm pretty sure Martin can't be convinced, so I'll leave that alone.
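(To make the distinction concrete, borrowing Antoine's earlier example - the IRI may contain the characters themselves, while the equivalent URI is pure ASCII with the UTF-8 octets percent-encoded:

    iri = 'http://fr.wikipedia.org/wiki/Éléphant'
    uri = 'http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant'
)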
On Thu, Aug 7, 2008 at 3:34 AM, M.-A. Lemburg <mal@egenix.com> wrote:

So unquote() should probably try to decode using UTF-8 first and then fall back to Latin-1 if that doesn't work.

That's an interesting proposal. I think I don't like it - for a user application that's a good policy. But for a programming language library, I think it should not do guesswork. It should use the encoding supplied, and have a single default. But I'd be interested to hear if anyone else wants this. As-is, it passes 'replace' to the errors argument, so encoding errors get replaced by '�' characters.

OK I haven't looked at the review yet .. guess it's off to the tracker :) Matt
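(Concretely - a sketch of that 'replace' behaviour on octets which aren't valid UTF-8:

    unquote('caf%FF')  # b'caf\xff' is not valid UTF-8 -> 'caf\ufffd' ('caf�')
)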

FWIW, the rest of this discussion is now happening in the tracker: http://bugs.python.org/issue3300. We could really use some feedback from Python users in Asian countries.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Matt Giuca writes:
OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded.
In other words, it's an encoding for binary data, since the octet sequences that might be encountered are completely unrestricted. I have to side with Bill on this. URIs are sequences of characters, but the character set used must contain the ASCII repertoire as a subset: the URI delimiters must be mapped to the corresponding ASCII codes, and the rest of the set must be represented as sequences of octets (which need not even be constant; you could gzip them first for all URI-encoding cares). URI-encoding itself is a purely mechanical process which transforms reserved octets (not used as delimiters) to percent codes.
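(That mechanical process is small enough to sketch in a few lines - a hypothetical helper, bytes in, ASCII string out:

    def percent_encode(octets, safe=b''):
        # RFC 3986 unreserved characters pass through untouched
        unreserved = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                      b'abcdefghijklmnopqrstuvwxyz'
                      b'0123456789-._~')
        # every other octet becomes %XX
        return ''.join(chr(b) if b in unreserved + safe else '%%%02X' % b
                       for b in octets)

    percent_encode('naïve'.encode('utf-8'), safe=b'/')  # -> 'na%C3%AFve'
)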
From RFC 3986, section 1.2.1 (http://tools.ietf.org/html/rfc3986#section-1.2.1):

Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.
This is kinda perverted, but suppose you have bytes which are actually a Japanese string represented in packed EUC-JP. AFAICS the paragraph above does *not* say you can't transcode to UTF-8 before percent-encoding, and in fact you might be required to by the definition of the scheme.
So the string->string proposal is actually correct behaviour.
Ye-e-es, but. What the RFC clearly envisions is not that the percent-encoder will be handed an unencoded string that looks like a URI, but rather a sequence of octets representing one component (scheme, authority, path, query, etc) of a URI. In other words, a string->string URI encoder should only be called by an URI builder, and never with a precomposed URI-like string. Something like

    def URIBuilder(strings):
        """Return an URI built from a list of strings.

        The first string *must* be the scheme. If the URI follows the
        generic URI syntax of RFC 3986, the remaining components should
        be given in the order authority, path, fragment, query part
        [, query part ...]."""

        def uriencode(s):
            """URI encode a string per RFC 3986 Section 3."""
            # We all know what this does.

        if strings[0] == "http":
            # HTTP scheme, delimiters and authority
            uri = "http://" + uriencode(strings[1]) + "/"
            # path, if present
            if strings[2]:
                uri = uri + uriencode(strings[2])
            # query, if present
            if strings[4]:
                uri = uri + "?" + uriencode(strings[4])
                # further query parameters, if present
                for s in strings[5:]:
                    uri = uri + ";" + uriencode(s)
            # fragment, if present
            if strings[3]:
                uri = uri + "#" + uriencode(strings[3])
        elif strings[0] == "mailto":
            uri = "mailto:" + uriencode(strings[1])
        # etc etc
        return uri

I think you'd have a much easier time enforcing this pedantically correct usage with a bytes->bytes encoder. Of course, it's un-Pythonic to enforce pedantry, and we pedants can use a string->string encoder correctly.
You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions?
A quoting function that accepts bytes *must* have an encoding argument. There's no point to passing the quoter bytes unless the text is represented in a non-Unicode encoding.

Of course, it's un-Pythonic to enforce pedantry, and we pedants can use a string->string encoder correctly.
Sure. All I was asking was that we not break the existing usage of the standard library "unquote" by producing a string by *assuming* a UTF-8 encoded string is what's in those percent-encoded bytes (instead of, say, ISO 2022-JP). Let the "new" function produce a string: "unquote_as_string".
You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions?
A quoting function that accepts bytes *must* have an encoding argument.
Huh? What would it use it for? The string, if string it is, is already encoded as octets. All it needs to do is percent-encode the reserved octets. So far as I can see, the "unquote_as_string" is the function that needs the encoding. Ah, it's too late, I'll pick this up tomorrow. Bill

Bill Janssen writes:
A quoting function that accepts bytes *must* have an encoding argument.
Huh? What would it use it for?
Ah, you're right. I was thinking in terms of an URI builder, where the quoter would do any required conversion (eg, if the bytes represented a string in Japanese) to another (possibly scheme-mandated) encoding (typically UTF-8). But that doesn't really make sense; the URI builder should know what to do, and that's a better place to do such conversions.

Also see <http://en.wikipedia.org/wiki/Percent-encoding>. Bill

Guido says:
Actually, we'd need to look at the various other APIs in Py3k before we can decide whether these should be considered taking or returning bytes or text. It looks like all other APIs in the Py3k version of urllib treat URLs as text.
Yes, as I said in the bug tracker, I've groveled over the entire stdlib to see how my patch affects the behaviour of dependent code. Aside from a few minor bits which assumed octets (and did their own encoding/decoding) (which I fixed), all the code assumes strings and is very happy to go on assuming this, as long as the URIs are encoded with UTF-8, which they almost certainly are.
I'm not sure that's sufficient review, though I agree it's necessary. The major consumers of quote/unquote are not in the Python standard library.
(quote will accept either type, while unquote will output a str, there will be a new function unquote_to_bytes which outputs a bytes - is everyone happy with that?)
No, so don't ask. Bill
Participants: "Martin v. Löwis", André Malo, Antoine Pitrou, Bill Janssen, Guido van Rossum, Jeff Hall, M.-A. Lemburg, Matt Giuca, Scott Dial, Stephen J. Turnbull