Thanks for all the replies, and making me feel welcome :)<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">
<br>
</div>If what you are saying is true, then it can probably go in as a bug<br>
fix (unless someone else knows something about Latin-1 on the Net that<br>
makes this not true).<br>
<div class="Ih2E3d"></div></blockquote><div><br>Well, from what I've seen, the only time Latin-1 naturally appears on the net is when you have a web page in Latin-1 (either explicit or inferred; note that a browser like Firefox will infer Latin-1 if it sees only ASCII characters) with a form in it. When the form is submitted, the browser uses Latin-1 to percent-encode the query string.<br>
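<br>As a rough illustration, the same character percent-encodes differently under the two charsets. This is only a sketch: it assumes urllib.parse as it exists in current Python 3 (not necessarily the API in the patch under discussion), percent_encode is a hypothetical helper name, and a real form submission would also turn spaces into "+", which is ignored here.<br>

```python
from urllib.parse import quote, unquote


def percent_encode(text: str, charset: str) -> str:
    """Hypothetical helper: percent-encode text using the given charset."""
    # safe="" so that every reserved character is escaped as well.
    return quote(text, safe="", encoding=charset)


# What a Latin-1 page would send versus what a UTF-8 page would send.
latin1_form = percent_encode("café", "latin-1")
utf8_form = percent_encode("café", "utf-8")

print(latin1_form)  # caf%E9
print(utf8_form)    # caf%C3%A9

# Decoding with the wrong charset silently produces mojibake.
print(unquote(utf8_form, encoding="latin-1"))  # cafÃ©
```

<br>The receiving side cannot tell from the string alone which of the two charsets was used, which is why the encoding has to be known or chosen by the application.<br>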
<br>So if you write a web app and you don't have any non-ASCII characters or mention the charset, chances are you'll get Latin-1. But I would argue you're leaving things to chance and you deserve to get funny behaviour. If you do any of the following:<br>
<ul><li>Use a non-ASCII character, encoded as UTF-8 on the page.</li><li>Send a Content-Type: xxxx; charset=utf-8 header.</li><li>In HTML, set a &lt;meta http-equiv="Content-Type" content="xxxx; charset=utf-8" /&gt;.</li><li>In the form itself, set &lt;form accept-charset="utf-8"&gt;.</li>
</ul>then the browser will encode the form data as UTF-8. And most "proper" web pages should get themselves explicitly served as UTF-8.<br><br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
That I can't say I can necessarily do; have my own bug reports to<br>
work through this weekend. =)</blockquote><div><br>OK well I'm busy for the next few days; after that I can do a patch trade with someone. (That is if I am allowed to do reviews; not sure since I don't have developer privileges).<br>
</div></div><br><br><div class="gmail_quote">On Sun, Jul 13, 2008 at 5:58 AM, Mark Hammond <<a href="mailto:mhammond@skippinet.com.au">mhammond@skippinet.com.au</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="Ih2E3d">> My first post to the list. In fact, first time Python hacker,<br>
> long-time Python user though. (Melbourne, Australia).<br>
<br>
</div>Cool - where exactly? I'm in Wantirna (although not at this very moment -<br>
I'm in Lithuania, but home again in a couple of days)</blockquote><div><br>Cool :) Balwyn.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
* Please take Martin with a grain of salt (I would say "ignore him", but<br>
that is too strong ;)</blockquote><div><br>Lol, he is a hard man to please, but he's given some good feedback.<br></div></div><br><br><div class="gmail_quote">On Sun, Jul 13, 2008 at 7:07 AM, Bill Janssen <<a href="mailto:janssen@parc.com">janssen@parc.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><br>
</div>The standard here is RFC 3986, from Jan 2005, which says,<br>
<br>
``When a new URI scheme defines a component that represents textual<br>
data consisting of characters from the Universal Character Set [UCS],<br>
the data should first be encoded as octets according to the UTF-8<br>
character encoding [STD63]; then only those octets that do not<br>
correspond to characters in the unreserved set should be<br>
percent-encoded.''</blockquote><div><br>Ah yes, I was originally hung up on the idea that "URLs had to be encoded in UTF-8", till Martin pointed out that it only says "new URI scheme" there. It's perfectly valid to have non-UTF-8-encoded URIs. However in practice they're almost always UTF-8. So I think introducing the new encoding argument and having it default to "utf-8" is quite reasonable.<br>
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I'd say, treat the incoming data as either Unicode (if it's a Unicode<br>
string), or some unknown superset of ASCII (which includes both<br>
Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown<br>
encoding), and apply the appropriate transformation.<br>
<font color="#888888"></font></blockquote></div><br>Ah, there may be some confusion here. We're only dealing with str->str transformations (which in Python 3 means Unicode strings). You can't pass bytes in to, or get bytes out of, either of these functions. I suggested "quote_raw" and "unquote_raw" functions which would let you do this.<br>
<br>The issue is with the percent-encoded characters in the URI string, which must be interpreted as bytes, not code points. How then do you convert these into a Unicode string? (Python 2 did not have this problem, since you simply output a byte string without caring about the encoding).<br>
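<br>To make the bytes-versus-code-points point concrete, here is a small sketch. It assumes the bytes-level helpers quote_from_bytes and unquote_to_bytes as they exist in current Python 3's urllib.parse; these stand in for the "quote_raw" / "unquote_raw" idea and are not necessarily what the patch under discussion provides.<br>

```python
from urllib.parse import quote_from_bytes, unquote, unquote_to_bytes

uri_component = "na%C3%AFve"

# Step 1: the percent-escapes denote bytes, not characters.
raw = unquote_to_bytes(uri_component)
print(raw)  # b'na\xc3\xafve'

# Step 2: turning those bytes into a str requires choosing an encoding.
as_utf8 = raw.decode("utf-8")      # 'naïve'
as_latin1 = raw.decode("latin-1")  # 'naÃ¯ve' (two characters for the two bytes)

# The str -> str function performs both steps, given an explicit encoding.
same_as_utf8 = unquote(uri_component, encoding="utf-8")

# Going back from raw bytes needs no encoding at all.
round_trip = quote_from_bytes(raw)
```

<br>Only the second step depends on an encoding, which is why a str-returning unquote needs either an encoding argument or a fixed default.<br>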
<br><div class="gmail_quote">On Sun, Jul 13, 2008 at 9:10 AM, "Martin v. Löwis" <<a href="mailto:martin@v.loewis.de">martin@v.loewis.de</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="Ih2E3d">> Very nice, I had this somewhere on my todo list to work on. I'm very much<br>
> in favour, especially since it synchronizes us with the RFCs (for all I<br>
> remember reading about it last time).<br>
<br>
</div>I still think that it doesn't. The RFCs haven't changed, and can't<br>
change for compatibility reasons. The encoding of non-ASCII characters<br>
in URLs remains as underspecified as it always was.</blockquote><div><br>Correct. But my patch brings us in line with that underspecification. The unpatched version forces you to use Latin-1; my patch lets you specify the encoding to use.<br>
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Now, with IRIs, the situation is different, but I don't think the patch<br>
claims to implement IRIs (and if so, it perhaps shouldn't change URL<br>
processing in doing so).</blockquote><div><br>True. I don't claim to have implemented IRIs, or even to know enough about them to do that. I'll read up on these things in the next few days.<br><br>However, this is a URI library, not an IRI library. From what I've seen, it's percent-encoded URIs coming in from the browser, not IRIs. With this patch we just need to make sure that IRIs don't become less supported than they were before; we don't need to support them explicitly.<br>
<br>Cheers,<br>Matt Giuca<br></div></div>