[Python-Dev] urllib.quote and unquote - Unicode issues

Matt Giuca matt.giuca at gmail.com
Thu Aug 7 12:37:39 CEST 2008

Wow .. a lot of replies today!

On Thu, Aug 7, 2008 at 2:09 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> It hasn't been given priority: There are currently 606 patches in the
> tracker, many fixing bugs of some sort. It's not clear (to me, at least)
> why this should be given priority over all the other things such as
> interpreter crashes.

Sorry ... when I said "it hasn't been given priority" I meant "it hasn't been
given *a* priority" - as in, nobody's assigned a priority to it, whatever
that priority should rightfully be.

> We all agree it's a bug: no, I don't. I think it's a missing feature,
> at best, but I'm staying out of the discussion. As-is, urllib only
> supports ASCII in URLs, and that is fine for most purposes.

Seriously, Mr. L%C3%B6wis, that's a tremendously na%C3%AFve statement.

> URLs are just not made for non-ASCII characters. Implement IRIs if you
> want non-ASCII characters; the rules are much clearer for these.

Python 3.0 fully supports Unicode. URIs support *encoding* of arbitrary
characters (as of more recent revisions). The difference is that URIs *may
only consist* of ASCII characters (even though they can encode Unicode
characters), while IRIs may also consist of Unicode characters. It's our
responsibility to implement URIs here ... IRIs are a separate issue.
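
To make the URI side of that concrete, here is a minimal sketch. It assumes
the urllib.parse module as it exists in current Python 3 (UTF-8 as the default
encoding), which may differ in detail from the patch under discussion. The
variable names are only for illustration.

```python
from urllib.parse import quote, unquote

name = "Löwis"

# Percent-encode the UTF-8 bytes of each non-ASCII character.
encoded = quote(name)
print(encoded)                      # L%C3%B6wis

# The encoded form consists of ASCII characters only, so it is a valid
# URI component; this raises UnicodeEncodeError if that were not true.
encoded.encode("ascii")

# Decoding with the same encoding round-trips to the original text.
print(unquote(encoded))             # Löwis

# A different encoding gives a different, equally ASCII-only, result.
print(quote(name, encoding="latin-1"))   # L%F6wis
```

The point is that the non-ASCII text never appears in the URI itself; only
its percent-encoded bytes do, and the choice of encoding decides which bytes.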

Having said this, I'm pretty sure Martin can't be convinced, so I'll leave
that alone.

On Thu, Aug 7, 2008 at 3:34 AM, M.-A. Lemburg <mal at egenix.com> wrote:

> So unquote() should probably try to decode using UTF-8 first
> and then fall back to Latin-1 if that doesn't work.

That's an interesting proposal, but I don't think I like it. For a user
application that's a good policy, but I think a programming language library
should not do guesswork. It should use the encoding supplied, and
have a single default. But I'd be interested to hear if anyone else wants

As-is, it passes 'replace' to the errors argument, so encoding errors get
replaced by '�' characters.
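
For comparison, here is a rough sketch of the two behaviours. It assumes
current Python 3's urllib.parse, where unquote() defaults to
encoding='utf-8' and errors='replace'. The helper unquote_with_fallback is
hypothetical, written only to illustrate the UTF-8-then-Latin-1 idea; it is
not part of urllib.

```python
from urllib.parse import unquote, unquote_to_bytes

raw = "L%F6wis"   # 'Löwis' percent-encoded as Latin-1; not valid UTF-8

# Single default with errors='replace': the invalid byte becomes U+FFFD.
replaced = unquote(raw)
print(repr(replaced))               # 'L\ufffdwis'

# Supplying the encoding explicitly decodes it correctly, no guessing.
explicit = unquote(raw, encoding="latin-1")
print(explicit)                     # Löwis


def unquote_with_fallback(s):
    """Hypothetical helper: try strict UTF-8, then fall back to Latin-1."""
    data = unquote_to_bytes(s)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return data.decode("latin-1")


print(unquote_with_fallback("L%F6wis"))      # Löwis (via the fallback)
print(unquote_with_fallback("L%C3%B6wis"))   # Löwis (valid UTF-8)
```

Because the Latin-1 step accepts any byte sequence, the fallback version
never reports an error, which is the guesswork I'd rather keep out of the
library.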

OK I haven't looked at the review yet .. guess it's off to the tracker :)

