[Python-Dev] bytes / unicode

Tue Jun 22 19:03:29 CEST 2010

On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull <stephen at xemacs.org>wrote:

> Toshio Kuratomi writes:
>
>  > I'll definitely buy that.  Would urljoin(b_base, b_subdir) => bytes and
>  > urljoin(u_base, u_subdir) => unicode be acceptable though?
>
> Probably.
>
> But it doesn't matter what I say, since Guido has defined that as
> "polymorphism" and approved it in principle.
>
>  > (I think, given other options, I'd rather see two separate
>  > functions, though.
>
> Yes.
>
>  > If you want to deal with things like this::
>  >   http://host/café <http://host/caf%C3%A9>
>
> Yes.
>

Just for perspective, I don't know if I've ever wanted to deal with a URL
like that.  I know how it is supposed to work, and I know what a browser
does with that, but so many tools will clean that URL up *or* won't be able
to deal with it at all that it's not something I'll be passing around.  So
from a practical point of view this really doesn't come up, and if it did it
would be in a situation where you could easily do something ad hoc (though
there is not currently a routine to quote unsafe characters in a URL... that
would be helpful, though maybe urllib.quote(url.encode('utf8'), '%/:') would
do it).

Also while it is problematic to treat the URL-unquoted value as text
(because it has an unknown encoding, no encoding, or regularly a mixture of
encodings), the URL-quoted value is pretty easy to pass around, and
normalization (in this case to http://host/caf%C3%A9) is generally fine.

While it's nice to be correct about encodings, sometimes it is impractical.
And it is far nicer to avoid the situation entirely.  That is, decoding
content you don't care about isn't just inefficient, it's complicated and
can introduce errors.  The encoding of the underlying bytes of a %-decoded
URL is largely uninteresting.  Browsers (whose behavior drives a lot of
convention) don't touch any of that encoding except lately occasionally to
*display* some data in a more friendly way.  But it's only display, and
errors just make it revert to the old encoded display.

Similarly I'd expect (from experience) that a programmer using Python to
want to take the same approach, sticking with unencoded data in nearly all
situations.

-- 
Ian Bicking  |  http://blog.ianbicking.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20100622/ebd3c163/attachment.html>