[Python-Dev] bytes / unicode

Tue Jun 22 13:31:13 CEST 2010

Toshio Kuratomi writes:

 > I'll definitely buy that.  Would urljoin(b_base, b_subdir) => bytes and
 > urljoin(u_base, u_subdir) => unicode be acceptable though?

Probably.  
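
As a rough sketch of the behavior being asked about, assuming a Python 3
whose urllib.parse.urljoin accepts both str and bytes (the host and path
names below are made up for illustration):

```python
from urllib.parse import urljoin

# Hypothetical base and subdirectory, in both text and bytes form.
u_joined = urljoin("http://host/base/", "subdir")
b_joined = urljoin(b"http://host/base/", b"subdir")

# Mixing the two types is rejected rather than guessed at.
try:
    urljoin("http://host/base/", b"subdir")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The result type follows the argument type, and nothing is silently
coerced between the two.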

But it doesn't matter what I say, since Guido has defined that as
"polymorphism" and approved it in principle.

 > (I think, given other options, I'd rather see two separate
 > functions, though.)

Yes.

 > If you want to deal with things like this::
 >   http://host/café

Yes.

 > At that point you are no longer dealing with the sequence of
 > characters talked about in the RFC.  You are dealing with data
 > which may or may not be text.

That's right, and I think that in most cases that is what programmers
want to be dealing with.  Let the library make sure that what goes on
the wire conforms to the RFC.  I don't want to know about it; I want
to work with the content of the URI.
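
A minimal sketch of that division of labor, using quote and unquote from
Python 3's urllib.parse and assuming UTF-8 (their default) is the
encoding wanted on the wire:

```python
from urllib.parse import quote, unquote

# The programmer works with the content of the path as text.
path = "/café"

# The library percent-encodes the UTF-8 bytes of the non-ASCII character,
# leaving "/" alone because it is in the default safe set.
wire = quote(path)

# Going the other way recovers the original text.
back = unquote(wire)
```

Here wire is "/caf%C3%A9", which is what the RFC allows on the wire, while
the program itself only ever handled "/café".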

 > I agree that the proliferation of encoding is ugly.
 > Although, if I'm thinking correctly, that only matters when you
 > want to allow mixing bytes and unicode, correct?

Well, you need to know a fair amount about the encoding: for example,
that the reserved bytes are used as defined in the RFC.

 > For debugging, I'm either not understanding or you're wrong.  If I'm given
 > an arbitrary sequence of bytes how do I sanely store them as str internally?

If it's really arbitrary, you use either a mapping to private space or
PEP 383, and accept that it won't make sense.  But in most cases you
should be able to achieve a fair degree of sanity.
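
For the PEP 383 route, a small sketch using the "surrogateescape" error
handler; the byte string is an invented example that is not valid UTF-8:

```python
# Hypothetical path bytes: mostly ASCII, plus bytes that are not valid UTF-8.
raw = b"/caf\xe9/\xff\xfe"

# Undecodable bytes are mapped to lone surrogates instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

The str you get is storable and round-trips losslessly, but the escaped
parts won't display as anything meaningful.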

 > If I transform them using an encoding that anticipates the full range of
 > bytes I may be able to display some representation of them but it's not
 > necessarily the sanest method of display (for instance, if I know that path
 > element 1 is always going to be a utf8 encoded string and path element 2 is
 > always shift-jis encoded, and path element 3 is binary data, I could
 > construct a much saner display method than treating the whole thing as
 > latin1).

And I think in most cases you will know, although usually that will be
because of a system-wide encoding.
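
When the per-element encodings are known, the saner display could look
something like this sketch.  The helper describe_path and the sample path
are hypothetical, not part of any library:

```python
from urllib.parse import quote, unquote_to_bytes

def describe_path(path, encodings):
    """Decode each percent-encoded path segment with its own known encoding.

    An encoding of None marks a segment as binary; it is shown as hex.
    """
    segments = path.strip("/").split("/")
    shown = []
    for segment, encoding in zip(segments, encodings):
        raw = unquote_to_bytes(segment)
        shown.append(raw.hex() if encoding is None else raw.decode(encoding))
    return shown

# Hypothetical three-element path: UTF-8 text, Shift-JIS text, binary data.
path = "/".join([
    "",
    quote("café".encode("utf-8")),
    quote("日本".encode("shift_jis")),
    quote(b"\x00\xff"),
])
display = describe_path(path, ["utf-8", "shift_jis", None])
```

Each element is shown according to what it actually is, rather than the
whole path being pushed through one catch-all codec.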

 > What is your basis for asserting that URIs that aren't sanely treated as
 > text are garbage?

I don't mean we can throw them away; I mean we can't do any sensible
processing on them.  You at least need to know about the reserved
delimiters.  I mean 'garbage' in the same way that Philip used it for
the "unknown" encoding, and in the sense of "garbage in, garbage out".

 > unicode handling redesign.  I'm stating my reading of the RFC not to defend
 > the use case Philip has, but because I think that the outlook that non-text
 > URIs (before being percent-encoded) are violations of the RFC

That's not what I'm saying.  What I'm trying to point out is that
manipulating a bytes object as a URI sort of presumes a lot about its
encoding as text.  Since many of the URIs we deal with are more or
less textual, why not take advantage of that?
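
To illustrate the presumption, here is a small sketch with an invented
URL.  Splitting bytes on the delimiter b"/" only means what you expect
when the encoding is ASCII-compatible:

```python
url = "http://host/a/b"

# In an ASCII-compatible encoding, splitting on the byte b"/" finds the
# same delimiters as splitting the text on "/".
utf8_parts = url.encode("utf-8").split(b"/")

# In UTF-16, "/" is two bytes, so the same split leaves stray NUL bytes
# attached to the pieces.
utf16_parts = url.encode("utf-16-le").split(b"/")
```

The UTF-8 pieces are the real components; the UTF-16 pieces are no longer
valid encodings of "host", "a" or "b".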