[Python-Dev] bytes / unicode

Tue Jun 22 07:31:16 CEST 2010

On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote:

> The RFC says that URIs are text, and therefore they can (and IMO
> should) be operated on as text in the stdlib.

No, *blue* is the best color for a shed.

Oops, wait, let me try that again.

While I broadly agree with this statement, it is really an oversimplification.  A URI is a structured object, with many different parts, which are transformed from bytes to ASCII (or something latin1-ish, which is really just bytes with a nice face on them) to real, honest-to-goodness text via the IRI specification: <http://tools.ietf.org/html/rfc3987>.

> Note also that the "complete solution" argument cuts both ways.  Eg, a
> "complete" solution should implement UTS 39 "confusables detection"[1]
> and IDNA[2].  Good luck doing that with bytes!

And good luck doing that with just characters, too.  You need a parsed representation of the URI that you can encode different parts of in different ways.  (My understanding is that you should only really implement confusables detection in the netloc... while that may be a bogus example, you're certainly only supposed to do IDNA in the netloc!)
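
As a rough sketch of what "encode different parts in different ways" could look like with the pieces the stdlib already has: the helper name iri_to_uri is made up for illustration, it ignores userinfo and plenty of edge cases, and it relies on the stdlib "idna" codec (which implements the older IDNA 2003 rules), so treat it as an assumption-laden example and not as a proposed API.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def iri_to_uri(iri):
    """Hypothetical helper: turn a text IRI into an ASCII-only URI string."""
    parts = urlsplit(iri)
    # IDNA applies to the host name only, never to the rest of the URI.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    # The other components are percent-encoded as UTF-8. '%' is marked safe
    # so that escapes already present in the input are left alone.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


example = iri_to_uri("http://bücher.example/straße?q=é")
```

Note that the split has to happen before any encoding decision is made; a single encode() call over the whole string cannot get both the netloc and the path right.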

You can just call urlsplit() all over the place to emulate this, but that does not give you the ability to go back to the original bytes, and thereby preserve things like brokenly-encoded segments, which seems to be what a lot of this hand-wringing is about.

To put it another way, there is no possible information-preserving string or bytes type that will make everyone happy as a result from urljoin().  The only return-type that gives you *everything* is "URI".

> just using 'latin-1' as the encoding allows you to
> use the (unicode) string operations internally, and then spew your
> mess out into the world for someone else to clean up, just as using
> bytes would.

This is the limitation that everyone seems to keep dancing around.  If you are using the stdlib, with functions that operate on sequences like 'str' or 'bytes', you need to choose one of three options:

  1. "decode" everything to latin1 (although I prefer to call it "charmap" when used in this way) so that you can have some mojibake that will fool a function that needs a unicode object, but not lose any information about your input so that it can be transformed back into exact bytes (and be very careful to never pass it somewhere that it will interact with real text!),
  2. actually decode things using the appropriate encoding, so that they can be displayed to the user and manipulated with proper text-manipulation tools, and throw away information about the bytes,
  3. keep both the bytes and the characters together (perhaps in a data structure) so that you can both display the data and encode it in situationally-appropriate ways.
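
To make the trade-offs concrete, here is a small illustration of the three options on one brokenly-encoded path.  The sample bytes and the Segment name are invented for the example.

```python
from collections import namedtuple

# One latin-1 segment followed by one UTF-8 segment: a "brokenly-encoded" path.
raw = b"/caf\xe9/na\xc3\xafve"

# Option 1: "charmap" decoding.  Lossless, but part of the result is mojibake.
fake_text = raw.decode("latin-1")
assert fake_text.encode("latin-1") == raw

# Option 2: decode for display.  Readable, but the original bytes are gone.
shown = raw.decode("utf-8", errors="replace")
assert shown.encode("utf-8") != raw

# Option 3: keep both together in a small structure (hypothetical name).
Segment = namedtuple("Segment", "raw text")
seg = Segment(raw=raw, text=shown)
```

Only option 3 can answer both "what do I show the user?" and "what do I send back over the wire?" for the same value.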

The stdlib as it is today is not going to handle the 3rd case for anyone.  I think that's fine; it is not the stdlib's job to solve everyone's problems.  I've been happy with it providing correctly-functioning pieces that can be used to build more elaborate solutions.  This is what I meant when I said I agree with Stephen's first point: the stdlib *should* just keep operating entirely on strings, because URIs are defined, by the spec, to be sequences of ASCII characters.  But that's not the whole story.

PJE's "bstr" and "ebytes" proposals set my teeth on edge.  I can totally understand the motivation for them, but I think it would be a big step backwards for Python 3 to succumb to that temptation, even in the form of a third-party library.  It is really trying to cram more information into a pile of bytes than truly exists there.  (Also, if we're going to have encodings attached to bytes objects, I would very much like to add "JPEG" and "FLAC" to the list of possibilities.)

The real tension there is that WSGI is desperately trying to avoid defining any data structures (i.e. classes), while still trying to work with structured data.  A URI class with a 'child' method could handily solve this problem.  You could happily call IRI(...).join(some bytes).join(some text) and then just say "give me some bytes, it's time to put this on the network", or "give me some characters, I have to show something to the user", or even "give me some characters appropriate for an 'href=' target in some HTML I'm generating" - although that last one could be left to the HTML generator, provided it could get enough information from the URI/IRI object's various parts itself.
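
A minimal sketch of the kind of class I have in mind follows.  Everything about it is hypothetical: the class name, the 'child', 'to_bytes' and 'to_text' methods, and the assumption that text segments go on the wire as UTF-8.  It only handles scheme, host and path, and it does not worry about a decoded segment containing a "/".

```python
from urllib.parse import quote


class URI:
    """Hypothetical sketch: scheme and host as text, path segments as raw bytes."""

    def __init__(self, scheme, host, segments=()):
        self.scheme = scheme
        self.host = host
        self.segments = tuple(segments)

    def child(self, segment):
        # Text is assumed to be UTF-8 on the wire; bytes are kept exactly as given.
        if isinstance(segment, str):
            segment = segment.encode("utf-8")
        return URI(self.scheme, self.host, self.segments + (segment,))

    def to_bytes(self):
        """ASCII-only form, suitable for putting on the network."""
        host = self.host.encode("idna").decode("ascii")
        path = "".join("/" + quote(seg, safe="") for seg in self.segments)
        return ("%s://%s%s" % (self.scheme, host, path)).encode("ascii")

    def to_text(self):
        """Form for display; segments that are not valid UTF-8 stay percent-escaped."""
        def show(seg):
            try:
                return seg.decode("utf-8")
            except UnicodeDecodeError:
                return quote(seg, safe="")
        path = "".join("/" + show(seg) for seg in self.segments)
        return "%s://%s%s" % (self.scheme, self.host, path)


uri = URI("http", "bücher.example").child(b"caf\xe9").child("naïve")
```

Here uri.to_bytes() and uri.to_text() are derived from the same object, so the brokenly-encoded b"caf\xe9" segment survives the trip to the network unchanged while the user still sees something readable.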

I don't mean to pick on WSGI, either.  This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do.
