[Python-Dev] bytes / unicode

Tue Jun 22 04:58:57 CEST 2010

Toshio Kuratomi writes:

 > One comment here -- you can also have uri's that aren't decodable into their
 > true textual meaning using a single encoding.
 > 
 > Apache will happily serve out uris that have utf-8, shift-jis, and
 > euc-jp components inside of their path but the textual
 > representation that was intended will be garbled (or be represented
 > by escaped byte sequences).  For that matter, apache will serve
 > requests that have no true textual representation as it is working
 > on the byte level rather than the character level.

Sure.  I've never seen that combination, but I have seen Shift JIS and
KOI8-R in the same path.

But in that case, just using 'latin-1' as the encoding allows you to
use the (unicode) string operations internally, and then spew your
mess out into the world for someone else to clean up, just as using
bytes would.
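
A minimal sketch of that round trip, assuming Python 3 (the path
components below are made up for illustration):

```python
# A path mixing a Shift JIS component and a KOI8-R component
# (illustrative values, not taken from a real server).
raw = (b"/" + "日本語".encode("shift_jis")
       + b"/" + "привет".encode("koi8_r")
       + b"/index.html")

# latin-1 maps each byte 0x00-0xFF to the code point with the same value,
# so this decode cannot fail and loses nothing.
text = raw.decode("latin-1")

# Ordinary str operations work on the ASCII structure of the path.
parts = text.split("/")
joined = "/".join(parts[:-1] + ["other.html"])

# Encoding back hands the same bytes (mess included) to the outside world.
out = joined.encode("latin-1")
```

The non-ASCII components are mojibake while they are str, but they
survive untouched, which is all that using bytes would have given you.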

 > So a complete solution really should allow the programmer to pass
 > in uris as bytes when the programmer knows that they need it.

Other than passing bytes into a constructor, I would argue that if a
complete solution requires, eg, an interface that allows
urljoin(base,subdir) where the types of base and subdir are not
required to match, then it doesn't belong in the stdlib.  For stdlib
usage, that's premature optimization IMO.
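
For illustration only, here is how urljoin in current Python 3
releases of urllib.parse treats argument types (the URL is
hypothetical, and this describes today's behavior, not a proposal made
in this thread):

```python
from urllib.parse import urljoin

base = "http://example.com/docs/"   # hypothetical URL
subdir = "api/"

# Both arguments are str: the usual case.
joined = urljoin(base, subdir)

# Both arguments are bytes: also accepted, and the result is bytes.
joined_bytes = urljoin(base.encode("ascii"), subdir.encode("ascii"))

# Mixed types are refused rather than silently coerced.
try:
    urljoin(base.encode("ascii"), subdir)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```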

The RFC says that URIs are text, and therefore they can (and IMO
should) be operated on as text in the stdlib.  It's not just a matter
of manipulating the URIs themselves, where working directly on bytes
will work just as well and with the same string operations (as
long as everything is bytes).  It's also a question of API complexity
(eg, Barry's bugaboo of proliferation of encoding= parameters) and of
debugging (if URIs are internally str, then they will display sanely
in tracebacks and the interpreter).
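
A small sketch of the debugging point, assuming Python 3 and a
hypothetical URI with a non-ASCII path component:

```python
uri = "http://example.com/日本語/"   # hypothetical URI

# What a traceback or the interactive interpreter would show for str.
as_text = repr(uri)

# The same URI held as UTF-8 bytes shows up as hex escapes instead.
as_bytes = repr(uri.encode("utf-8"))

print(as_text)
print(as_bytes)
```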

The cases where URIs can't be sanely treated as text are garbage
input, and the stdlib should not try to provide a solution.  Just
passing in bytes and getting out bytes is GIGO.  Trying to do "some"
error-checking is going to be insufficient much of the time and overly
strict most of the rest of the time.  The programmer in the trenches
is going to need to decide what to allow and what not; I don't think
there are general answers because we know that allowing random URLs on
the web leads to various kinds of problems.  Some sites will need to
address some of them.

Note also that the "complete solution" argument cuts both ways.  Eg, a
"complete" solution should implement UTS 39 "confusables detection"[1]
and IDNA[2].  Good luck doing that with bytes!
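
To make that concrete, a sketch using only the stdlib, assuming
Python 3.  The host names are hypothetical.  The "idna" codec
implements RFC 3490; the stdlib has no UTS 39 implementation, so the
second half only shows why such a check has to look at characters
rather than bytes:

```python
import unicodedata

# IDNA: the codec is defined on text and produces the ASCII form.
host = "b\u00fccher.example"        # hypothetical host, "bücher.example"
ace = host.encode("idna")
back = ace.decode("idna")

# Confusables: these two hosts render almost identically but differ in
# their second character (Latin "a" vs. U+0430).
latin = "paypal.example"
spoof = "p\u0430ypal.example"
names = [unicodedata.name(latin[1]), unicodedata.name(spoof[1])]

print(ace, back)
print(names)
```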

If you *need* bytes (rather than simply trying to avoid conversion
overhead), you're in a hazmat handling situation.  Passing bytes in to
stdlib APIs here is the equivalent of carrying around kilograms of
fissionables in an open bucket.  While the Tokaimura comparison is
hyperbole, it can't be denied that use of bytes here shortcuts a lot
of processing strongly suggested by the RFCs, and prevents use of
various programming conveniences (such as reasonable display of URI
values in debugging).  Does the efficiency really justify including
that in the stdlib?  I dunno, I'm not a web programmer in the
trenches.  But I take my cue from MvL and MAL who don't seem real
enthusiastic about this.

And as Martin says, there is as yet no evidence offered that the
overhead of conversion is a general problem.

Footnotes: 
[1]  http://www.unicode.org/reports/tr39/

[2]  http://www.rfc-editor.org/rfc/rfc3490.txt