[Python-Dev] bytes / unicode

Wed Jun 23 09:07:50 CEST 2010

James Y Knight writes:

 > The surrogateescape method is a nice workaround for this, but I can't  
 > help thinking that it might've been better to just treat stuff as  
 > possibly-invalid-but-probably-utf8 byte-strings from input, through  
 > processing, to output.

This is the world we already have, modulo s/utf8/ascii + random GR
charset/.  It doesn't work, and it can't, in Japan or China or Korea,
and probably not in Russia or Kazakhstan, for some time yet.

That's not to say that byte-oriented processing doesn't have its
place.  And in many cases it's reasonable (but not secure or
bulletproof!) to assume ASCII compatibility of the byte stream,
passing through syntactically unimportant bytes verbatim.  Syntactic
analysis of such streams will surely have a lot in common with that
for text streams, so the same tools should be available.  (That's the
point of Guido's endorsement of polymorphism, AIUI.)
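
A minimal sketch of that kind of byte-oriented processing, in Python 3.
The helper name split_header and the sample data are my own illustration,
not anything from the stdlib.  It assumes only that the syntactically
important byte (the colon) means what it means in ASCII, and passes
everything else through untouched.  The last two lines show one way the
assumption can fail: in Shift JIS, the second byte of some characters is
the same byte as an ASCII backslash.

```python
def split_header(line):
    """Split b'Name: value' on the first colon without decoding anything."""
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("no colon in header line")
    # Only the ASCII syntax is interpreted; the value passes through verbatim.
    return name.strip().lower(), value.strip()

# The same header in two encodings; the parser never needs to know which.
latin1_line = "Subject: caf\u00e9".encode("latin-1")
utf8_line = "Subject: caf\u00e9".encode("utf-8")
latin1_result = split_header(latin1_line)
utf8_result = split_header(utf8_line)

# Why this is not bulletproof: U+8868 in Shift JIS ends in byte 0x5C,
# so a naive search for b"\\" finds a "backslash" that is not one.
sjis_bytes = "\u8868".encode("shift_jis")
backslash_collision = b"\\" in sjis_bytes
```

The colon happens to be safe in Shift JIS, but a parser that treats
backslash or other bytes in the 0x40-0x7E range as syntax would not be.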

But it's just not reasonable to assume this will work in a context
where text streams from various sources are mixed with byte streams.
In that case, the byte streams need to be converted to text before
mixing.  (You can't do it the other way around because there is no
guarantee that the text is compatible with the current encoding of the
byte stream, nor that all the byte streams have the same encoding.)
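
To make the asymmetry concrete, here is a small Python 3 sketch.  The
sample strings and the choice of Latin-1 are assumptions made for
illustration only.  It shows that text cannot in general be pushed into
the byte stream's encoding, that decoding the bytes first always gives
something that can be mixed, and how the surrogateescape error handler
lets bytes of unknown encoding travel through str and come back out
unchanged.

```python
text = "\u6771\u4eac"   # str that arrived as text (two kanji)
raw = b"caf\xe9"        # bytes from elsewhere, assumed here to be Latin-1

# Going text -> bytes fails: the byte stream's encoding can't represent it.
try:
    text.encode("latin-1")
    text_fits_latin1 = True
except UnicodeEncodeError:
    text_fits_latin1 = False

# Going bytes -> text works, and then the two can be mixed freely.
mixed = text + " " + raw.decode("latin-1")

# If the encoding is unknown, surrogateescape smuggles undecodable bytes
# through str as lone surrogates and restores them on output.
smuggled = raw.decode("utf-8", errors="surrogateescape")
round_tripped = smuggled.encode("utf-8", errors="surrogateescape")
```

Note that the smuggled string is only safe to write back out with the
same codec and error handler; encoding it strictly raises an error.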

We do need str-based implementations of modules like urllib.