[Python-Dev] bytes / unicode

Fri Jun 25 09:49:16 CEST 2010

P.J. Eby writes:

 > This doesn't have to be in the functions; it can be in the 
 > *types*.  Mixed-type string operations have to do type checking and 
 > upcasting already, but if the protocol were open, you could make an 
 > encoded-bytes type that would handle the error checking.

Don't you realize that "encoded-bytes" is equivalent to using a very
limited profile of ISO 2022 coding extensions, such as the Emacs/MULE
internal encoding or TRON code?  It has been tried.  It does not work.
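
To make the ISO 2022 comparison concrete, here is a minimal sketch using
Python's standard 'iso2022_jp' codec.  The sample string and the
variable names are assumptions made up for the illustration.  The point
is only that the encoded form carries its charset switches in-band, as
escape sequences between runs:

```python
# Illustration only: the sample text is an assumption for this sketch.
sample = "English\u65e5\u672c\u8a9e"   # "English" followed by three kanji
encoded = sample.encode("iso2022_jp")

# The ASCII run is emitted unchanged.  The Japanese run is bracketed by
# escape sequences: ESC $ B switches to JIS X 0208, and ESC ( B switches
# back to ASCII at the end.
ascii_run, _, rest = encoded.partition(b"\x1b$B")

print(ascii_run)
print(rest.endswith(b"\x1b(B"))
```

An encoded-bytes type that tags each run with its codec carries the same
information, just out-of-band instead of in escape sequences.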

I understand how types can do such checking; my point is that the
encoded-bytes type doesn't have enough information to do it in the
cases where you think it is better than converting to str.  There are
*no useful operations* that can be done on two encoded-bytes with
different encodings unless you know the ultimate target codec.  The
only sensible way to define the concatenation of ('ascii', 'English')
with ('euc-jp','ÆüËÜ¸ì') is something like ('ascii', 'English',
'euc-jp','ÆüËÜ¸ì'), and *not* ('euc-jp','EnglishÆüËÜ¸ì'), because you
don't know that the ultimate target codec is 'euc-jp'-compatible.
Worse, you need to build all the information about which codecs are
mutually compatible into the encoded-bytes type.  For example, if the
ultimate target is known to be 'shift_jis', then 'ascii' is trivially
compatible and 'euc-jp' requires a conversion, but latin-9 you can't
have at all.
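
A rough sketch of what that means in practice.  The representation (a
list of (codec, bytes) runs) and the helper names concat() and render()
are hypothetical, invented for this illustration.  Nothing useful
happens until a target codec is known, and at that point the work is
"decode everything, then encode":

```python
# Hypothetical model of "encoded-bytes": a list of (codec, bytes) runs.
def concat(left, right):
    """Concatenate two tagged values by keeping their runs separate."""
    return left + right

def render(runs, target):
    """Decode each run with its own codec, then encode to the target."""
    text = "".join(data.decode(codec) for codec, data in runs)
    return text.encode(target)

english = [("ascii", b"English")]
japanese = [("euc-jp", "\u65e5\u672c\u8a9e".encode("euc-jp"))]
euro = [("iso-8859-15", "\u20ac".encode("iso-8859-15"))]  # latin-9 euro sign

mixed = concat(english, japanese)        # still two runs, two codecs
sjis = render(mixed, "shift_jis")        # euc-jp run needs a conversion
naive = b"".join(data for _, data in mixed)  # raw bytes glued together

# A latin-9 run cannot be rendered for a shift_jis target at all.
try:
    render(concat(english, euro), "shift_jis")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False

print(sjis != naive, euro_ok)
```

Note that render() is just a round trip through str, which is the
conversion the encoded-bytes type was supposed to avoid.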

 > (Btw, in some earlier emails, Stephen, you implied that this could be 
 > fixed with codecs -- but it can't, because the problem isn't with the 
 > bytes containing invalid Unicode, it's with the Unicode containing 
 > invalid bytes -- i.e., characters that can't be encoded to the 
 > ultimate codec target.)

No, the problem is not with the Unicode, it is with the code that
allows characters not encodable with the target codec.  If you don't
have a target codec, there are ascii-safe source codecs, such as
'latin-1' or 'ascii' with surrogateescape, that will work any time
that bytes-oriented processing can work.
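
A minimal sketch of that last point, assuming a made-up header line as
input.  Both 'ascii' with the surrogateescape error handler and
'latin-1' decode arbitrary bytes without error and round-trip them
exactly, so ASCII-oriented processing can be done on str:

```python
# Made-up input: an ASCII header name followed by EUC-JP encoded text.
raw = b"Subject: " + "\u65e5\u672c\u8a9e".encode("euc-jp") + b"\r\n"

# Non-ASCII bytes become lone surrogates instead of raising an error.
as_ascii = raw.decode("ascii", errors="surrogateescape")

# ASCII-level processing works on the str as it would on the bytes.
name, _, value = as_ascii.partition(": ")
rewritten = name.upper() + ": " + value

# Encoding with the same handler restores the original bytes exactly.
back = rewritten.encode("ascii", errors="surrogateescape")

# 'latin-1' maps every byte to one code point, so it round-trips too.
as_latin1 = raw.decode("latin-1")

print(back)
```

Neither variant knows what the non-ASCII bytes mean; they only promise
not to lose them, which is all that bytes-oriented code could promise
in the first place.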