[Python-Dev] bytes / unicode
Stephen J. Turnbull
stephen at xemacs.org
Fri Jun 25 09:49:16 CEST 2010
P.J. Eby writes:
> This doesn't have to be in the functions; it can be in the
> *types*. Mixed-type string operations have to do type checking and
> upcasting already, but if the protocol were open, you could make an
> encoded-bytes type that would handle the error checking.
Don't you realize that "encoded-bytes" is equivalent to use of a very
limited profile of ISO 2022 coding extensions? Such as Emacs/MULE
internal encoding or TRON code? It has been tried. It does not work.
I understand how types can do such checking; my point is that the
encoded-bytes type doesn't have enough information to do it in the
cases where you think it is better than converting to str. There are
*no useful operations* that can be done on two encoded-bytes with
different encodings unless you know the ultimate target codec. The
only sensible way to define the concatenation of ('ascii', 'English')
with ('euc-jp','ÆüËܸì') is something like ('ascii', 'English',
'euc-jp','ÆüËܸì'), and *not* ('euc-jp','EnglishÆüËܸì'), because you
don't know that the ultimate target codec is 'euc-jp'-compatible.
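
To make the concatenation point concrete, here is a minimal sketch of such a tagged type. The class name EncodedBytes, its runs field, and the encode_for method are all hypothetical, invented for this illustration; nothing like them exists in the standard library. The sample text is the Japanese word for "Japanese".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedBytes:
    """Hypothetical tagged-bytes type: a tuple of (codec, raw bytes) runs."""
    runs: tuple

    def __add__(self, other):
        # With no target codec known, the only safe result keeps every run
        # tagged with its own codec. Merging the runs would be a guess.
        return EncodedBytes(self.runs + other.runs)

    def encode_for(self, target):
        # Only once the target codec is known can the runs be flattened,
        # and that goes through str for each run anyway.
        return b"".join(raw.decode(codec).encode(target)
                        for codec, raw in self.runs)


english = EncodedBytes((("ascii", b"English"),))
japanese = EncodedBytes((("euc-jp", "日本語".encode("euc-jp")),))

# Concatenation yields two tagged runs, not one merged 'euc-jp' string.
mixed = english + japanese
```

Note that encode_for has to decode each run to str before it can produce anything useful, which is the conversion the tagged type was supposed to avoid.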
Worse, you need to build in all the information about which codecs are
mutually compatible into the encoded-bytes type. For example, if the
ultimate target is known to be 'shift_jis', then 'ascii' is trivially
compatible and 'euc-jp' requires a conversion, but 'latin-9' cannot be
represented at all.
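
The three cases for a 'shift_jis' target can be checked with the stock codecs. This is only a sketch: the helper name reencode is made up for the example, and the euro sign stands in for "some latin-9 character that Shift JIS lacks".

```python
def reencode(raw, source, target):
    """Hypothetical helper: convert raw bytes from source codec to target codec."""
    # The conversion goes through str; a character the target cannot
    # represent raises UnicodeEncodeError here.
    return raw.decode(source).encode(target)


ascii_bytes = b"English"
eucjp_bytes = "日本語".encode("euc-jp")
latin9_bytes = "€".encode("iso-8859-15")  # latin-9 is ISO 8859-15

# 'ascii' input: the bytes come out unchanged.
from_ascii = reencode(ascii_bytes, "ascii", "shift_jis")

# 'euc-jp' input: same characters, different bytes, so a real conversion.
from_eucjp = reencode(eucjp_bytes, "euc-jp", "shift_jis")

# latin-9 input: the euro sign has no Shift JIS representation.
try:
    reencode(latin9_bytes, "iso-8859-15", "shift_jis")
    latin9_ok = True
except UnicodeEncodeError:
    latin9_ok = False
```

An encoded-bytes type would have to carry this knowledge for every pair of codecs it might meet.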
> (Btw, in some earlier emails, Stephen, you implied that this could be
> fixed with codecs -- but it can't, because the problem isn't with the
> bytes containing invalid Unicode, it's with the Unicode containing
> invalid bytes -- i.e., characters that can't be encoded to the
> ultimate codec target.)
No, the problem is not with the Unicode, it is with the code that
allows characters not encodable with the target codec. If you don't
have a target codec, there are ascii-safe source codecs, such as
'latin-1' or 'ascii' with surrogateescape, that will work any time
that bytes-oriented processing can work.
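
A short sketch of what I mean by ascii-safe source codecs, using only built-in codecs and the surrogateescape error handler from PEP 383. The header line is invented sample data.

```python
# A header-like line whose value is EUC-JP bytes of unknown (to us) encoding.
raw = b"Subject: " + "日本語".encode("euc-jp") + b"\r\n"

# 'latin-1' maps every byte to one code point, so the round trip is lossless.
as_latin1 = raw.decode("latin-1")
latin1_roundtrip = as_latin1.encode("latin-1")

# 'ascii' with surrogateescape smuggles each non-ASCII byte through as a
# lone surrogate, and restores it on the way out.
as_escaped = raw.decode("ascii", errors="surrogateescape")
escaped_roundtrip = as_escaped.encode("ascii", errors="surrogateescape")

# ASCII-level processing works on the str exactly as it would on bytes.
name, _, value = as_escaped.partition(": ")

# The failure only shows up if code tries to push the smuggled bytes into
# a real target codec without undoing the escape.
try:
    as_escaped.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Either decoding keeps the original bytes recoverable, so anything that would have worked on bytes works on the resulting str.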