[XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate

Just van Rossum just@letterror.com
Tue, 2 May 2000 06:47:35 +0100


At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
>Here's one usage scenario.
>
>A Japanese user is reading lines from a file encoded in ISO-2022-JP.
>The readline() method returns 8-bit strings in that encoding (the file
>object doesn't do any decoding).  She realizes that she wants to do
>some character-level processing on the file so she decides to convert
>the strings to Unicode.
>
>I believe that whether the default encoding is UTF-8 or Latin-1
>doesn't matter for her -- both are wrong, she needs to write explicit
>unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
>"better", because interpreting ISO-2022-JP data as UTF-8 will most
>likely give an exception (when a \300 range byte isn't followed by a
>\200 range byte) -- while interpreting it as Latin-1 will silently do
>the wrong thing.  (An explicit error is always better than silent
>failure.)
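Guido's failure-mode claim can be sketched in modern Python 3 syntax (the API under discussion at the time was `unicode(line, enc)`; today's equivalent is `bytes.decode(enc)`; the sample data here is illustrative, not from the scenario):

```python
# A byte in the \300 range (a UTF-8 lead byte) that is not followed by
# the required continuation bytes in the \200 range is invalid UTF-8.
data = b"caf\xe9"  # "café" encoded as Latin-1, NOT as UTF-8

# Interpreting it as UTF-8 fails loudly...
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("UTF-8 rejects it:", exc)

# ...while interpreting it as Latin-1 silently succeeds, whether or not
# Latin-1 happens to be the right encoding.
print(data.decode("latin-1"))  # 'café' -- correct here, but only by luck
```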

But then it's even better to *always* raise an exception, since it's
entirely possible for a string to contain valid UTF-8 without *being*
UTF-8. I really think the exception argument is moot, since there can
*always* be situations that pass silently. Encoding issues are silent by
nature -- e.g. there's no way any system can tell that interpreting
MacRoman data as Latin-1 is wrong, maybe even fatally so -- the user will
just have to deal with it. You can argue all you want, but *any*
multi-byte encoding stored in an 8-bit string is a buffer, not a string,
for all the reasons Fredrik and Paul have thrown at you, and right they
are. Choosing such an encoding as a default conversion to Unicode makes no
sense at all. Recap of the main arguments:

pro UTF-8:
always reversible when going from Unicode to 8-bit

con UTF-8:
not a string: confusing semantics

pro Latin-1:
simpler semantics

con Latin-1:
non-reversible, western-centric
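The MacRoman point above can be made concrete (a Python 3 sketch; the sample string is illustrative): with two single-byte encodings, *every* byte sequence decodes without error, so misinterpretation can never raise.

```python
# 'résumé' encoded in MacRoman; 0x8E is 'é' there, but a C1 control
# character in Latin-1.
data = "r\u00e9sum\u00e9".encode("mac-roman")

wrong = data.decode("latin-1")    # never raises: Latin-1 maps all 256 bytes
right = data.decode("mac-roman")  # the intended text

print(repr(wrong))   # 'r\x8esum\x8e' -- no exception, silently wrong
print(right)         # 'résumé'
```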

Given the fact that very often *both* will be wrong, I'd go for the simpler
semantics.
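The "valid UTF-8 while not *being* UTF-8" point actually applies to the very scenario quoted above: ISO-2022-JP is a 7-bit encoding (escape sequences plus ASCII-range bytes), so its output *is* well-formed UTF-8 and the hoped-for exception never fires. A Python 3 sketch:

```python
# "Japanese" (the word 'nihongo') encoded in ISO-2022-JP.
japanese = "\u65e5\u672c\u8a9e"
data = japanese.encode("iso-2022-jp")

assert max(data) < 0x80              # every byte is in the ASCII range

wrong = data.decode("utf-8")         # no exception raised...
right = data.decode("iso-2022-jp")   # ...vs. the intended text

print(repr(wrong))   # escape-sequence gibberish, accepted silently
print(right)         # the Japanese text
```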

Just