[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Tue, 02 May 2000 14:56:34 -0400


> It's the naive user who will be surprised by these random UTF-8 decoding
> errors. 
> 
> That's why this is NOT a convenience issue (are you listening MAL???).
> It's a short and long term simplicity issue. There are lots of languages
> where it is de rigueur to discover and work around inconvenient and
> confusing default behaviors. I just don't think that we should be ADDING
> such behaviors.

So what do you think of my new proposal of using ASCII as the default
"encoding"?  It takes care of "a character is a character" but also
(almost) guarantees an error message when mixing encoded 8-bit strings
with Unicode strings without specifying an explicit conversion --
*any* 8-bit byte with the top bit set is rejected by the default
conversion to Unicode.
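
A minimal sketch of that rule, written in present-day Python 3 syntax
rather than the API under discussion here. The helper name
to_unicode_default is hypothetical, and strict ASCII decoding stands in
for the proposed default conversion:

```python
def to_unicode_default(raw: bytes) -> str:
    # Hypothetical helper: models the proposed default conversion as
    # strict ASCII decoding (an assumption made for illustration).
    return raw.decode("ascii")

# Pure 7-bit data converts unchanged: "a character is a character".
plain = to_unicode_default(b"hello")

# A byte with the top bit set (0xE9 here) is rejected instead of being
# guessed at.
try:
    to_unicode_default(b"caf\xe9")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The caller who really has Latin-1 or UTF-8 data has to say so with an
explicit conversion; only plain ASCII passes through silently.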

I think this is less confusing than Latin-1: when an unsuspecting user
is reading encoded text from a file into 8-bit strings and attempts to
use it in a Unicode context, an error is raised instead of producing
garbage Unicode characters.
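
To illustrate the difference, here is a small sketch in present-day
Python 3 syntax. It assumes the file happened to contain UTF-8 encoded
text; the variable names are made up for the example:

```python
# Bytes as they might be read from a file holding UTF-8 text: one
# accented letter (U+00E9) stored as the two bytes 0xC3 0xA9.
encoded = "\u00e9".encode("utf-8")

# A Latin-1 default never fails: each byte becomes one character, so the
# single letter silently turns into two unrelated characters.
as_latin1 = encoded.decode("latin-1")

# An ASCII default refuses the same bytes, which points the user at the
# missing explicit conversion.
try:
    encoded.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

With Latin-1 the mistake surfaces later, if at all, as wrong characters
in the output; with ASCII it surfaces at the point of conversion.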

It encourages the use of Unicode strings for everything beyond ASCII
-- there's no way around ASCII since that's the source encoding etc.,
but Latin-1 is an inconvenient default in most parts of the world.
ASCII is accepted everywhere as the base character set (e.g. for
email and for text-based protocols like FTP and HTTP), just like
English is the one natural language that we can all use to communicate
(to some extent).

--Guido van Rossum (home page: http://www.python.org/~guido/)