[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)

M.-A. Lemburg mal@lemburg.com
Wed, 05 Apr 2000 20:32:26 +0200


Guido van Rossum wrote:
> 
> > u"..." currently interprets the characters it finds as Latin-1
> > (this is by design, since the first 256 Unicode ordinals map to
> > the Latin-1 characters).
> 
> Nice, except that now we seem to be ambiguous about the source
> character encoding: it's Latin-1 for Unicode strings and UTF-8 for
> 8-bit strings...!

Noo... there is no definition for non-ASCII 8-bit strings in
Python source code, i.e. strings using the ordinal range 128-255.
If you were to define Latin-1 as the source code encoding, then we
would have to change auto-coercion to make a Latin-1 assumption
instead, but... I see the picture: people are getting pretty
confused about what is going on.

If you write u"xyz" then the ordinals of those characters are
taken and stored directly as Unicode characters. If you live
in a Latin-1 world, then you happen to be lucky: the Unicode
characters match your input. If not, some totally different
characters are likely to show up when the string is written
to a file and displayed using a Unicode-aware editor.
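
To illustrate, here is a minimal sketch in present-day Python 3, not
the interpreter under discussion. The bytes objects stand in for the
raw bytes of a source file, and all variable names (and the choice of
KOI8-R as the Russian encoding) are assumptions made for the example:

```python
# Raw bytes as they would sit in a source file saved as Latin-1.
latin1_source = "äöü".encode("latin-1")

# Taking each byte's ordinal directly as a Unicode code point gives
# the same result as decoding the bytes as Latin-1.
by_ordinal = "".join(chr(b) for b in latin1_source)
same_as_latin1 = by_ordinal == latin1_source.decode("latin-1") == "äöü"

# The same ordinal trick applied to bytes from a Russian encoding
# (KOI8-R is assumed here) yields unrelated characters, not Cyrillic.
koi8_source = "да".encode("koi8-r")
misread = "".join(chr(b) for b in koi8_source)
```

Here same_as_latin1 is true, while misread holds two characters from
the Latin-1 range rather than the Cyrillic text that was typed.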

The same will happen to your normal 8-bit string literals.
Nothing unusual so far... if you use Latin-1 strings and
write them to a file, you get Latin-1. If you happen to
program on DOS, you'll get the DOS code page encoding for the
German umlauts.
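
A small sketch of that difference, again in present-day Python 3. The
choice of cp437 as the DOS code page is an assumption for the example,
as are the variable names:

```python
text = "ä"

# The same character is a different byte under each encoding.
latin1_bytes = text.encode("latin-1")   # one byte, 0xE4
dos_bytes = text.encode("cp437")        # one byte, 0x84

# An 8-bit string is written out byte for byte, so a reader that
# assumes the other encoding sees a different character.
shown_as = dos_bytes.decode("latin-1")
```

The bytes are only meaningful together with the encoding the writer
had in mind; nothing in the 8-bit string itself records it.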

Now, the key point where all this started was that
u'ä' in 'äöü' will raise an error because 'äöü' is
*interpreted* as UTF-8 when it is coerced to Unicode -- this
doesn't mean that 'äöü' will be interpreted as UTF-8 elsewhere
in your application.
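
The failure can be reproduced in present-day Python 3 by doing the
coercion step by hand. This is only a sketch of the effect, not the
interpreter's actual coercion code, and the variable names are made
up for the example:

```python
# 'äöü' as the bytes a Latin-1 editor would have saved.
latin1_bytes = "äöü".encode("latin-1")

# Interpreting those bytes as UTF-8 fails: 0xE4 announces a
# multi-byte sequence, but 0xF6 is not a valid continuation byte.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# If the bytes really are UTF-8, the containment test works.
utf8_bytes = "äöü".encode("utf-8")
found = "ä" in utf8_bytes.decode("utf-8")
```

With Latin-1 input decoded_ok ends up false, which corresponds to the
error above; with genuine UTF-8 input found is true.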

The UTF-8 assumption had to be made in order to get the two
worlds to interoperate. We could just as well have chosen
Latin-1, but then people currently using, say, a Russian
encoding would get upset for the same reason.

One way or another somebody is not going to like whatever
we choose, I'm afraid... the simplest solution is to use
Unicode for all strings which contain non-ASCII characters
and then call .encode() as necessary.
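
A sketch of that approach in present-day Python 3, where str plays
the role of the Unicode string. The list of target encodings is an
assumption chosen for the example:

```python
# Keep the text as Unicode inside the program...
text = "äöü"

# ...and encode only at the boundary, per target encoding.
encoded = {codec: text.encode(codec)
           for codec in ("utf-8", "latin-1", "cp437")}

# Each byte string decodes back to the same text, even though the
# bytes themselves differ from one encoding to the next.
round_trips = all(data.decode(codec) == text
                  for codec, data in encoded.items())
```

The encoding then becomes an explicit decision at each output point
instead of an assumption baked into the string literal.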

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/