[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)

Guido van Rossum guido@python.org
Wed, 05 Apr 2000 18:37:24 -0400


[MAL]
> > > u"..." currently interprets the characters it finds as Latin-1
> > > (this is by design, since the first 256 Unicode ordinals map to
> > > the Latin-1 characters).

[GvR]
> > Nice, except that now we seem to be ambiguous about the source
> > character encoding: it's Latin-1 for Unicode strings and UTF-8 for
> > 8-bit strings...!

[MAL]
> Noo... there is no defined meaning for non-ASCII 8-bit strings in
> Python source code that use ordinals in the range 128-255. If you
> were to define Latin-1 as the source code encoding, then we would
> have to change auto-coercion to make a Latin-1 assumption instead,
> but... I see the picture: people are getting pretty confused about
> what is going on.
> 
> If you write u"xyz" then the ordinals of those characters are
> taken and stored directly as Unicode characters. If you live
> in a Latin-1 world, then you happen to be lucky: the Unicode
> characters match your input. If not, some totally different
> characters are likely to show up if the string is written
> to a file and displayed using a Unicode-aware editor.
> 
> The same will happen to your normal 8-bit string literals.
> Nothing unusual so far... if you use Latin-1 strings and
> write them to a file, you get Latin-1. If you happen to
> program on DOS, you'll get the DOS code page's encoding for
> the German umlauts.
> 
> Now the key point where all this started was that 
> u'ä' in 'äöü' will raise an error due to 'äöü' being
> *interpreted* as UTF-8 -- this doesn't mean that 'äöü'
> will be interpreted as UTF-8 elsewhere in your application.
> 
> The UTF-8 assumption had to be made in order to get the two
> worlds to interoperate. We could just as well have chosen
> Latin-1, but then people currently using, say, a Russian
> encoding would get upset for the same reason.
> 
> One way or another somebody is not going to like whatever
> we choose, I'm afraid... the simplest solution is to use
> Unicode for all strings which contain non-ASCII characters
> and then call .encode() as necessary.
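
To make MAL's key example concrete, here is a rough sketch of the
behaviour under discussion (assuming the proposed UTF-8 default
encoding; '\xe4\xf6\xfc' is what a Latin-1 keyboard puts in the file
for 'äöü', and u'\xe4' is u'ä'):

    s = '\xe4\xf6\xfc'      # Latin-1 bytes -- not a valid UTF-8 sequence
    try:
        u'\xe4' in s        # mixing the types coerces s via the default codec
    except UnicodeError:
        print "coercion failed: these bytes don't decode as UTF-8"

    # The suggested way out: keep non-ASCII text in Unicode objects and
    # encode explicitly at the I/O boundary.
    t = u'\xe4\xf6\xfc'
    print u'\xe4' in t              # prints 1
    data = t.encode('latin-1')      # or 'utf-8', whatever the file needs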

I have a different view on this (except that I agree that it's pretty
confusing :-).  In my definition of a "source character encoding",
string literals, whether Unicode or 8-bit strings, are translated from
the source encoding to the corresponding run-time values.  If I had a
C compiler that read its source in EBCDIC but cross-compiled to a
machine that used ASCII, I would expect that 'a' in the source would
have the integer value 97 (ASCII 'a'), regardless of the EBCDIC value
for 'a'.
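
A Python rendering of the same idea, using the cp500 (EBCDIC) codec
purely for illustration:

    source_byte = u'a'.encode('cp500')           # '\x81' -- the byte an EBCDIC file holds
    runtime_char = unicode(source_byte, 'cp500') # the compiler's translation step
    print ord(runtime_char)                      # 97 -- ASCII/Unicode 'a', as expected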

If I type a non-ASCII Latin-1 character in a Unicode literal, it
generates the corresponding Unicode character.  This means to me that
the source character encoding is Latin-1.  But when I type the same
character in an 8-bit string literal, that literal is interpreted
as UTF-8 (e.g. when converting to Unicode using the default
conversions).
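
In interpreter terms the asymmetry looks roughly like this (again
assuming the proposed UTF-8 default; 0xE4 is 'ä' in Latin-1):

    print repr(u'\xe4')     # u'\xe4' -- the byte is taken as the Latin-1 code point
    try:
        unicode('\xe4')     # the default conversion reads the byte as UTF-8
    except UnicodeError:
        print "0xE4 on its own is not a legal UTF-8 sequence"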

Thus, even though you can do whatever you want with 8-bit literals in
your program, the most defensible view is that they are UTF-8
encoded.

I would be much happier if all source code used one and the same
encoding, because otherwise there's no good way to view such code in a
general Unicode-aware text viewer!

My preference would be to always use UTF-8.  This would mean no change
for 8-bit literals, but a big change for Unicode literals...  And a
break with everyone who's currently typing Latin-1 source code and
using strings as Latin-1.  (Or Latin-7, or whatever.)
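
Concretely: the UTF-8 encoding of 'ä' is the two bytes 0xC3 0xA4, which
a Latin-1 editor displays as 'Ã¤'.  Today those two source bytes in a
u"..." literal give two characters; under a UTF-8 source rule they
would give one:

    print repr(u'\xc3\xa4')                    # u'\xc3\xa4' -- the current reading
    print repr(unicode('\xc3\xa4', 'utf-8'))   # u'\xe4' -- what the same bytes
                                               # would mean under a UTF-8 rule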

My next preference would be a pragma to define the source encoding,
but that's a 1.7 issue.  Maybe the whole thing is... :-(
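
Purely as a hypothetical spelling, to make the pragma idea concrete
(nothing of the sort exists today), it might be a magic comment near
the top of the module:

    # -*- source-encoding: Latin-1 -*-
    s = "äöü"        # under such a pragma, unambiguously Latin-1 text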

--Guido van Rossum (home page: http://www.python.org/~guido/)