[I18n-sig] Changing case
Guido van Rossum
Tue, 11 Apr 2000 12:56:21 -0400
> What direction should we be heading: interpret the source
> files under some encoding assumption deduced from the
> platform, a command line switch or a #pragma, or simply fix
> one encoding (e.g. Latin-1) ?
I think we'll have to allow user-specified encodings -- including
UTF-8 and eventually UTF-16. How these are communicated to the parser
is a separate design issue; we could start with a command line switch
(assuming the standard library is ASCII only) and later migrate to a
per-file pragma. There should also be a default encoding; I would
propose UTF-8, as this is already the default encoding used at
run-time. (And because it annoys everyone roughly equally. :-)
Once we know the source encoding, it's obvious what to do with Unicode
literals: translate from the input encoding.
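In modern terms that translation step can be sketched as follows (using
today's bytes/str split purely for illustration; the function name is
hypothetical, not the actual compiler interface):

```python
# Hypothetical sketch of the parser's job once the source encoding
# is known: decode the raw bytes of a u"..." literal.  Names and
# types are illustrative only.

def parse_unicode_literal(source_bytes, source_encoding):
    """Translate a literal's raw bytes using the declared encoding."""
    return source_bytes.decode(source_encoding)

# The same byte from the source file means different characters
# under different source encodings:
raw = b"\xe9"
assert parse_unicode_literal(raw, "latin-1") == "\u00e9"  # e-acute
assert parse_unicode_literal(raw, "cp1251") == "\u0439"   # Cyrillic short i
```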
I want to propose a very simple rule for 8-bit literals: these use the
source encoding -- in other words, they aren't changed from what is
read from the file. This is most likely to yield what the user wants.
Especially if the user doesn't use Unicode explicitly (neither
literals nor via conversions), the user sees their native character set
when editing the source file, and probably uses the same encoding for
output files, so if the user simply prints strings, the right thing
should happen automatically.
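A minimal sketch of that pass-through rule, again using modern bytes
for illustration (the helper is hypothetical):

```python
# Hypothetical sketch of the proposed rule for 8-bit literals:
# the compiler stores the bytes exactly as read from the source
# file, with no decode/re-encode step.

def parse_8bit_literal(source_bytes):
    # No translation at all: what was in the file is what you get.
    return bytes(source_bytes)

raw = b"\xe4\xf6"   # e.g. two accented characters in a Latin-1 source file
assert parse_8bit_literal(raw) == raw   # bytes survive unchanged

# If stdout uses the same encoding as the user's editor, printing
# these bytes shows exactly the characters the author typed.
```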
If the user *does* use Unicode conversions, the user has to specify
their encoding explicitly (unless it's UTF-8). This seems only fair
-- the runtime can't know whether an 8-bit string being converted to
Unicode started its life as an 8-bit literal or whether it was read
from a file with an encoding that may only be known to the user.
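Concretely, with bytes.decode standing in for the string-to-Unicode
conversion of the time:

```python
# An 8-bit string from a Latin-1 source file or data file:
latin1_bytes = b"caf\xe9"   # "cafe" with e-acute, in Latin-1

# With an explicit encoding, the user supplies the knowledge the
# runtime cannot have:
assert latin1_bytes.decode("latin-1") == "caf\u00e9"

# Falling back on the UTF-8 default fails here, because 0xE9 on
# its own is not a valid UTF-8 sequence:
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok
```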
> The current divergence between u"...chars..." and "...chars..."
> really only stems from the fact that "...chars..." doesn't
> have to know about the used encoding, while u"...chars..." does
> to be able to convert the data to Unicode.
Right. Hence my deduction that currently the source encoding is
effectively Latin-1 -- that's what u"..." literals assume today.
> Note that even if the parser would know the encoding, you'd
> still have a problem processing the strings at run-time:
> 8-bit strings do not carry any encoding information.
> The only ways to fix this would be to define a global 8-bit
> string encoding or add an encoding attribute to strings.
The former we decided against -- the latter can be done by the user.
> One possible way would be to define that all 8-bit strings
> get converted to UTF-8 when parsed (by the compiler, eval(), etc.).
> This would assure that all strings used at run-time would
> in fact be UTF-8 and conversions to and from Unicode would
> be possible without information loss.
No -- this does NOT guarantee that all 8-bit strings are UTF-8. It
doesn't cover strings explicitly encoded using octal escapes, and
(much more importantly) it doesn't cover strings read from files or
sockets or constructed in other ways.
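The escape-sequence case alone is enough to see this; sketched with
modern bytes:

```python
# Octal escapes can put arbitrary bytes into an 8-bit literal, so
# compiling literals "to UTF-8" cannot guarantee valid UTF-8:
data = b"\377\376"   # bytes 0xFF 0xFE -- never valid in UTF-8

try:
    data.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
assert not is_utf8

# Strings read from files or sockets are in the same position:
# their bytes mean whatever the producer's encoding says they mean.
```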
(We can know that all strings we get out of Tkinter are UTF-8 encoded
though! Provided we're using Tcl/Tk 8.1 or higher.)
> The downside of this approach is that indexing and slicing do
> not work well with UTF-8: a single input character can be
> encoded by as much as 6 bytes (for 32-bit Unicode) ! I also
> assume that many applications rely on the fact that
> len("дц") == 2 and not 4.
Agreed. If we tried to make everything UTF-8, we should never have
started down the path of a separate Unicode string datatype.
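The length/indexing mismatch in concrete terms, using the two Cyrillic
characters from the quoted example:

```python
# Two characters that occupy one byte each in a Cyrillic 8-bit
# encoding, but two bytes each in UTF-8:
text = "\u0434\u0446"            # the two characters quoted above
encoded = text.encode("utf-8")

assert len(text) == 2      # characters -- what the program expects
assert len(encoded) == 4   # bytes -- what a forced-UTF-8 string reports

# Worse, slicing at an arbitrary byte offset can split a character:
try:
    encoded[:1].decode("utf-8")
    slice_ok = True
except UnicodeDecodeError:
    slice_ok = False
assert not slice_ok
```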
I say: 8-bit strings have no fixed encoding -- they are 8-bit bytes
and their interpretation is determined by the program. The default of
UTF-8 when converting to a Unicode string is just because we need a
default.
> Perhaps we should just loosen the used encoding for u"...chars..."
> using #pragmas and/or cmd line switches. Then people around the
> world would at least have a simple way to write programs which
> still work everywhere, but can be written using any of the
> encodings known to Python. 8-bit "...chars..." would then
> be interpreted as before: user defined data using a user
> defined encoding (the string->Unicode conversion would still
> need to make the UTF-8 assumption, though).
This sounds like my proposal. Let's do it.
--Guido van Rossum (home page: http://www.python.org/~guido/)