[Python-Dev] [I18n-sig] Re: Unicode debate

28 Apr 2000

Just van Rossum writes:
...
How will other parts of a program know which encoding was used for
non-unicode string literals?
This is the exact reason that Unicode should be used for all string
literals: from a language design perspective I don't understand the
rationale for providing "traditional" and "unicode" strings.
...
It seems to me that an encoding attribute for 8-bit strings solves this
nicely. The attribute should only be set automatically if the encoding of
the source file was specified or when the string has been encoded from a
unicode string. The attribute should *only* be used when converting to
unicode. (Hm, it could even be used when calling unicode() without the
encoding argument.) It should *not* be used when comparing (or adding,
etc.) 8-bit strings to each other, since they still may contain binary
goop, even in a source file with a specified encoding!
In Dylan there is an explicit split between 'characters' (which are
always Unicode) and 'bytes'.
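
The encoding-attribute idea quoted above could be sketched roughly as
follows. This is only an illustration in present-day Python, where `bytes`
stands in for the 8-bit string type; the names `TaggedBytes`, `to_unicode`
and `encode_tagged` are hypothetical and were not proposed in this thread.

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that may remember its encoding."""

    def __new__(cls, data, encoding=None):
        obj = super().__new__(cls, data)
        # None means "unknown": the bytes may be arbitrary binary data.
        obj.encoding = encoding
        return obj

    def to_unicode(self, encoding=None):
        # The attribute is consulted only here, when converting to Unicode.
        enc = encoding or self.encoding
        if enc is None:
            raise ValueError("no encoding known for this byte string")
        return self.decode(enc)


def encode_tagged(text, encoding):
    """Encode a Unicode string and record which encoding was used."""
    return TaggedBytes(text.encode(encoding), encoding)


tagged = encode_tagged("caf\u00e9", "latin-1")
untagged = TaggedBytes(b"caf\xe9")

# Comparison and concatenation look at the raw bytes only; the
# encoding attribute plays no part in either operation.
equal = tagged == untagged
joined = tagged + untagged
```

Here `tagged.to_unicode()` needs no argument, while `untagged.to_unicode()`
raises unless the caller supplies an encoding, and `joined` is a plain byte
string with no encoding attached.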

What are the compelling reasons to not use UTF-8 as the (source)
document encoding? In the past the usual response was, "the tools aren't
there for authoring UTF-8 documents". This argument becomes more
specious as more OSes move towards Unicode. I firmly believe this can
be done without Java's bloat.

One off-the-cuff solution is this:

All character strings are Unicode (UTF-8 encoding). Language terminals
and operators are restricted to US-ASCII, whose characters are encoded
identically in UTF-8. The contents of comments are not interpreted in any way.
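
A small Python sketch of why restricting terminals to US-ASCII works: ASCII
characters keep the same byte values in UTF-8, and every byte of a non-ASCII
character is 0x80 or above, so a scanner looking for ASCII operators cannot
mistake part of a character for one. The sample source line and the helper
name `non_ascii_bytes` are made up for illustration.

```python
def non_ascii_bytes(data):
    """Return the bytes of `data` that fall outside US-ASCII (>= 0x80)."""
    return bytes(b for b in data if b >= 0x80)


source = 'greeting = "na\u00efve"'
encoded = source.encode("utf-8")

# The ASCII-only prefix has the same bytes in both encodings.
prefix = 'greeting = "'
same_bytes = prefix.encode("ascii") == prefix.encode("utf-8")

# Only the one non-ASCII character contributes bytes >= 0x80.
high = non_ascii_bytes(encoded)
```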
...
...
- We need a way to indicate the encoding of input and output data
files, and we need shortcuts to set the encoding of stdin, stdout and
stderr (and maybe all files opened without an explicit encoding).
Can you open a file *with* an explicit encoding?
If you cannot, you lose. You absolutely must be able to specify the
encoding of a file when opening it, so that the runtime can transcode
into the native encoding as you read it. This should otherwise be
transparent to the user.
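
As a rough sketch of what that looks like in practice, present-day Python
lets the built-in `open` take an `encoding` argument and transcodes to
Unicode strings as the file is read. The helper names `read_text` and `demo`
and the sample text are hypothetical; nothing here reflects what the
language offered at the time of this thread.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a file, decoding its bytes from the stated encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def demo():
    """Write Latin-1 bytes to a temporary file and read them back as text."""
    text = "Gr\u00fc\u00dfe"
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(text.encode("latin-1"))
        # The caller names the file's encoding; decoding happens on read.
        return read_text(path, "latin-1") == text
    finally:
        os.remove(path)
```

Once the encoding is given at open time, the rest of the program only ever
sees Unicode strings, which is the transparency asked for above.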

            -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"