[Python-Dev] [I18n-sig] Re: Unicode debate
Tom Emerson
tree@basistech.com
Fri, 28 Apr 2000 06:44:00 -0400 (EDT)
Just van Rossum writes:
> How will other parts of a program know which encoding was used for
> non-unicode string literals?
This is the exact reason that Unicode should be used for all string
literals: from a language design perspective I don't understand the
rationale for providing both "traditional" and "unicode" strings.
> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!
In Dylan there is an explicit split between 'characters' (which are
always Unicode) and 'bytes'.
What are the compelling reasons not to use UTF-8 as the (source)
document encoding? In the past the usual response has been, "the tools
aren't there for authoring UTF-8 documents". This argument becomes more
specious as more OSes move towards Unicode. I firmly believe this can
be done without Java's bloat.
One off-the-cuff solution is this:
All character strings are Unicode (UTF-8 encoding). Language terminals
and operators are restricted to US-ASCII, whose byte values are
identical in UTF-8. The contents of comments are not interpreted in
any way.
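
To make the ASCII/UTF-8 point concrete, here is a minimal sketch. It
assumes present-day Python 3 (where str is always Unicode), and the
variable names are purely illustrative. It only checks the byte-level
claim; it is not a proposal for how a parser should be written.

```python
# Sketch only: checks that US-ASCII text has the same bytes in UTF-8.
# Assumes Python 3; all names here are illustrative.
ascii_source = "def f(x): return x + 1"

# UTF-8 encodes every code point below 128 as the single byte with
# that value, so pure-ASCII source is byte-identical in both encodings.
assert ascii_source.encode("ascii") == ascii_source.encode("utf-8")

# A non-ASCII character becomes a multi-byte sequence in which every
# byte has the high bit set, so it cannot collide with an ASCII
# keyword or operator.
literal = "na\u00efve"  # the word "naive" with a diaeresis on the i
encoded = literal.encode("utf-8")
non_ascii_bytes = [b for b in encoded if b >= 0x80]
```

A tokenizer that only cares about ASCII terminals can therefore scan
UTF-8 source byte by byte and pass string contents through untouched.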
> >- We need a way to indicate the encoding of input and output data
> >files, and we need shortcuts to set the encoding of stdin, stdout and
> >stderr (and maybe all files opened without an explicit encoding).
>
> Can you open a file *with* an explicit encoding?
If you cannot, you lose. You absolutely must be able to specify the
encoding of a file when opening it, so that the runtime can transcode
into the native encoding as you read it. This should be otherwise
transparent to the user.
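
As a rough illustration of opening a file with an explicit encoding,
here is a sketch that assumes present-day Python 3, whose built-in
open() accepts an encoding argument. The helper read_text and the
sample data are hypothetical, made up for this example.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a file, decoding its bytes from the stated encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Write some Latin-1 bytes to a temporary file to stand in for a
# legacy data file.
text = "caf\u00e9"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("latin-1"))

# With the right encoding named, the runtime transcodes on read.
decoded = read_text(path, "latin-1")

# With the wrong encoding named, the read fails instead of silently
# producing garbage: the lone 0xE9 byte is not valid UTF-8.
try:
    read_text(path, "utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

os.remove(path)
```

The caller names the encoding once, at open time, and everything read
afterwards is already a Unicode string.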
-tree
--
Tom Emerson Basis Technology Corp.
Language Hacker http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"