[Python-Dev] [I18n-sig] Re: Unicode debate

Tom Emerson tree@basistech.com
Fri, 28 Apr 2000 06:44:00 -0400 (EDT)


Just van Rossum writes:
 > How will other parts of a program know which encoding was used for
 > non-unicode string literals?

This is exactly why Unicode should be used for all string literals.
From a language design perspective I don't understand the rationale
for providing both "traditional" and "unicode" strings.

 > It seems to me that an encoding attribute for 8-bit strings solves this
 > nicely. The attribute should only be set automatically if the encoding of
 > the source file was specified or when the string has been encoded from a
 > unicode string. The attribute should *only* be used when converting to
 > unicode. (Hm, it could even be used when calling unicode() without the
 > encoding argument.) It should *not* be used when comparing (or adding,
 > etc.) 8-bit strings to each other, since they still may contain binary
 > goop, even in a source file with a specified encoding!

In Dylan there is an explicit split between 'characters' (which are
always Unicode) and 'bytes'.

What are the compelling reasons not to use UTF-8 as the (source)
document encoding? In the past the usual response has been, "the tools
aren't there for authoring UTF-8 documents". This argument becomes more
specious as more OSes move towards Unicode. I firmly believe this can
be done without Java's bloat.

One off-the-cuff solution is this:

All character strings are Unicode (UTF-8 encoding). Language terminals
and operators are restricted to US-ASCII, whose characters are encoded
identically in UTF-8. The contents of comments are not interpreted in
any way.
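
To illustrate the point about US-ASCII, here is a minimal sketch in
present-day Python (not the Python of this thread). The helper name
same_bytes_in_ascii_and_utf8 is made up for the example. It shows that
text restricted to US-ASCII has the same bytes in UTF-8, while anything
outside that range needs a multi-byte sequence.

```python
# Sketch only: the function name is hypothetical, chosen for this example.
def same_bytes_in_ascii_and_utf8(text):
    """Return True if text encodes identically under US-ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside US-ASCII.
        return False


# Keywords and operators restricted to US-ASCII are unchanged in UTF-8.
print(same_bytes_in_ascii_and_utf8("while x <= 10:"))

# A character outside US-ASCII (here U+00E9) takes more than one byte.
print(len("\u00e9".encode("utf-8")))
```

So a tokenizer that only knows US-ASCII for keywords and operators would
see the same bytes for them in a UTF-8 source file.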

 > >- We need a way to indicate the encoding of input and output data
 > >files, and we need shortcuts to set the encoding of stdin, stdout and
 > >stderr (and maybe all files opened without an explicit encoding).
 > 
 > Can you open a file *with* an explicit encoding?

If you cannot, you lose. You absolutely must be able to specify the
encoding of a file when opening it, so that the runtime can transcode
into the native encoding as you read it. This should be otherwise
transparent to the user.
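
As a rough sketch of what that could look like, the following uses the
encoding argument of the built-in open() in present-day Python 3, which
is an assumption relative to the Python under discussion here. The
helper read_text and the temporary demo file are made up for the
example.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a file, decoding its bytes from the stated encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical demo file: four characters stored as Latin-1 bytes.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("caf\u00e9".encode("latin-1"))

# The caller names the file's encoding; the runtime does the transcoding.
text = read_text(demo_path, "latin-1")
os.remove(demo_path)

print(text == "caf\u00e9")
```

Reading the same bytes with encoding="utf-8" would raise
UnicodeDecodeError, which is the failure you want to see early rather
than silently carrying mis-decoded text around.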

            -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"