[I18n-sig] Re: Unicode debate

Tom Emerson tree@basistech.com
Fri, 28 Apr 2000 06:44:00 -0400 (EDT)

Just van Rossum writes:
 > How will other parts of a program know which encoding was used for
 > non-unicode string literals?

This is the exact reason that Unicode should be used for all string
literals: from a language design perspective I don't understand the
rationale for providing both "traditional" and "unicode" strings.

 > It seems to me that an encoding attribute for 8-bit strings solves this
 > nicely. The attribute should only be set automatically if the encoding of
 > the source file was specified or when the string has been encoded from a
 > unicode string. The attribute should *only* be used when converting to
 > unicode. (Hm, it could even be used when calling unicode() without the
 > encoding argument.) It should *not* be used when comparing (or adding,
 > etc.) 8-bit strings to each other, since they still may contain binary
 > goop, even in a source file with a specified encoding!

In Dylan there is an explicit split between 'characters' (which are
always Unicode) and 'bytes'.
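
A minimal sketch of what such a split looks like in practice. It assumes
a Python 3 interpreter, where str holds Unicode characters and bytes
holds raw octets; that design postdates this message and is used here
only for illustration. The sample word is arbitrary.

```python
# Characters and bytes are distinct types; converting between them
# always names an encoding explicitly.
text = "na\u00efve"            # five Unicode characters
data = text.encode("utf-8")    # six bytes: the i-diaeresis needs two

char_count = len(text)
byte_count = len(data)

# Mixing the two types is an error rather than a silent guess.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding with the same encoding recovers the original characters.
restored = data.decode("utf-8")
```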

What are the compelling reasons not to use UTF-8 as the (source)
document encoding? In the past the usual response has been, "the tools
aren't there for authoring UTF-8 documents". This argument becomes more
specious as more OSes move towards Unicode. I firmly believe this can
be done without Java's bloat.

One off-the-cuff solution is this:

All character strings are Unicode (UTF-8 encoding). Language terminals
and operators are restricted to US-ASCII, whose byte values are
identical in UTF-8. The contents of comments are not interpreted in any
way.
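
The property this relies on can be checked directly. The following is a
rough sketch, assuming a Python 3 interpreter; the sample source line
and the sample literal are made up for illustration.

```python
# ASCII-only text encodes to the same bytes under US-ASCII and UTF-8,
# so keywords and operators need no special handling.
ascii_source = "def f(x): return x + 1"
same_bytes = ascii_source.encode("ascii") == ascii_source.encode("utf-8")

# Every byte of a non-ASCII character in UTF-8 has the high bit set,
# so no byte inside a string literal can be mistaken for an ASCII
# operator or delimiter.
literal = "\u65e5\u672c\u8a9e"   # three CJK characters
encoded = literal.encode("utf-8")
all_high = all(b >= 0x80 for b in encoded)
```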

 > >- We need a way to indicate the encoding of input and output data
 > >files, and we need shortcuts to set the encoding of stdin, stdout and
 > >stderr (and maybe all files opened without an explicit encoding).
 > Can you open a file *with* an explicit encoding?

If you cannot, you lose. You absolutely must be able to specify the
encoding of a file when opening it, so that the runtime can transcode
into the native encoding as you read it. This should otherwise be
transparent to the user.
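
As a sketch of what this could look like, assuming Python 3's built-in
open() with its encoding argument (an API that postdates this message):
the helper name roundtrip is hypothetical and exists only for
illustration.

```python
import os
import tempfile

def roundtrip(text, encoding):
    """Write text to a temporary file in the given encoding, then read
    it back. Returns (decoded_text, raw_bytes_on_disk)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # The runtime transcodes from Unicode to the file encoding...
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        # ...the bytes on disk depend on the encoding chosen...
        with open(path, "rb") as f:
            raw = f.read()
        # ...and reading transcodes back to Unicode transparently.
        with open(path, "r", encoding=encoding, newline="") as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return decoded, raw

latin1_result = roundtrip("caf\u00e9", "latin-1")
utf8_result = roundtrip("caf\u00e9", "utf-8")
```

The program sees the same four characters in both cases; only the bytes
on disk differ.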


Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"