Just van Rossum writes:
How will other parts of a program know which encoding was used for non-unicode string literals?
This is the exact reason that Unicode should be used for all string literals. From a language design perspective, I don't understand the rationale for providing both "traditional" and "unicode" strings.
It seems to me that an encoding attribute for 8-bit strings solves this nicely. The attribute should only be set automatically if the encoding of the source file was specified or when the string has been encoded from a unicode string. The attribute should *only* be used when converting to unicode. (Hm, it could even be used when calling unicode() without the encoding argument.) It should *not* be used when comparing (or adding, etc.) 8-bit strings to each other, since they still may contain binary goop, even in a source file with a specified encoding!
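A minimal sketch of the encoding-attribute idea, written in present-day Python 3, where the 8-bit string type is bytes. The class name TaggedBytes and the method to_unicode are hypothetical, invented here for illustration; nothing like them is built into Python. The attribute is consulted only on conversion to Unicode, as the proposal asks, and is ignored by comparison and concatenation.

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that remembers which encoding it is in."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        # None means "unknown": the data may be binary goop.
        self.encoding = encoding
        return self

    def to_unicode(self, encoding=None):
        """Decode using an explicit encoding, else the remembered one."""
        chosen = encoding or self.encoding
        if chosen is None:
            raise ValueError("no encoding known for this 8-bit string")
        return self.decode(chosen)


tagged = TaggedBytes("caf\u00e9".encode("latin-1"), encoding="latin-1")
text = tagged.to_unicode()      # uses the remembered encoding
joined = tagged + tagged        # plain bytes; the attribute is not carried along
```

Comparison and concatenation fall through to the ordinary bytes behaviour, so two 8-bit strings are still compared byte for byte regardless of what the attribute says.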
In Dylan there is an explicit split between 'characters' (which are always Unicode) and 'bytes'.
What are the compelling reasons not to use UTF-8 as the (source) document encoding? In the past the usual response has been, "the tools aren't there for authoring UTF-8 documents". This argument becomes more specious as more OSes move towards Unicode. I firmly believe this can be done without Java's bloat.
One off-the-cuff solution is this:
All character strings are Unicode (UTF-8 encoding). Language terminals and operators are restricted to US-ASCII, whose characters are encoded identically in UTF-8. The contents of comments are not interpreted in any way.
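The claim that US-ASCII text is already valid UTF-8 can be checked directly. The sketch below uses present-day Python 3; the helper names encodes_identically and high_bit_bytes are made up for this example.

```python
def encodes_identically(text):
    """Return True if text has the same bytes in US-ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside US-ASCII.
        return False


def high_bit_bytes(text):
    """Return the UTF-8 bytes of text that have the high bit set."""
    return [b for b in text.encode("utf-8") if b >= 0x80]


keywords_ok = encodes_identically("while x < 10: x += 1")
non_ascii = high_bit_bytes("na\u00efve")   # bytes of the one non-ASCII character
```

Every byte of a non-ASCII character in UTF-8 has the high bit set, so such bytes can never be mistaken for an ASCII terminal or operator by a tokenizer that only looks for ASCII.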
- We need a way to indicate the encoding of input and output data
files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding).
Can you open a file *with* an explicit encoding?
If you cannot, you lose. You absolutely must be able to specify the encoding of a file when opening it, so that the runtime can transcode into the native encoding as you read it. This should otherwise be transparent to the user.
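For illustration, this is roughly what opening a file with an explicit encoding looks like in present-day Python 3, where the built-in open() accepts an encoding argument and decodes to Unicode as the file is read. The function name transcode_file is hypothetical, and the Latin-1 sample data is an assumption chosen for the demo.

```python
import os
import tempfile


def transcode_file(src_path, src_encoding, dst_path, dst_encoding="utf-8"):
    """Copy a text file, decoding on read and re-encoding on write."""
    # newline="" keeps line endings exactly as they appear in the source.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:        # each line is already a Unicode string
            dst.write(line)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    latin1_path = os.path.join(workdir, "in.txt")
    utf8_path = os.path.join(workdir, "out.txt")
    with open(latin1_path, "wb") as f:
        f.write("caf\u00e9\n".encode("latin-1"))
    transcode_file(latin1_path, "latin-1", utf8_path)
```

The loop body never sees bytes: decoding and encoding happen inside the file objects, which is the kind of transparency asked for above.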