[I18n-sig] Unicode debate
Tue, 2 May 2000 01:42:51 -0700 (PDT)
I'll warn you that i'm not very experienced or well-informed, but
i suppose i might as well toss in my naive opinion.
At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
> I believe that whether the default encoding is UTF-8 or Latin-1
> doesn't matter for here -- both are wrong, she needs to write explicit
> unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is
> "better", because [this] will most likely give an exception...
On Tue, 2 May 2000, Just van Rossum wrote:
> But then it's even better to *always* raise an exception, since it's
> entirely possible a string contains valid utf-8 while not *being* utf-8.
I believe it is time for me to make a truly radical proposal:
No automatic conversions between 8-bit "strings" and Unicode strings.
If you want to turn UTF-8 into a Unicode string, say so.
If you want to turn Latin-1 into a Unicode string, say so.
If you want to turn ISO-2022-JP into a Unicode string, say so.
Adding a Unicode string and an 8-bit "string" gives an exception.
I know this sounds tedious, but at least it stands the least possible
chance of confusing anyone -- and given all i've seen here and in
other i18n and l10n discussions, there's plenty enough confusion to
go around already.
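As a rough illustration of the proposal above, here is a minimal sketch in present-day Python 3, whose str and bytes types happen to behave this way. It is not the interpreter under discussion in this thread. The helper name try_mix and the sample strings are made up for the example.

```python
def try_mix(text, raw):
    """Try to add a text string and a byte string with no conversion."""
    try:
        return text + raw
    except TypeError as exc:
        # Python 3 refuses to guess an encoding and raises instead.
        return "refused: " + type(exc).__name__


# Sample data (hypothetical): the same idea stored in three encodings.
utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = "caf\u00e9".encode("latin-1")
jp_bytes = "\u65e5\u672c\u8a9e".encode("iso-2022-jp")

# "If you want to turn X into a Unicode string, say so."
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")
from_jp = jp_bytes.decode("iso-2022-jp")

# Adding a Unicode string and an 8-bit "string" gives an exception.
mixed = try_mix("abc", utf8_bytes)
```

The point of the sketch is only that every conversion names its encoding, and that the one place where none is named fails loudly rather than guessing.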
If it turns out automatic conversions *are* absolutely necessary,
then i vote in favour of the simple, direct method promoted by Paul
and Fredrik: just copy the numerical values of the bytes. The fact
that this happens to correspond to Latin-1 is not really the point;
the main reason is that it satisfies the Principle of Least Surprise.
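To make "just copy the numerical values of the bytes" concrete, the sketch below widens each byte to the code point with the same number and checks that the result agrees with a Latin-1 decode. It is written in present-day Python 3 as an illustration; the function name widen is an assumption of mine, not something proposed in this thread.

```python
def widen(raw):
    """Map each byte value N to the Unicode code point N, nothing more."""
    return "".join(chr(byte) for byte in raw)


# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

widened = widen(all_bytes)

# The copy is value-preserving: code point i came from byte i.
values_preserved = [ord(ch) for ch in widened] == list(all_bytes)

# It also happens to coincide with decoding the bytes as Latin-1.
same_as_latin1 = widened == all_bytes.decode("latin-1")
```

Under this rule no byte sequence can ever fail to convert, which is the sense in which it is simple and unsurprising; whether the resulting characters are the ones the author meant is a separate question.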
Okay. Feel free to yell at me now.
P. S. The scare-quotes when i talk about 8-bit "strings" expose my
sense of them as byte-buffers -- since that *is* all you get when you
read in some bytes from a file. If you manipulate an 8-bit "string"
as a character string, you are implicitly making the assumption that
the byte values correspond to the character encoding of the character
repertoire you want to work with, and that's your responsibility.
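A small sketch of why the bytes alone do not tell you the characters, again in present-day Python 3 and with sample bytes chosen for the example. The same two bytes decode without error under two encodings and yield different text, while a different byte sequence is rejected by one decoder and accepted by the other.

```python
# Two bytes read from some file; nothing in them records an encoding.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+00A9

# Both decodes succeed, so success alone proves nothing about the source.
both_valid_but_different = as_utf8 != as_latin1

# A lone 0xE9 byte is fine as Latin-1 but is not well-formed UTF-8.
lone = b"\xe9"
try:
    lone.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

lone_as_latin1 = lone.decode("latin-1")
```

The second half shows the case where a UTF-8 default would raise an exception; the first half shows the case where it would silently produce text that may not be what was written.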
P. P. S. If always having to specify encodings is really too much,
i'd probably be willing to consider a default-encoding state on the
Unicode class, but it would have to be a stack of values, not a