[I18n-sig] Re: [Python-Dev] Unicode debate

Tue, 2 May 2000 00:29:41 +0200

Guido van Rossum <guido@python.org> wrote:
> I just wish he made the point more eloquently.  The eff-bot seems to
> be in a crunchy mood lately...

I've posted a few thousand messages on this topic, most of which
seem to have been ignored.  if you'd read all my messages, and seen
all the replies, you'd be cranky too...

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

maybe, but it's a darn good axiom, and it's used by everyone else.
Perl uses it, Tcl uses it, XML uses it, etc.  see:

http://www.python.org/pipermail/python-dev/2000-April/005218.html

> I have a bunch of good reasons (I think) for liking UTF-8: it allows
> you to convert between Unicode and 8-bit strings without losses, Tcl
> uses it (so displaying Unicode in Tkinter *just* *works*...), it is
> not Western-language-centric.

the "Tcl uses it" is a red herring -- their internal implementation
uses 16-bit integers, and the external interface works very hard
to keep the "strings are character sequences" illusion.

in other words, the length of a string is *always* the number of
characters, the character at index i is *always* the i'th character
in the string, etc.

that's not true in Python 1.6a2.

(as for Tkinter, you only have to add 2-3 lines of code to make it
use 16-bit strings instead...)
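
a minimal sketch of the "strings are character sequences" rule, written in present-day Python 3 as an assumption (this thread is about 1.6a2, whose behaviour is not reproduced here).  it contrasts a character string with its UTF-8 encoded bytes; the variable names are illustrative only:

```python
# illustration only: a character string vs. its UTF-8 byte encoding
text = "bl\u00e5b\u00e4r"         # six characters, two of them non-ASCII
encoded = text.encode("utf-8")    # the two non-ASCII characters take 2 bytes each

char_count = len(text)            # counts characters
byte_count = len(encoded)         # counts bytes, which is a different number

third_char = text[2]              # index 2 is the third character
third_byte = encoded[2:3]         # index 2 is only the first half of that character

print(char_count, byte_count, repr(third_char), third_byte)
```

as long as len() and indexing are applied to the character string, they agree with what a reader would count by eye; applied to the encoded bytes, they do not.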

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

this is another red herring: my argument is that 8-bit strings should
contain unicode characters, using unicode character codes.  there
should be only one character repertoire, and that repertoire is
unicode.  for a definition of these terms, see:

http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit
character (just like you can only store 4294967296 different values
in a single 32-bit int).

to store larger values, use unicode strings (or long integers).

conversion from a small type to a large type always works; conversion
from a large type to a small one may result in an OverflowError.

it has nothing to do with encodings.
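
a rough sketch of that widening/narrowing analogy, again assuming present-day Python 3.  the helper names widen and narrow are hypothetical, and the "latin-1" codec is used here only because it maps byte value n to character code n.  note that Python 3 reports the failure as UnicodeEncodeError, not the OverflowError suggested above:

```python
def widen(data):
    """8-bit string -> character string; byte value n becomes character code n."""
    # never fails: every value 0..255 is a valid character code
    return data.decode("latin-1")


def narrow(text):
    """character string -> 8-bit string; only works if every code fits in 8 bits."""
    # raises UnicodeEncodeError if any character code is above 255
    return text.encode("latin-1")


wide = widen(bytes(range(256)))      # small -> large always works
round_trip = narrow(wide)            # codes 0..255 fit back into 8 bits

try:
    narrow("\u20ac")                 # code 0x20AC does not fit in 8 bits
    overflowed = False
except UnicodeEncodeError:
    overflowed = True
```

the round trip is lossless for codes 0..255, and the only failure mode is a value that is too large for the smaller type.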

> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

yeah, and I claim that it won't fly, as long as it breaks the "strings
are character sequences" rule used by all other contemporary (and
competing) systems.

(if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve
this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and
   Perl).  make sure len(s) returns the number of characters in the
   string, make sure s[i] returns the i'th character (not necessarily
   starting at the i'th byte, and not necessarily one byte), etc.  to
   make this run reasonably fast, use as many implementation tricks
   as you can come up with (I've described three ways to implement
   this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i])
   is a unicode character code, whether s is an 8-bit string or a
   unicode string.

for alternative 1 to work, you need to add some way to explicitly work
with binary strings (like it's done in Perl and Tcl).
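
alternative 1 could be sketched roughly as follows, in present-day Python 3.  the class name Utf8String is hypothetical, and the offset table is just one assumed implementation trick, not necessarily one of those described in the earlier post:

```python
class Utf8String:
    """hypothetical character-indexed view over UTF-8 encoded bytes."""

    def __init__(self, data):
        data.decode("utf-8")          # validate up front; raises on malformed input
        self._data = data
        # byte offsets where characters start: every byte that is not a
        # UTF-8 continuation byte (bit pattern 10xxxxxx)
        self._starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

    def __len__(self):
        # number of characters, not number of bytes
        return len(self._starts)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch only supports integer indexes")
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("character index out of range")
        start = self._starts[index]
        if index + 1 < len(self._starts):
            end = self._starts[index + 1]
        else:
            end = len(self._data)
        # the i'th character may span several bytes
        return self._data[start:end].decode("utf-8")


s = Utf8String("bl\u00e5b\u00e4r".encode("utf-8"))
print(len(s), repr(s[2]), repr(s[3]))
```

the table costs extra memory per string; a real implementation would likely skip it for pure-ASCII data, where byte offsets and character indexes coincide.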

alternative 2 doesn't need that; 8-bit strings can still be used to hold
any kind of binary data, as in 1.5.2.  just keep in mind you cannot
use all methods on such an object...
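
a small sketch of the alternative 2 rule, assuming present-day Python 3, where bytes stands in for the 8-bit string type.  the helpers char_code and same_text are hypothetical names used only for illustration:

```python
def char_code(s, i):
    """return the character code at index i, for 8-bit and unicode strings alike."""
    if isinstance(s, (bytes, bytearray)):
        return s[i]          # indexing bytes already yields an integer 0..255
    return ord(s[i])         # a character string yields its unicode code


def same_text(a, b):
    """two strings are the same text if they hold the same character codes."""
    return len(a) == len(b) and all(
        char_code(a, i) == char_code(b, i) for i in range(len(a))
    )


narrow_s = b"bl\xe5"         # 8-bit string; 0xE5 is read as character code U+00E5
wide_s = "bl\u00e5"          # unicode string holding the same three codes

print(char_code(narrow_s, 2) == char_code(wide_s, 2), same_text(narrow_s, wide_s))
```

under this rule no encoding step is involved: the 8-bit string is simply a string whose character codes all happen to fit in 8 bits.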

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

I still think it's very unfortunate that you think that unicode strings
are a special kind of strings.  Perl and Tcl don't, so why should we?

</F>