[Python-Dev] Unicode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 17 May 2000 00:02:10 +0200

> perfectionist or not, I only want Python's Unicode support to
> be as intuitive as anything else in Python.  as it stands right
> now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I haven't much experience with Perl, but I don't think Tcl is
intuitive in this area. I really think that they got it all wrong.
They use the string type for "plain bytes", just as we do, but then
have the notion of "correct" and "incorrect" UTF-8 (i.e. strings with
violations of the encoding rules). For a "plain bytes" string, the
following might happen:

- the string is scanned for byte sequences that are not valid UTF-8
- if any are found, the string is converted into UTF-8, essentially
  treating the original string as Latin-1.
- it then continues to use the UTF-8 "version" of the original string,
  and converts it back on demand.
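
To make the steps above concrete, here is a rough sketch of that behaviour
as I understand it. It is not Tcl's actual implementation; it is written
in present-day Python 3, and the function names normalize_to_utf8 and
back_to_bytes are made up for illustration.

```python
def normalize_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: return the UTF-8 "version" of a plain-bytes string."""
    try:
        # Step 1: scan for violations of the UTF-8 encoding rules.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Step 2: treat the original bytes as Latin-1 and convert to UTF-8.
        return raw.decode("latin-1").encode("utf-8")


def back_to_bytes(utf8: bytes) -> bytes:
    """Hypothetical helper: step 3, convert the UTF-8 version back on demand."""
    return utf8.decode("utf-8").encode("latin-1")


latin1_input = b"caf\xe9"       # not valid UTF-8, so it gets converted
utf8_input = b"caf\xc3\xa9"     # already valid UTF-8, so it is left alone

print(normalize_to_utf8(latin1_input))
print(normalize_to_utf8(utf8_input))
print(back_to_bytes(normalize_to_utf8(utf8_input)))
```

Note that both inputs end up with the same UTF-8 "version", so converting
back cannot return the original bytes for both of them. That ambiguity is
what worries me.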

Maybe I got something wrong, but the Unicode support in Tcl makes me
worry very much.

> btw, I thought we'd all agreed on GvR's solution for 1.6?
> what did I miss?

I like the 'only ASCII is converted' approach very much, so I'm not
objecting to that solution - just as I wasn't objecting to the
previous one.

> so tell me, if "good enough" is what we're aiming at, why isn't
> my counter-proposal good enough?

Do you mean the one in


which I suppose is the same one as the "java-like approach"? AFAICT,
all it does is change the default encoding from UTF-8 to Latin-1.
I can't follow why this should be *better*, but it would certainly be
as good... In comparison, restricting the "character" interpretation
of the string type (in terms of your proposal) to 7-bit characters
has the advantage that it is less error-prone, as Guido points out.
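
A small sketch of why I consider the 7-bit restriction less error-prone.
It only compares what the three candidate default encodings do with the
same bytes; it is written in present-day Python 3 rather than against the
1.6 implementation, and try_decode is a hypothetical helper.

```python
def try_decode(raw: bytes, default_encoding: str):
    """Hypothetical helper: decode raw with the given default encoding.

    Returns the decoded text, or None if the bytes are rejected.
    """
    try:
        return raw.decode(default_encoding)
    except UnicodeDecodeError:
        return None


ascii_bytes = b"abc"
latin1_bytes = b"caf\xe9"       # "cafe" with e-acute, encoded as Latin-1
utf8_bytes = b"caf\xc3\xa9"     # the same text, encoded as UTF-8

for encoding in ("ascii", "utf-8", "latin-1"):
    for raw in (ascii_bytes, latin1_bytes, utf8_bytes):
        print(encoding, raw, repr(try_decode(raw, encoding)))
```

With "ascii" as the default, anything outside 7 bits is rejected, so a
wrong guess about the encoding shows up immediately. With "latin-1" every
byte string is accepted, and UTF-8 input is silently turned into the
wrong characters.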