[Python-Dev] Unicode

Fredrik Lundh <effbot@telia.com>
Wed, 17 May 2000 09:36:03 +0200

Martin v. Loewis wrote:
> > perfectionist or not, I only want Python's Unicode support to
> > be as intuitive as anything else in Python.  as it stands right
> > now, Perl and Tcl's Unicode support is intuitive.  Python's not.
> I haven't much experience with Perl, but I don't think Tcl is
> intuitive in this area. I really think that they got it all wrong.

"all wrong"?

Tcl works hard to maintain the "characters are characters" model
(implementation level 2), just like Perl.  the length of a string is
always the number of characters, slicing works as it should, the
internal representation is as efficient as you can make it.

but yes, they have a somewhat dubious autoconversion mechanism
in there.  if something isn't valid UTF-8, it's assumed to be Latin-1.

scary, huh?  not really, if you step back and look at how UTF-8 was
designed.  quoting from RFC 2279:

    "UTF-8 strings can be fairly reliably recognized as such by a
    simple algorithm, i.e. the probability that a string of characters
    in any other encoding appears as valid UTF-8 is low, diminishing
    with increasing string length."
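the autoconversion idea can be sketched in a few lines (illustrative
only; tcl's actual C implementation is of course different):

```python
def autoconvert(data: bytes) -> str:
    """Tcl-style autoconversion: try UTF-8 first; if the bytes are
    not valid UTF-8, assume they are Latin-1 instead."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # every byte sequence is valid Latin-1, so this cannot fail
        return data.decode("latin-1")

print(autoconvert("påske".encode("utf-8")))   # valid UTF-8 -> "påske"
print(autoconvert("påske".encode("latin-1"))) # not valid UTF-8 -> "påske"
```

both calls print the same string: the Latin-1 bytes b"p\xe5ske" are
not well-formed UTF-8 (a 3-byte lead followed by an ASCII letter), so
the fallback kicks in -- which is exactly the property the RFC is
talking about.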

besides, their design is based on the plan 9 rune stuff.  that code
was written by the inventors of UTF-8, who had this to say:

    "There is little a rune-oriented program can do when given bad
    data except exit, which is unreasonable, or carry on. Originally
    the conversion routines, described below, returned errors when
    given invalid UTF, but we found ourselves repeatedly checking
    for errors and ignoring them. We therefore decided to convert
    a bad sequence to a valid rune and continue processing.

    "This technique does have the unfortunate property that con-
    verting invalid UTF byte strings in and out of runes does not
    preserve the input, but this circumstance only occurs when
    non-textual input is given to a textual program."
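the same replace-and-continue behaviour is easy to demonstrate in
python (python spells the replacement rune U+FFFD and the policy
errors="replace"; the plan 9 libraries use U+FFFD, their "Runeerror",
the same way):

```python
# bytes 0xff and 0xfe can never appear in well-formed UTF-8
bad = b"abc\xff\xfedef"

# instead of raising on the bad sequences, map each one to a
# valid replacement rune and keep processing
text = bad.decode("utf-8", errors="replace")
print(text)  # 'abc\ufffd\ufffddef'

# the caveat from the quote: round-tripping does not preserve
# the original (non-textual) input
assert text.encode("utf-8") != bad
```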

so let's see: they aimed for a high level of unicode support (layer
2, stream encodings, and system api encodings, etc), they've based
their design on work by the inventors of UTF-8, they have several
years of experience using their implementation in real life, and you
seriously claim that they got it "all wrong"?

that's weird.

> AFAICT, all it does is to change the default encoding from UTF-8
> to Latin-1.

now you're using "all" in that strange way again...  check the archives
for the full story (hint: a conceptual design model isn't the same thing
as a C implementation)

> I can't follow why this should be *better*, but it would be certainly
> as good... In comparison, restricting the "character" interpretation
> of the string type (in terms of your proposal) to 7-bit characters
> has the advantage that it is less error-prone, as Guido points out.

the main reason for that is that Python 1.6 doesn't have any way to
specify source encodings.  add that, so you no longer have to guess
what a string *literal* really is, and that problem goes away.  but
that's something for 1.7.
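the ambiguity is easy to show: without a declared source encoding, the
same bytes in a source file mean different characters depending on
which encoding the reader assumes (the name "bjørn" here is just an
example):

```python
# the bytes as they might sit in a source file
literal_bytes = b"bj\xc3\xb8rn"

# read as UTF-8: the two-byte sequence \xc3\xb8 is one character
print(literal_bytes.decode("utf-8"))    # 'bjørn'

# read as Latin-1: the same two bytes are two characters
print(literal_bytes.decode("latin-1"))  # 'bjÃ¸rn'
```

with a source encoding declaration, the compiler no longer has to
guess, and the literal means the same thing everywhere.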