[Python-Dev] Unicode

Andrew M. Kuchling akuchlin@mems-exchange.org
Tue, 16 May 2000 16:10:22 -0400 (EDT)

Fredrik Lundh writes:
>perfectionist or not, I only want Python's Unicode support to
>be as intuitive as anything else in Python.  as it stands right
>now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I don't know about Tcl, but Perl 5.6's Unicode support is still
considered experimental.  Consider the following excerpts, for
example.  (And Fredrik's right; we shouldn't release a 1.6 with broken
support, or we'll pay for it for *years*...  But if GvR's ASCII
proposal is considered OK, then great!)


>Ah, yes. Unicode. But after two years of work, the one thing that users
>will want to do - open and read Unicode data - is still not there.
>Who cares if stuff's now represented internally in Unicode if they can't
>read the files they need to.

This is a "big" (as in "huge") disappointment for me as well.  I hope
we'll do better next time.

But given that interpretation, I'm amazed at how many operators seem
to be broken with UTF8.  It certainly supports Ilya's contention of

Here's another example:
  DB<1> x (256.255.254 . 257.258.259) eq (
0  ''

Rummaging with Devel::Peek shows that in this case, it's the fault of
the . operator.

And eq is broken as well:

  DB<11> x "\x{100}" eq "\xc4\x80"
0  1
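
For contrast, here is a minimal sketch of the behaviour one would expect
from that comparison.  It is written in present-day Python 3, not in Perl
5.6 or the Python 1.6 under discussion, and the variable names are purely
illustrative.  It assumes only that a character string and its UTF-8 byte
encoding are kept as distinct things:

```python
# One code point, U+0100 (the Perl example writes it as "\x{100}").
text = "\u0100"

# Its UTF-8 encoding: the two bytes 0xC4 0x80.
raw = text.encode("utf-8")

# A *different* string: two code points, U+00C4 and U+0080, which only
# happen to share numeric values with the bytes above.
lookalike = "\xc4\x80"

print(len(text), len(raw), len(lookalike))
print(text == lookalike)            # the two strings are not equal
print(raw.decode("utf-8") == text)  # decoding the bytes recovers the string
```

The point of the sketch is that equality is decided on code points, so the
internal encoding of one string never makes it equal to another.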



A couple of problems here: passage through a hash key removes the UTF8
flag (as might be expected).  Even if keys were to attempt to restore
the UTF8 flag (a la Convert::UTF::decode_utf8) or hash keys were real
SVs, what then do you do with $h{"\304\254"} and the like?


1. Leave things as they are, but document UTF8 hash keys as experimental
and subject to change.

or 2. When under use bytes, leave things as they are.  Otherwise, have
keys turn on the utf8 flag if appropriate.  Also give a warning when
using a hash key like "\304\254" since keys will in effect return a
different string that just happens to have the same internal encoding.
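
To make the hash-key ambiguity concrete, here is a small sketch in
present-day Python 3 (again, not Perl, and not the Python 1.6 being
debated).  The dictionary `h` is hypothetical.  The bytes "\304\254" are
the UTF-8 encoding of the single character U+012C, so the question is
whether the two keys below collide:

```python
h = {}

# A character key: one code point, U+012C.
h["\u012c"] = "character key"

# A byte key: the two bytes 0xC4 0xAC, written with the same octal
# escapes as in the Perl example.
h[b"\304\254"] = "byte key"

# The byte key is exactly the UTF-8 encoding of the character key...
print("\u012c".encode("utf-8") == b"\304\254")

# ...yet the two remain separate entries, and iterating over the
# dictionary hands back each key with its original type.
print(len(h))
print([type(k).__name__ for k in h])
```

Keeping the type attached to the key is what avoids the "different string
with the same internal encoding" problem described in option 2.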