[Python-Dev] Unicode debate

Fredrik Lundh <effbot@telia.com>
Wed, 3 May 2000 09:48:56 +0200


Tim Peters <tim_one@email.msn.com> wrote:
> [Moshe Zadka]
> > ...
> > I'd much prefer Python to reflect a fundamental truth about Unicode,
> > which at least makes sure binary-goop can pass through Unicode and
> > remain unharmed, then to reflect a nasty problem with UTF-8 (not
> > everything is legal).
>
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
> Unicode not yet having assigned a character to this position, it's that the
> standard explicitly makes this sequence illegal and guarantees it will
> always be illegal!

in context, I think what Moshe meant was that with a straight
character code mapping, any 8-bit string can always be mapped
to a unicode string and back again.

given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer
and back again.  (imagine having to do that on the bits and bytes
level!).
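
a minimal sketch of that round trip, assuming present-day Python 3 and
assuming latin-1 as a stand-in for the "default" mapping above (latin-1
maps byte value n to code point n).  the helper names are hypothetical
and only there for illustration:

```python
# Sketch only: latin-1 is an assumed stand-in for the "default" mapping.

def bytes_to_unicode(b: bytes) -> str:
    """Map each byte value n straight to code point n."""
    return b.decode("latin-1")

def unicode_to_bytes(u: str) -> bytes:
    """Inverse mapping; raises UnicodeEncodeError for code points above 255."""
    return u.encode("latin-1")

# every possible byte value, including runs that are not valid UTF-8
b = bytes(range(256))
u = bytes_to_unicode(b)

# the code points equal the original byte values
assert list(map(ord, u)) == list(b)
# and the round trip returns the original bytes unchanged
assert unicode_to_bytes(u) == b

# by contrast, decoding the same bytes as UTF-8 is rejected
try:
    b.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert utf8_ok is False
```

the point of the sketch is only that a one-to-one character code mapping
never rejects input, while a real encoding such as UTF-8 can.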

and again, the internal unicode encoding used by the unicode string
type itself, or when serializing that string type, has nothing to do
with that.
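
to illustrate that last point, here is a small sketch, again assuming
Python 3 and latin-1 as the stand-in character code mapping.  the
function name is made up for this example.  whichever encoding is used
to serialize the unicode string, the code points that come back are the
same, so the original bytes can still be recovered:

```python
# Sketch only: latin-1 is an assumed stand-in for the straight mapping.

def survives_serialization(b: bytes, codec: str) -> bool:
    """Check that serializing via `codec` leaves the mapped bytes intact."""
    u = b.decode("latin-1")            # bytes -> unicode, one code point per byte
    restored = u.encode(codec).decode(codec)  # serialize and read back
    return restored == u and restored.encode("latin-1") == b

data = bytes(range(256))
results = {
    codec: survives_serialization(data, codec)
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32")
}
assert all(results.values())
```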

</F>