[Python-Dev] Unicode byte order mark decoding

Stephen J. Turnbull stephen at xemacs.org
Fri Apr 8 04:22:50 CEST 2005


>>>>> "MvL" == "Martin v. Löwis" <martin at v.loewis.de> writes:

    MvL> This would also support your use case, and in a better way.
    MvL> The Unicode assertion that UTF-16 is BE by default is void
    MvL> these days - there is *always* a higher layer protocol, and
    MvL> it more often than not specifies (perhaps not in English
    MvL> words, but only in the source code of the generator) that the
    MvL> default should be LE.

That is _not_ a protocol.  A protocol is a published specification,
not merely a frequent accident of implementation.  Anyway, both ISO
10646 and the Unicode standard consider that "internal use" and there
is no requirement at all placed on those data.  And such generators
typically take great advantage of that freedom---have you looked in a
.doc file recently?  Have you noticed how many different options
(previous implementations) of .doc are offered in the Import menu?

>>>>> "MAL" == "M.-A. Lemburg" <mal at egenix.com> writes:

    MAL> I've checked the various versions of the Unicode standard
    MAL> docs: it seems that the quote you have was silently
    MAL> introduced between 3.0 and 4.0.

Probably because ISO 10646 was _always_ BE until the standards were
unified.  But note that ISO 10646 standardizes only use as a
communications medium.  Neither ISO 10646 nor Unicode makes any
specification about internal usage.  Conformance in internal
processing is a matter of the programmer's convenience in producing
conforming output.

    MAL> Python currently uses version 3.2.0 of the standard and I
    MAL> don't think enough people are aware of the change in the
    MAL> standard

There's only one (corporate) person that matters: Microsoft.

    MAL> By the time we switch to 4.1 or later, we can then make the
    MAL> change in the native UTF-16 codec as you requested.

While in principle I sympathize with Nick, pragmatically Microsoft is
unlikely to conform.  They will take the position that files created
by Windows are "internal" to the Windows environment, except where
explicitly intended for exchange with arbitrary platforms, and only
then will they conform.  As Martin points out, that is what really
matters for these defaults.  I think you should look to see what
Microsoft does.

    MAL> Personally, I think that the Unicode consortium should not
    MAL> have introduced a default for the UTF-16 encoding byte
    MAL> order. Using big endian as default in a world where most
    MAL> Unicode data is created on little endian machines is not very
    MAL> realistic either.

It's not a default for the UTF-16 encoding byte order.  It's a default
for the UTF-16 encoding byte order _when UTF-16 is a communications
medium_.  Given that the generic network byte order is big-endian, I
think it would be insane to specify little-endian as Unicode's default.
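
To make the distinction concrete, here is a minimal Python sketch of a
decoder for UTF-16 data arriving from outside the process.  The name
decode_utf16_interchange is hypothetical and not part of any library;
the only assumption is the rule discussed above, that a stream without
a byte order mark is read as big-endian.

```python
import codecs


def decode_utf16_interchange(data: bytes) -> str:
    """Decode UTF-16 bytes used as a communications medium (sketch)."""
    # An explicit byte order mark always wins.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # Assumption: with no BOM, fall back to big-endian, the default
    # for interchange discussed in the text above.
    return data.decode("utf-16-be")
```

Nothing here constrains how the string is stored once decoded; the
default applies only to the bytes on the wire.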

With Unicode's default the same as network byte order, you represent
UTF-16 strings internally as an array of uint16_t, and when you put
them on the wire (including saving them to a file that might be put on
the wire as octet-stream) you apply htons(3) to each element.  On
reading, you apply ntohs(3) to each element.  The source code is
portable, the file is portable.  How can you beat that?
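
The same idea can be sketched in Python, using socket.htons and
socket.ntohs in place of the C calls.  The helper names to_wire and
from_wire are hypothetical, and the sketch assumes that array type
code "H" is exactly 16 bits wide, which holds on common platforms but
is not guaranteed.

```python
import socket
import sys
from array import array

# Assumption: "H" (unsigned short) is 16 bits on this platform.
assert array("H").itemsize == 2

# Codec that yields code units in the machine's own byte order.
NATIVE_CODEC = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def to_wire(text: str) -> bytes:
    """Convert a string to big-endian UTF-16 bytes via htons."""
    units = array("H")
    units.frombytes(text.encode(NATIVE_CODEC))  # native 16-bit code units
    wire = array("H", (socket.htons(u) for u in units))
    return wire.tobytes()


def from_wire(data: bytes) -> str:
    """Convert big-endian UTF-16 bytes back to a string via ntohs."""
    units = array("H")
    units.frombytes(data)
    native = array("H", (socket.ntohs(u) for u in units))
    return native.tobytes().decode(NATIVE_CODEC)
```

On either kind of machine, to_wire("A") produces the two bytes 00 41,
and from_wire reverses it; no code path needs to know the host's byte
order beyond what htons and ntohs already encapsulate.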

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

