"MvL" == "Martin v. Löwis" firstname.lastname@example.org writes:
MvL> This would also support your usecase, and in a better way.
MvL> The Unicode assertion that UTF-16 is BE by default is void
MvL> these days - there is *always* a higher layer protocol, and
MvL> it more often than not specifies (perhaps not in English
MvL> words, but only in the source code of the generator) that the
MvL> default should be LE.
That is _not_ a protocol. A protocol is a published specification, not merely a frequent accident of implementation. Anyway, both ISO 10646 and the Unicode standard consider that "internal use" and there is no requirement at all placed on those data. And such generators typically take great advantage of that freedom---have you looked in a .doc file recently? Have you noticed how many different options (previous implementations) of .doc are offered in the Import menu?
"MAL" == "M.-A. Lemburg" email@example.com writes:
MAL> I've checked the various versions of the Unicode standard
MAL> docs: it seems that the quote you have was silently
MAL> introduced between 3.0 and 4.0.
Probably because ISO 10646 was _always_ BE until the standards were unified. But note that ISO 10646 standardizes only use as a communications medium. Neither ISO 10646 nor Unicode makes any specification about internal usage. Conformance in internal processing is a matter of the programmer's convenience in producing conforming output.
MAL> Python currently uses version 3.2.0 of the standard and I
MAL> don't think enough people are aware of the change in the
MAL> standard
There's only one (corporate) person that matters: Microsoft.
MAL> By the time we switch to 4.1 or later, we can then make the
MAL> change in the native UTF-16 codec as you requested.
While in principle I sympathize with Nick, pragmatically Microsoft is unlikely to conform. They will take the position that files created by Windows are "internal" to the Windows environment, except where explicitly intended for exchange with arbitrary platforms, and only then will they conform. As Martin points out, that is what really matters for these defaults. I think you should look to see what Microsoft does.
MAL> Personally, I think that the Unicode consortium should not
MAL> have introduced a default for the UTF-16 encoding byte
MAL> order. Using big endian as default in a world where most
MAL> Unicode data is created on little endian machines is not very
MAL> realistic either.
It's not a default for the UTF-16 encoding byte order in general. It's a default for the UTF-16 encoding byte order _when UTF-16 is a communications medium_. Given that the generic network byte order is big-endian, I think it would be insane to specify little-endian as Unicode's default.
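To make the "communications medium" default concrete, here is a minimal sketch of a reader that honors a byte order mark when one is present and otherwise falls back to big-endian. The helper name `decode_utf16_wire` is hypothetical and is not part of Python's codec machinery; the sketch only illustrates the rule under discussion and says nothing about what the native `utf-16` codec does or should do.

```python
import codecs


def decode_utf16_wire(data: bytes) -> str:
    """Decode UTF-16 received from outside the process (hypothetical helper).

    A leading BOM decides the byte order; unmarked data is treated as
    big-endian, which is the default being argued about in this thread.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: assume network (big-endian) order.
    return data.decode("utf-16-be")
```

Data that stays inside one program never passes through a function like this, which is the sense in which the default places no requirement on internal representation.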
With Unicode's default byte order the same as the network's, you represent UTF-16 strings internally as an array of uint16_t, and when you put them on the wire (including saving them to a file that might later be put on the wire as octet-stream) you apply htons(3) to each element. On reading, you apply ntohs(3) to each element. The source code is portable, the file is portable. How can you beat that?
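A rough sketch of that scheme, written in Python rather than C so it can be tried from the interpreter. The function names `code_units`, `to_wire` and `from_wire` are made up for illustration; `socket.htons`, `socket.ntohs` and the `struct` calls are the standard library ones. The list of integers stands in for the uint16_t array.

```python
import socket
import struct


def code_units(text):
    """Split a str into UTF-16 code units held as host-order integers."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Outside the BMP: emit a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
    return units


def to_wire(units):
    """Apply htons to every unit, then dump the result in native layout."""
    # htons swaps bytes on little-endian hosts and is a no-op on
    # big-endian ones, so the bytes written are big-endian either way.
    return b"".join(struct.pack("=H", socket.htons(u)) for u in units)


def from_wire(data):
    """Read native 16-bit values and apply ntohs to each one."""
    return [socket.ntohs(u) for (u,) in struct.iter_unpack("=H", data)]
```

On any host the output of `to_wire` should match what the explicit `utf-16-be` codec produces, and `from_wire` should return the original code units, with no test for the host's endianness anywhere in the source.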