[Python-ideas] Processing surrogates in
random832 at fastmail.us
Fri May 15 20:19:35 CEST 2015
On Thu, May 14, 2015, at 16:17, Andrew Barnert wrote:
> The point is that you can miscount lengths by counting the BOM, and you
> can split a BOM stream into a BOM stream and an "I hope it's in native
> order or we're screwed" stream.
Python provides no operations for splitting streams. You mention
re.split further on, but that only works on in-memory strings, which
should have already had the BOM stripped and been put in native order.
In-memory wide strings should _never_ be in an endianness other than the
machine's native one and should _never_ have a BOM. That should be taken
care of when reading it off the disk/wire. If you haven't done that, you
still have a byte array, and it's much harder to accidentally assume you
can split that up and pass the pieces to your fromstring function.
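To make the boundary concrete, here is a minimal sketch (the strings are illustrative): Python's "utf-16" codec consumes a leading BOM, uses it to pick the byte order, and strips it, so the resulting in-memory str is always in native form with no BOM.

```python
text = "naïve"

# Prepend the matching BOM to each encoding, as a UTF-16 file of
# either endianness would appear on disk or on the wire.
bom_be = b"\xfe\xff" + text.encode("utf-16-be")
bom_le = b"\xff\xfe" + text.encode("utf-16-le")

# The "utf-16" codec reads the BOM, picks the right byte order,
# and strips the BOM from the decoded result.
assert bom_be.decode("utf-16") == text
assert bom_le.decode("utf-16") == text
assert not bom_be.decode("utf-16").startswith("\ufeff")
```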
> Which is wrong, but it works when I test it,
> because most UTF-16 files are little-endian, and so is my machine. And
> then someone runs my app on a big-endian machine and they get a
> hard-to-debug exception (or, if we're really unlucky, silent mojibake,
> but that's pretty rare).
The proper equivalent of a UTF-16 file with a byte-order-mark would be a
_binary_ stream - a BytesIO on a _byte_ array containing a BOM and UTF-16. You can
layer a TextIOWrapper on top of either of them. And it never makes sense
to expect to be able to arbitrarily split up encoded byte arrays,
whether those are in UTF-16 or not.
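A short sketch of that layering (the sample text is illustrative): the byte stream carries the BOM, and the TextIOWrapper handles both the BOM and the endianness at the boundary, so what comes out is native text rather than byte pairs.

```python
import io

# A byte stream containing a BOM plus UTF-16 data -- the moral
# equivalent of a UTF-16 file with a byte-order mark.
raw = io.BytesIO("héllo wörld".encode("utf-16"))  # encode() prepends a BOM

# Layering a TextIOWrapper on top decodes at the boundary; the BOM
# and byte order never leak into the in-memory string.
text = io.TextIOWrapper(raw, encoding="utf-16").read()
assert text == "héllo wörld"
```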
> Usually, when you have WCHARs, it's because you opened a file and wread
> from it, or received UTF-16 over the network or from a Windows FooW API,
> in which case you have the same endianness issues as any other binary I/O
> on non-char-sized types. And yes, of course the right answer is to decode
> at input, but if you're doing that, why wouldn't you just decode to
> Unicode instead of byte-swapping the WCHARs?
You shouldn't have WCHARs (of any kind) in the first place until you've
decoded. If you're receiving UTF-16 of unknown endianness over the
network you should be receiving it as bytes. If you're directly calling
a FooW API, you are obviously on a win32 system and you've already got
native WCHARs in native endianness. But, once again, that wasn't really
my point.
My point is that there are no native libc functions for working with
UTF-8 strings - even if you're willing to presume that the native
multibyte character set is UTF-8, there are very few standard functions
for working with multibyte characters. "ASCII compatibility" means
you're going to write something using strchr or strtok that works for
ASCII characters and does something terrible when given non-ASCII
multibyte characters to search for.
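A sketch of that failure mode, in Python for clarity (the strings are made up): a strchr-style single-byte search inside UTF-8 data can "find" the trailing byte of one character inside the encoding of a completely different character.

```python
# UTF-8 continuation bytes are shared between unrelated characters:
e_acute = "é".encode("utf-8")          # b'\xc3\xa9'
copyright_sign = "©".encode("utf-8")   # b'\xc2\xa9'
assert e_acute[-1:] == copyright_sign[-1:] == b"\xa9"

haystack = "price: 5© off".encode("utf-8")

# Searching one byte at a time, as strchr does, reports a hit for the
# final byte of 'é' even though the text contains no 'é' at all.
assert haystack.find(b"\xa9") != -1
assert "é" not in "price: 5© off"
```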
The benefits of using libc only work if you play by libc's rules, which
we've established are inadequate. If you're _not_ going to use libc
string functions, then there's no reason not to prefer UTF-32 (when
you're not using the FSR, which is essentially a fancy immutable
container for UTF-32 code points) over UTF-8.
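To illustrate the "fancy immutable container for UTF-32 code points" point (a sketch, with illustrative strings): under the FSR a str is indexed by code point exactly as a UTF-32 buffer would be, while the storage width quietly adapts to the widest character present.

```python
import sys

narrow = "abc"            # stored 1 byte per code point
wide = "a\U0001F600b"     # an astral character forces 4 bytes per code point

assert len(wide) == 3             # counted in code points, not UTF-16 units
assert wide[1] == "\U0001F600"    # O(1) indexing by code point, as in UTF-32
assert sys.getsizeof(wide) > sys.getsizeof(narrow)  # storage width differs
```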