[Python-ideas] Processing surrogates in

Andrew Barnert abarnert at yahoo.com
Thu May 14 22:17:15 CEST 2015


On May 14, 2015, at 07:45, random832 at fastmail.us wrote:

[snipping reply to Stephen J. Turnbull]

>> On Wed, May 13, 2015, at 14:18, Andrew Barnert wrote:
>> That's exactly how you create the problems this thread is trying to
>> solve.
> 
> The point I was getting at was more "you can't benefit from libc
> functions at all, therefore your argument for UTF-8 is bad" than "you
> should be using the native wchar_t type".

I'm not sure if this was Stephen's point, but _my_ point is not that it's easier to use UTF-16 incorrectly, but rather that it's just as easy to misuse, and that the mistakes are much more likely to get through unit testing and lead to a debugging nightmare later. The only bug that's easier to catch with UTF-16 is the incredibly obvious "why am I only seeing the first character of my filename" bug.
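
To make that concrete, here's a quick Python sketch of the kind of bug I mean (buggy_first_n is a hypothetical stand-in for C code that slices with wcsncpy on a 2-byte wchar_t):

    # Count and slice by UTF-16 code units, the way naive wcslen/wcsncpy
    # code on a 2-byte wchar_t would.
    def units(s):
        # Number of UTF-16 code units -- what wcslen would report.
        return len(s.encode('utf-16-le')) // 2

    def buggy_first_n(s, n):
        # Hypothetical helper: take the first n code units; this can
        # split a surrogate pair right down the middle.
        return s.encode('utf-16-le')[:2 * n].decode('utf-16-le', 'replace')

    print(units('spam'))                    # 4 -- passes all my tests
    print(units('\U0001F40D'))              # 2 -- one character, counted twice
    print(buggy_first_n('\U0001F40D', 1))   # '\ufffd' -- a split surrogate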

> Libc only has functions to
> deal with native char strings [but these do not generally count
> characters or respect character boundaries in multibyte character sets
> even if UTF-8 *is* the native multibyte character set] and native
> wchar_t strings, not any other kind of string.
> 
>> 
>> If you treat wchar_t as a "native wide char type" and call any of the wcs
>> functions on UTF-16 strings, you will count astral characters as two
>> characters, illegally split strings in the middle of surrogates, etc.
> 
> No worse than UTF-8. If you can solve these problems for UTF-8 you can
> solve them for UTF-16.
> 
>> And
>> you'll count BOMs as two characters and split them.
> 
> Wait, what? The BOM is a single code unit in UTF-16.

Sorry, that "two" was a stupid typo (or braino) for "one", which then changes the meaning of the rest of the paragraph badly.

The point is that you can miscount lengths by counting the BOM, and you can split a BOM stream into a BOM stream and an "I hope it's in native order or we're screwed" stream.
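
Both failure modes are a few lines of Python (assuming a little-endian machine for the exact byte values; the 'utf-16' codec assumes native order when there's no BOM, which is exactly the "I hope" part):

    data = 'hello world'.encode('utf-16')  # 2-byte BOM + 22 bytes of text
    print(len(data) // 2)                  # 12 -- the BOM counted as a "char"

    head, tail = data[:12], data[12:]      # split the stream in the middle
    print(head.decode('utf-16'))           # 'hello' -- the BOM names the order
    print(tail.decode('utf-16'))           # ' world' -- only if you guess right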

> There is *no*
> encoding in which a BOM is two code units (it's three in UTF-8). Anyway,
> BOM shouldn't be used for in-memory strings, only text files.

In a language with StringIO and socket.makefile and FTP and HTTP requests as transparent file-like objects and a slew of libraries that can take an open binary or text file or a bytes or str, that last point doesn't work as well.

For example, if I pass a binary file to your library's spam.parse function, I can expect that to be the same as reading the binary file and passing it to your spam.fromstring function. So, I may expect to be able to, say, re.split the document into smaller documents and pass them to spam.fromstring as well. Which is wrong, but it works when I test it, because most UTF-16 files are little-endian, and so is my machine. And then someone runs my app on a big-endian machine and gets a hard-to-debug exception (or, if we're really unlucky, silent mojibake, but that's pretty rare).
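
Here's a sketch of that bug in Python (spam.fromstring is hypothetical; re.split over the raw bytes is the mistake):

    import re

    doc = 'eggs\nham\n'.encode('utf-16')        # BOM + native-order code units

    # Naive: split the *bytes* on a newline encoded in my native order.
    parts = re.split('\n'.encode('utf-16-le'), doc)
    # On my little-endian box: [BOM + 'eggs', 'ham', ''] -- "works".
    # Against a big-endian file the pattern never matches (or matches
    # mid-code-unit), and only the first piece keeps the BOM anyway, so
    # the rest are "I hope it's native order" streams.
    for part in parts:
        print(part.decode('utf-16'))            # mojibake or worse on BE input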

>> These are basically
>> all the same problems you have using char with UTF-8, and more, and
>> harder to notice in testing (not just because you may not think to test
>> for astral characters, but because even if you do, you may not think to
>> test both byte orders).
> 
> Byte orders are not an issue for anything other than file I/O, and I'm
> not proposing using any type other than UTF-8 for *text files*, anyway,
> only in-memory strings.

Why do you want to use UTF-16 for in-memory strings? If you need to avoid the problems of UTF-8 (and can't use a higher-level Unicode API like Python's str type), you can use UTF-32, which solves all of the problems, or you can use UTF-16, which solves almost none of them, but makes them less likely to be caught in testing.
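
In Python terms, the difference is easy to show (UTF-32 is fixed-width, so code units are code points; UTF-16 only looks fixed-width until an astral character shows up):

    s = 'a\U0001F40Db'                        # three code points
    print(len(s.encode('utf-32-le')) // 4)    # 3 -- one unit per code point
    print(len(s.encode('utf-16-le')) // 2)    # 4 -- the snake takes two units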

There's a reason no new frameworks force you to use UTF-16 APIs and string types, only the ones that were originally written for UCS-2 and for which it's too late to change (Win32, Cocoa, Java, and a couple others).

>> Later versions of C and POSIX (as in later than what Python requires)
>> provide explicit __CHAR16_TYPE__ and __CHAR32_TYPE__, but they don't
>> provide APIs for analogs of strlen, strchr, strtok, etc. for those types,
>> so you have to be explicit about whether you're counting code points or
>> characters (and, if characters, how you're dealing with endianness).
> 
> There are no analogs of these for UTF-8 either. And endianness is not an
> issue for in-memory strings stored using any of these types.

Sure, if you've, say, explicitly encoded text to UTF-16-LE and want to treat it as UTF-16-LE, you don't need to worry about endianness; a WCHAR or char16_t is a WCHAR or char16_t. But why would you do that in the first place? 

Usually, when you have WCHARs, it's because you opened a file and read wide characters from it, or received UTF-16 over the network or from a Windows FooW API, in which case you have the same endianness issues as any other binary I/O on non-char-sized types. And yes, of course the right answer is to decode at input, but if you're doing that, why wouldn't you just decode to Unicode instead of byte-swapping the WCHARs?
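
In Python the decode-at-input version is one line, because the 'utf-16' codec reads the BOM for you (or you name the order explicitly with 'utf-16-be'/'utf-16-le'):

    raw = b'\xfe\xff\x00h\x00i'     # big-endian UTF-16 off the wire
    text = raw.decode('utf-16')     # BOM says big-endian; no byte-swapping
    print(text)                     # 'hi' -- now it's str, not code units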


