[Python-ideas] Processing surrogates in

Wed May 13 20:18:44 CEST 2015

On May 13, 2015, at 07:33, random832 at fastmail.us wrote:
> 
>> On Thu, May 7, 2015, at 18:30, Stephen J. Turnbull wrote:
>> Chris Barker writes:
>> 
>>> I've read many of the rants about UTF-16, but in fact, it's really
>>> not any worse than UTF-8
>> 
>> Yes, it is.  It's not ASCII compatible.  You can safely use the usual
>> libc string APIs on UTF-8 (except for any that might return only part
>> of a string), but not on UTF-16 (nulls).  This is a pretty big
>> advantage for UTF-8 in practice.
> 
> If you're using libc, why shouldn't you be using the native wide
> character types (whether that is UTF-16 or UCS-4) and using the wide
> string APIs?

That's exactly how you create the problems this thread is trying to solve.

If you treat wchar_t as a "native wide char type" and call any of the wcs functions on UTF-16 strings, you will count astral characters as two characters, illegally split strings in the middle of surrogate pairs, etc. And you'll count BOMs as characters and split them off. These are basically all the same problems you have using char with UTF-8, plus more, and they are harder to notice in testing: not just because you may not think to test for astral characters, but because even if you do, you may not think to test both byte orders.
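To make that concrete, here is a rough Python sketch (not C, and not any real libc call) that imitates what a wcslen-style count sees on a platform where wchar_t holds UTF-16 code units. The helper name utf16_code_units is made up for illustration, and the sketch assumes Python 3.3 or later, where len() on a str counts code points.

```python
import struct

def utf16_code_units(text, byteorder="little"):
    """Return the 16-bit code units of text, roughly as a wcs-style
    function would see them when wchar_t holds UTF-16 (hypothetical helper)."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    data = text.encode(codec)  # explicit byte order, so no BOM is added
    fmt = ("<" if byteorder == "little" else ">") + "H" * (len(data) // 2)
    return list(struct.unpack(fmt, data))

text = "a\U0001F600b"            # 'a', one astral character, 'b'
units = utf16_code_units(text)

code_unit_count = len(units)     # what a wcslen-style count would report
code_point_count = len(text)     # code points (Python 3.3+)

# Naively truncating to two code units cuts the surrogate pair in half,
# leaving a lone high surrogate at the end.
truncated = units[:2]
ends_in_lone_surrogate = 0xD800 <= truncated[-1] <= 0xDBFF
```

Here the code unit count is 4 while the code point count is 3, and the truncated sequence ends in an unpaired high surrogate, which is exactly the kind of split that is not legal UTF-16.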

And that's not even taking into account the fact that C explicitly allows wchar_t to be as small as 8 bits.

The Unicode and C standards both explicitly say that you should never use wchar_t for Unicode characters in portable code, only use it for storing the native characters of any wider-than-char locale encodings that a specific compiler supports.

Later versions of C and POSIX (as in later than what Python requires) provide explicit __CHAR16_TYPE__ and __CHAR32_TYPE__ (char16_t and char32_t in C11's <uchar.h>), but they don't provide APIs for analogs of strlen, strchr, strtok, etc. for those types, so you have to be explicit about whether you're counting code points or characters (and, if characters, how you're dealing with endianness).
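As a sketch of what "being explicit" might look like, here is a small Python function (the name count_utf16 is hypothetical, not part of any library) that takes raw UTF-16 bytes plus an explicit byte order and reports 16-bit code units and code points separately. It assumes well-formed input and simply skips low surrogates when counting code points.

```python
def count_utf16(data, byteorder):
    """Return (code_units, code_points) for raw UTF-16 bytes.

    byteorder is "little" or "big" and must be chosen by the caller.
    Assumes well-formed UTF-16: every low surrogate follows a high one.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data), 2)]
    # Each surrogate pair is one code point, so do not count the low half.
    code_points = sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)
    return len(units), code_points
```

For "a\U0001F600" encoded as utf-16-le, calling this with "little" gives (3, 2). Calling it with "big" on the same bytes gives (3, 3), because the misread units no longer fall in the surrogate range. Nothing fails loudly, which is why the byte order has to be stated rather than guessed.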