[Python-ideas] Processing surrogates in

random832 at fastmail.us
Thu May 14 16:45:50 CEST 2015


On Wed, May 13, 2015, at 13:45, Stephen J. Turnbull wrote:
> random832 at fastmail.us writes:
> 
>  > If you're using libc, why shouldn't you be using the native wide
>  > character types (whether that is UTF-16 or UCS-4) and using the wide
>  > string APIs?
> 
> Who says you are using libc?

If you're not using libc, then "You can safely use the usual libc string
APIs" is not a benefit.

> You might be writing an operating system
> or a shell script.  And if you do use the native wide character type,
> you're guaranteed not to be portable because some systems have wide
> characters that are actually variable width and others don't, as you just
> pointed out.  Or you might have an ancient byte-oriented program you
> want to use.

Using UTF-8 *without* ensuring that the native multibyte character set
is UTF-8 [by setting the locale appropriately] and that it is supported
end-to-end (by your program, by the curses library if applicable, by the
terminal if applicable) just turns obvious problems into subtle ones -
not exactly an improvement.
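
(For concreteness, "setting the locale appropriately" means something
roughly like the following in C on a POSIX system; checking
nl_langinfo(CODESET) is just one way to verify the native multibyte
character set, not the only one.)

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Adopt the user's locale instead of the default "C" locale. */
        setlocale(LC_ALL, "");

        /* Verify that the native multibyte character set really is
           UTF-8 before treating byte strings as UTF-8. */
        const char *codeset = nl_langinfo(CODESET);
        if (strcmp(codeset, "UTF-8") != 0) {
            fprintf(stderr, "locale charset is %s, not UTF-8\n", codeset);
            return 1;
        }
        return 0;
    }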

> I'm not saying that UTF-8 is a panacea; just that every problem that
> UTF-8 has, UTF-16 also has -- but UTF-16 does have problems that UTF-8
> doesn't.  Specifically, surrogates and ASCII incompatibility.

ASCII incompatibility is a feature, not a bug - it prevents you from
doing stupid things that cause subtle bugs.

On Wed, May 13, 2015, at 14:18, Andrew Barnert wrote:
> That's exactly how you create the problems this thread is trying to
> solve.

The point I was getting at was more "you can't benefit from libc
functions at all, therefore your argument for UTF-8 is bad" than "you
should be using the native wchar_t type". Libc only has functions to
deal with native char strings [but these do not generally count
characters or respect character boundaries in multibyte character sets
even if UTF-8 *is* the native multibyte character set] and native
wchar_t strings, not any other kind of string.
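
To illustrate (a sketch, assuming a UTF-8 locale has been selected;
error handling kept minimal):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");          /* assume this picks a UTF-8 locale */

        const char *s = "caf\xc3\xa9";  /* "café": 5 bytes, 4 characters */

        /* strlen() counts bytes, not characters. */
        printf("strlen: %zu\n", strlen(s));      /* prints 5 */

        /* Counting characters means converting to wchar_t first. */
        wchar_t wbuf[16];
        size_t n = mbstowcs(wbuf, s, 16);
        if (n != (size_t)-1)
            printf("characters: %zu\n", n);      /* prints 4 */
        return 0;
    }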

> 
> If you treat wchar_t as a "native wide char type" and call any of the wcs
> functions on UTF-16 strings, you will count astral characters as two
> characters, illegally split strings in the middle of surrogates, etc.

No worse than UTF-8. If you can solve these problems for UTF-8, you can
solve them for UTF-16.
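
For example, counting code points is essentially the same loop in both
encodings; a sketch using C11's char16_t (the function names here are
made up for illustration):

    #include <stddef.h>
    #include <uchar.h>

    /* UTF-8: a code point starts at every byte that is not a
       continuation byte (10xxxxxx). */
    static size_t utf8_codepoints(const char *s, size_t nbytes)
    {
        size_t count = 0;
        for (size_t i = 0; i < nbytes; i++)
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                count++;
        return count;
    }

    /* UTF-16: a code point starts at every unit that is not a low
       (trailing) surrogate (0xDC00-0xDFFF). */
    static size_t utf16_codepoints(const char16_t *s, size_t nunits)
    {
        size_t count = 0;
        for (size_t i = 0; i < nunits; i++)
            if (s[i] < 0xDC00 || s[i] > 0xDFFF)
                count++;
        return count;
    }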

> And
> you'll count BOMs as two characters and split them.

Wait, what? The BOM is a single code unit in UTF-16. There is *no*
encoding in which a BOM is two code units (it's three in UTF-8). Anyway,
BOM shouldn't be used for in-memory strings, only text files.
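
(In C11 string-literal terms, for what it's worth:)

    #include <uchar.h>

    /* U+FEFF, the BOM, in each encoding form. */
    char16_t bom16[] = u"\uFEFF";   /* 1 code unit in UTF-16 */
    char32_t bom32[] = U"\uFEFF";   /* 1 code unit in UTF-32 */
    char     bom8[]  = u8"\uFEFF";  /* 3 code units in UTF-8: EF BB BF */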

> These are basically
> all the same problems you have using char with UTF-8, and more, and
> harder to notice in testing (not just because you may not think to test
> for astral characters, but because even if you do, you may not think to
> test both byte orders).

Byte order is not an issue for anything other than file I/O, and I'm not
proposing any encoding other than UTF-8 for *text files* anyway; the
question here is only about in-memory strings.

> Later versions of C and POSIX (as in later than what Python requires)
> provide explicit __CHAR16_TYPE__ and __CHAR32_TYPE__, but they don't
> provide APIs for analogs of strlen, strchr, strtok, etc. for those types,
> so you have to be explicit about whether you're counting code points or
> characters (and, if characters, how you're dealing with endianness).

There are no analogs of these for UTF-8 either. And endianness is not an
issue for in-memory strings stored using any of these types.
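
(And writing the missing analogs is trivial in any case; for instance, a
strlen() counterpart for char32_t strings. The name c32slen is made up;
it is not a standard library function.)

    #include <stddef.h>
    #include <uchar.h>

    /* Length in code units of a zero-terminated char32_t string;
       for char32_t, code units and code points coincide. */
    static size_t c32slen(const char32_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }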

