[Python-Dev] Re: Plan to remove Py_UNICODE APIs except PEP 623.

June 30, 2020


      On 6/30/20 8:43 AM, Emily Bowman wrote:
...
I completely agree with this, that UTF-8 has become the One True
Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside
of the Win32 API. Nearly all basic emoji can't be represented in UCS-2
wchar_t, let alone composite emoji.
So how to make that C-compatible? Make everything a void* and it just
comes back with as many bytes as it gets?

Actually, in C you would tend to represent UTF-8 as a char* (or maybe an
unsigned char*). Plain ASCII strings are already valid UTF-8, and many
of the standard string functions work fine on UTF-8 strings; this was an
intentional part of the design of UTF-8. Anything that looks for a
specific ASCII byte value will tend to 'just work', because every byte
of a multi-byte sequence is 0x80 or above and so never collides with an
ASCII character. The code does need to take into account that
bytes != characters now: if you want to count how many characters are in
a string, you have to be aware of the encoding, and you must avoid
splitting a string in the middle of a code point. But a lot will still
just work.
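
As a rough illustration of the two cases that need care (counting
characters and splitting safely), here is a minimal sketch in C. The
function names utf8_codepoints and utf8_safe_prefix are hypothetical,
not part of any library, and the code assumes the input is already
valid, NUL-terminated UTF-8. It counts code points, not user-perceived
characters, so a composite emoji still counts as several.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: count code points by counting every byte that
 * is not a continuation byte. Assumes s is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (!is_continuation((unsigned char)*s))
            n++;
    }
    return n;
}

/* Hypothetical helper: return the largest byte length <= max_bytes at
 * which s can be cut without splitting a code point. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (max_bytes >= len)
        return len;
    /* If the first byte after the cut is a continuation byte, the cut
     * falls inside a sequence, so back up to its lead byte. */
    while (max_bytes > 0 && is_continuation((unsigned char)s[max_bytes]))
        max_bytes--;
    return max_bytes;
}
```

For the string "na\xC3\xAFve" (the word "naive" with a diaeresis on the
i, encoded as UTF-8), strlen() reports 6 bytes while utf8_codepoints()
reports 5, and asking utf8_safe_prefix() for at most 3 bytes gives 2,
since byte 3 would land inside the two-byte sequence. strchr(s, 'v')
still works unchanged, which is the 'just work' case.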

-- 
Richard Damon