
On 6/30/20 8:43 AM, Emily Bowman wrote:
> I completely agree with this, that UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji.
> So how to make that C-compatible? Make everything a void* and it just comes back with as many bytes as it gets?
Actually, in C you would tend to represent UTF-8 as a char* (or maybe an unsigned char*). Note that plain 'ASCII' strings are also valid UTF-8, and that many of the string functions will actually work fine with UTF-8 strings. This was an intentional part of the design of UTF-8: every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character. Anything looking for specific character values will tend to 'just work', as long as those values really represent a character. The code does need to take into account that bytes != characters now, so if you want to actually count how many characters are in a string, or need to avoid splitting a string in the middle of a code point, you have to be aware of the encoding, but a lot will still just work.

-- Richard Damon
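
As a rough illustration of the points above, here is a minimal sketch in C. The helper names utf8_count and utf8_safe_prefix are made up for this example (they are not standard library functions), and the code assumes the input is already valid, NUL-terminated UTF-8. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a valid UTF-8 string.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a
 * new code point, so we simply count those. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: return the largest byte length <= max_bytes
 * at which the string can be cut without splitting a code point.
 * If the first excluded byte is a continuation byte, the cut would
 * land inside a sequence, so back up to the lead byte. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);   /* strlen counts bytes, not characters */
    if (len <= max_bytes)
        return len;
    size_t cut = max_bytes;
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        --cut;
    return cut;
}
```

With the string "h\xC3\xA9llo" (an 'e' with acute accent encoded as two bytes), strlen reports 6 bytes while utf8_count reports 5 code points, and utf8_safe_prefix with a 2-byte limit backs up to 1 rather than cutting the accented character in half. Searching for an ASCII byte such as '/' with plain strchr keeps working unchanged, since no byte of a multi-byte sequence can equal an ASCII value. Note that code points are still not the same thing as user-visible characters (composite emoji span several code points), so this only addresses the byte-versus-code-point part.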