On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.kern@gmail.com> wrote:

On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:

> The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in-memory problem, but it does have some advantages on disk, as well as making for easy display. We could compress it ourselves after encoding by truncation.
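To put rough numbers on the sizing idea in the quoted paragraph, here is a minimal sketch in plain Python. It compares the worst-case allocation (4 bytes per character) with the bytes UTF-8 actually needs and with the UTF-32 size. The helper name `utf8_sizes` and the sample strings are made up for illustration; nothing here is an existing or proposed NumPy API.

```python
# Upper bound on the UTF-8 encoding of a single Unicode code point.
MAX_UTF8_BYTES_PER_CHAR = 4


def utf8_sizes(text):
    """Return (worst_case_bytes, actual_utf8_bytes, utf32_bytes) for text.

    Hypothetical helper: "worst case" is what a buffer sized by
    character length would have to reserve.
    """
    n_chars = len(text)  # number of code points in the str
    worst_case = MAX_UTF8_BYTES_PER_CHAR * n_chars
    actual = len(text.encode("utf-8"))
    # "utf-32-le" emits exactly 4 bytes per code point and no BOM.
    utf32 = len(text.encode("utf-32-le"))
    return worst_case, actual, utf32


# ASCII, Latin-1 range, CJK, and an astral-plane emoji.
samples = ["numpy", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F600"]
for s in samples:
    print(repr(s), utf8_sizes(s))
```

For ASCII-heavy data the actual UTF-8 size is a quarter of the reserved size, which is presumably where the "factor of 4" estimate comes from. For text outside ASCII the gap narrows, and for astral-plane characters it disappears.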

The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.

It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text:

I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, UTF-8 needs to be one of them.
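To make the bytes-versus-characters point concrete, here is a minimal sketch of fitting text into a fixed-width field whose width is counted in bytes, which is how the thread describes HDF5's convention. The function name `fit_utf8` and the NUL-padding choice are assumptions for illustration, not HDF5 or NumPy behaviour.

```python
def fit_utf8(text, width):
    """Encode text as UTF-8 into exactly `width` bytes.

    Hypothetical helper: truncates at a character boundary so that no
    multi-byte sequence is split, then pads with NUL bytes.
    """
    data = text.encode("utf-8")
    if len(data) > width:
        # Cutting valid UTF-8 at an arbitrary byte offset can only leave
        # an incomplete sequence at the very end; errors="ignore" drops it.
        data = data[:width].decode("utf-8", errors="ignore").encode("utf-8")
    return data.ljust(width, b"\x00")


field = fit_utf8("\u65e5\u672c\u8a9e", 8)  # three CJK characters, 9 bytes
print(len(field), field)
```

The same 8-byte field holds eight ASCII characters but only two of the three CJK characters here, so a width stated in characters would not say how much storage the field needs.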