<div dir="ltr">On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald <<a href="mailto:peridot.faceted@gmail.com">peridot.faceted@gmail.com</a>> wrote:<br>><br>> On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer <<a href="mailto:shoyer@gmail.com">shoyer@gmail.com</a>> wrote:<br>>><br>>> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <<a href="mailto:robert.kern@gmail.com">robert.kern@gmail.com</a>> wrote:<br>>>><br>>>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <<a href="mailto:charlesr.harris@gmail.com">charlesr.harris@gmail.com</a>> wrote:<br>>>><br>>>> > The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in-memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.<br>>>><br>>>> The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.<br>>><br>>> It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text:<br>>> <a href="http://utf8everywhere.org/#myths">http://utf8everywhere.org/#myths</a><br>>><br>>> I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, UTF-8 needs to be one of them.<br>><br>> It seems to me that most of the requirements people have expressed in this thread would be satisfied by:<br>><br>> (1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.)<br>><br>> (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All Python encodings should be permitted. 
An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case. <br>><br>> (3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype uint8, but given we have the bytes type, accessing the data this way makes sense.<br><br>The void dtype is already there for this general purpose and mostly works, with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the scalars (fortunately, they do not appear to be mutable views). It also accepts `bytes` strings that are too short (pads with NULs) and too long (truncates). If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable.<br><br>--<br>Robert Kern</div>