On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer <shoyer@gmail.com> wrote:
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.kern@gmail.com> wrote:
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in-memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.
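To make the size comparison concrete, here is a minimal Python sketch using only the standard library. The helper name `encoded_sizes` is made up for illustration; it simply reports how many code points a string has and how many bytes it occupies in UTF-8 and UTF-32.

```python
def encoded_sizes(text):
    """Return code point count and encoded byte counts for text.

    Hypothetical helper for illustration only.
    """
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-32-le" is used so that no byte order mark is added.
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


# One sample character for each possible UTF-8 width (1 to 4 bytes).
for ch in ["a", "\u00e9", "\u20ac", "\U0001d11e"]:
    print(repr(ch), encoded_sizes(ch))
```

For pure ASCII text such as "hello", UTF-8 takes 5 bytes against 20 for UTF-32, which is where a factor of about 4 can come from; for text dominated by non-ASCII characters the saving is smaller.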
The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.
It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths
I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, UTF-8 needs to be one of them.
It seems to me that most of the requirements people have expressed in this thread would be satisfied by:

(1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.)

(2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All Python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case.

(3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense.

There seems to be considerable debate about what the "default" string type should be, but since users must specify a length anyway, we might as well force them to specify an encoding and thus dodge the debate about the right default.

The other question - which I realize is how the thread started - is what to do about backward compatibility. I'm not writing the code, so my opinion doesn't matter much, but I think we're stuck maintaining what we have now - ASCII and UCS4 strings - for a while yet. But they can be deprecated, or simply reimplemented as shorthand names for ASCII- or UCS4-encoded strings in the bytes-with-encoding dtype.

Anne
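As a rough illustration of item (2), the following Python sketch shows one way to truncate encoded data without splitting a character and to store the result in a fixed-width, NULL-padded field. The function names `truncate_encoded`, `pack_fixed` and `unpack_fixed` are hypothetical and not part of NumPy. The sketch assumes a stateless encoding, and for fixed-width encodings such as UTF-32 it assumes the field width is a multiple of the code unit size.

```python
def truncate_encoded(text, encoding, max_bytes):
    """Encode text, keeping at most max_bytes without splitting a character."""
    data = text.encode(encoding)
    if len(data) <= max_bytes:
        return data
    # The bytes come from a valid encode, so only the tail can be incomplete;
    # errors="ignore" drops that partial trailing character.
    kept = data[:max_bytes].decode(encoding, errors="ignore")
    return kept.encode(encoding)


def pack_fixed(text, encoding, width):
    """Encode text into exactly width bytes, NULL-padded on the right."""
    return truncate_encoded(text, encoding, width).ljust(width, b"\x00")


def unpack_fixed(field, encoding):
    """Decode a NULL-padded field and strip the padding.

    Stripping happens after decoding: stripping raw zero bytes first would
    corrupt encodings like UTF-32-LE, where ordinary characters end in zeros.
    """
    return field.decode(encoding).rstrip("\x00")


field = pack_fixed("a\u20ac", "utf-8", 3)  # the euro sign needs 3 bytes
print(field, unpack_fixed(field, "utf-8"))
```

As with any NULL-padded scheme, a string that genuinely ends in NULL characters cannot be told apart from padding when it is read back.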