[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Wed Apr 26 00:20:46 EDT 2017


On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <charlesr.harris at gmail.com>
wrote:

> The maximum length of a UTF-8 character is 4 bytes, so we could use that
> to size arrays by character length. The advantage over UTF-32 is that it is
> easily compressible, probably by a factor of 4 in many cases. That doesn't
> solve the in-memory problem, but does have some advantages on disk as well
> as making for easy display. We could compress it ourselves after encoding
> by truncation.

The major use case that we have for a UTF-8 array is HDF5, and it specifies
the width in bytes, not Unicode characters.
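
To make the distinction concrete, here is a rough sketch in plain Python
(not NumPy or HDF5 API; the helper name utf8_byte_width and the sample
strings are made up for illustration). It compares the width needed when
sizing by UTF-8 bytes with the width reserved when sizing by character
count at 4 bytes per character:

```python
def utf8_byte_width(strings):
    """Smallest fixed width, in bytes, that holds every string as UTF-8."""
    return max(len(s.encode("utf-8")) for s in strings)


# Sample data (hypothetical): ASCII, Latin-1 range, CJK, and an emoji.
words = ["numpy", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F600"]

char_counts = [len(s) for s in words]                  # code points per string
byte_counts = [len(s.encode("utf-8")) for s in words]  # UTF-8 bytes per string

# Width when the field is specified in bytes.
byte_width = utf8_byte_width(words)

# Width when sizing by character count: 4 bytes per character, worst case.
char_sized_width = 4 * max(char_counts)

# UTF-32 always spends 4 bytes per code point.
utf32_width = max(len(s.encode("utf-32-le")) for s in words)

# Fixed-width records: null-pad each encoded string to byte_width.
records = [s.encode("utf-8").ljust(byte_width, b"\0") for s in words]
decoded = [r.rstrip(b"\0").decode("utf-8") for r in records]

print(char_counts, byte_counts)
print(byte_width, char_sized_width, utf32_width)
```

For this sample the longest string in characters ("numpy", 5 characters,
5 bytes) is not the longest in bytes (the 3-character CJK string, 9 bytes),
so a character count does not determine the byte width a file format
needs.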

--
Robert Kern