[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Wed Apr 26 20:17:29 EDT 2017

On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> But a bunch of folks have brought up that while we're messing around with
> string encoding, let's solve another problem:
> * Exchanging unicode text at the binary level with other systems that
>   generally don't use UCS-4.
> For THAT -- utf-8 is critical.
> But if I understand Julian's proposal -- he wants to create a
> parameterized text dtype that you can set the encoding on, and then numpy
> will use the encoding (and Python's machinery) to encode / decode when
> passing to/from Python strings.
> It seems this would support all our desires:
> * I'd get a latin-1 encoded type for compact representation of mostly-ascii
> * Thomas would get latin-1 for binary interchange with mostly-ascii data
> * The HDF-5 folks would get utf-8 for binary interchange (if we can work out
>   the null-padding issue)
> * Even folks that had weird Java or Windows-generated UTF-16 data files
>   could do the binary interchange thing....
> I'm now lost as to what the hang-up is.

The proposal is for only latin-1 and UTF-32 to be supported at first, and
the eventual support of UTF-8 will be constrained by specification of the
width in terms of characters rather than bytes, which conflicts with the
use cases of UTF-8 that have been brought forth.
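
To make the characters-versus-bytes conflict concrete, here is a minimal
sketch in plain Python. It uses no NumPy API; the helper names
(bytes_for_char_width, fits_in_byte_width) are hypothetical and exist only
for illustration. It assumes that a width given in characters has to reserve
the worst-case number of bytes per character, while a binary interchange
format typically declares its field width in bytes.

```python
# Worst-case bytes per character for each encoding. These are properties of
# the encodings themselves, not values taken from any NumPy API.
MAX_BYTES_PER_CHAR = {"latin-1": 1, "utf-32-le": 4, "utf-8": 4}


def bytes_for_char_width(n_chars, encoding):
    """Buffer size that guarantees room for n_chars characters (hypothetical helper)."""
    return n_chars * MAX_BYTES_PER_CHAR[encoding]


def fits_in_byte_width(text, n_bytes, encoding="utf-8"):
    """True if the encoded text fits a field declared as n_bytes bytes (hypothetical helper)."""
    return len(text.encode(encoding)) <= n_bytes


# Three strings, each exactly three characters long.
samples = ["abc", "n\u00e9e", "\u65e5\u672c\u8a9e"]

for s in samples:
    # The character count is the same, but the UTF-8 byte count is not.
    print(ascii(s), len(s), "chars,", len(s.encode("utf-8")), "UTF-8 bytes")

# A 3-character field needs 3 bytes in latin-1 but up to 12 in UTF-8.
print(bytes_for_char_width(3, "latin-1"), bytes_for_char_width(3, "utf-8"))
```

All three samples have the same width in characters, yet only the first fits
a 3-byte UTF-8 field, which is the situation a byte-sized interchange field
has to deal with.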


Robert Kern