[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 27 12:18:47 EDT 2017

On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet at gmail.com> wrote:

> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is a go back
>

This is the key point -- we can argue all we want about the best encoding
for fixed-length unicode-supporting strings (I think numpy and HDF have
very similar requirements), but that is not our decision to make -- many
other systems have chosen utf-8, so it's a really good idea for numpy to be
able to deal with that cleanly and easily and consistently.

I have made many anti-utf-8 points in this thread because, while we need to
deal with utf-8 for interplay with other systems, I am very sure that it is
not the best format for a default, naive-user-of-numpy unicode-supporting
dtype. Nor is it the best encoding for a mostly-ascii, compact in-memory
format.
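
To make the size trade-off concrete, here is a small sketch using only
Python's built-in codecs. The sample strings and the helper name
encoded_size are just illustrations, not anything numpy provides:

```python
# Compare how many bytes the same text needs in each candidate encoding.
# "utf-32-le" stands in for UCS-4 (4 bytes per code point, no BOM).
samples = ["numpy", "café", "日本語"]


def encoded_size(text, encoding):
    """Return the encoded length in bytes, or None if not representable."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


for s in samples:
    sizes = {enc: encoded_size(s, enc)
             for enc in ("utf-8", "latin-1", "utf-32-le")}
    print(s, sizes)
```

The utf-8 size depends on the content, not just the number of characters,
which is what makes it awkward for a fixed-length field; latin-1 is one byte
per character but cannot represent everything.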

So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And maybe it should support a one-byte encoding suitable for non-European
languages, and maybe utf-16 for Java and Windows compatibility, and ....

So that seems to point to "support as many encodings as possible". And
python has the machinery to do so -- so why not?

(I'm taking Julian's word for it that having a parameterized dtype would
not have a major impact on current code)

If we go with a string dtype parameterized by encoding, then we can pick
sensible defaults, and let users choose what they know best fits their
use-cases.
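
As a rough sketch of what "parameterized by encoding" could mean for a
single fixed-width element (pure Python, not a numpy API; pack_fixed and
unpack_fixed are hypothetical names, and NUL padding is an assumption
borrowed from how 'S' behaves today):

```python
def pack_fixed(text, width, encoding="utf-8"):
    """Encode text into exactly `width` bytes, NUL-padded on the right."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError(
            f"{text!r} needs {len(raw)} bytes in {encoding}, "
            f"but the field holds {width}"
        )
    return raw.ljust(width, b"\x00")


def unpack_fixed(field, encoding="utf-8"):
    """Decode a fixed-width field and drop the trailing NUL padding.

    Decoding first and stripping afterwards keeps multi-byte encodings
    such as utf-32-le intact (width must be a multiple of the unit size).
    """
    return field.decode(encoding).rstrip("\x00")


for enc, width in (("utf-8", 8), ("latin-1", 8), ("utf-32-le", 16)):
    field = pack_fixed("café", width, enc)
    print(enc, len(field), unpack_fixed(field, enc))
```

The user picks the encoding and the byte width; the same text either fits
or fails loudly, rather than being silently truncated mid-character.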

As for python2 -- it is on the way out. I think we should keep the 'U' and
'S' dtypes as they are for backward compatibility and move forward with the
new one(s) in a way that is optimized for py3. Under py2, the new dtype
would map to the py2 unicode type.

The only catch I see in that is what to do with bytes -- we should have a
numpy dtype that matches the bytes model -- fixed length bytes that map to
python bytes objects. (This is almost what the void type is, yes?) But then
under py2, would a bytes object (py2 string) map to numpy 'S' or to numpy
bytes objects?

@Francesc: -- one more question for you:

How important is it for pytables to match the numpy storage to the hdf
storage byte for byte? i.e. would it be a killer if encoding / decoding
happened every time at the boundary? I'm guessing yes, as this would have
been solved long ago if not.
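
For reference, this is roughly what "encoding / decoding at the boundary"
looks like with what numpy has today. It is only a sketch using
np.char.encode / np.char.decode on made-up sample data, and says nothing
about how pytables actually does it:

```python
import numpy as np

# In-memory array using numpy's existing 'U' dtype (4 bytes per character).
in_memory = np.array(["café", "日本語"])

# Encode at the boundary: the result is an 'S' array of utf-8 bytes,
# i.e. what a utf-8 based file format would store.
stored = np.char.encode(in_memory, "utf-8")

# Decode again when reading back into memory.
round_tripped = np.char.decode(stored, "utf-8")

print(in_memory.dtype, in_memory.dtype.itemsize)
print(stored.dtype, stored.dtype.itemsize)
print((round_tripped == in_memory).all())
```

Each direction makes a full copy of the data, which is the cost in
question.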

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov