[Numpy-discussion] proposal: smaller representation of string arrays
Chris Barker
chris.barker at noaa.gov
Thu Apr 27 12:18:47 EDT 2017
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet at gmail.com> wrote:
> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is any going back.
>
This is the key point -- we can argue all we want about the best encoding
for fixed-length unicode-supporting strings (I think numpy and HDF have
very similar requirements), but that is not our decision to make -- many
other systems have chosen utf-8, so it's a really good idea for numpy to be
able to deal with that cleanly and easily and consistently.
I have made many anti-utf-8 points in this thread because, while we need to
deal with utf-8 for interplay with other systems, I am very sure that it is
not the best format for a default, naive-user-of-numpy unicode-supporting
dtype. Nor is it the best encoding for a mostly-ascii, compact in-memory
format.
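To make the fixed-length concern concrete, here is a minimal sketch in plain python (no numpy involved) of how many bytes a few strings need under each of the encodings discussed. The sample strings and the encoded_size helper are just illustrations I made up, not anything numpy provides:

```python
# Illustrative only: bytes needed per string under different encodings.
samples = ["numpy", "caf\u00e9", "\u65e5\u672c\u8a9e"]  # ascii, latin-1 range, CJK


def encoded_size(text, encoding):
    """Return the number of bytes needed, or None if not representable."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


for text in samples:
    sizes = {enc: encoded_size(text, enc)
             for enc in ("utf-8", "latin-1", "utf-32-le")}
    print(repr(text), "chars:", len(text), sizes)
```

With utf-8 the byte count per character varies from 1 to 4, so a field of N bytes does not hold a predictable number of characters; latin-1 and UCS-4 (utf-32 here) are fixed at 1 and 4 bytes per character, but latin-1 cannot represent the third sample at all.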
So I think numpy needs to support at least:
utf-8
latin-1
UCS-4
And maybe it should support a one-byte encoding suitable for non-European
languages, and maybe utf-16 for Java and Windows compatibility, and ....
So that seems to point to "support as many encodings as possible". And
python has the machinery to do so -- so why not?
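By "machinery" I mean the standard library codec registry. A quick sketch of what is already there; the particular list of codec names and the is_available helper are my own picks for illustration:

```python
import codecs

# Encodings mentioned in this thread, plus two one-byte / legacy examples.
requested = ["utf-8", "latin-1", "utf-32-le", "utf-16-le",
             "iso8859-6", "shift_jis"]


def is_available(name):
    """True if python's codec registry knows this encoding name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


for name in requested:
    print(name, is_available(name))

# A one-byte encoding for a non-European script: Arabic in ISO 8859-6.
arabic = "\u0633\u0644\u0627\u0645"
print(len(arabic.encode("iso8859-6")), len(arabic.encode("utf-8")))
```

The last line shows the compactness argument for one-byte encodings: the same four characters take 4 bytes in ISO 8859-6 and 8 bytes in utf-8.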
(I'm taking Julian's word for it that having a parameterized dtype would
not have a major impact on current code)
If we go with a string dtype parameterized by encoding, then we can pick
sensible defaults, and let users choose what they know best fits their
use-cases.
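No such parameterized dtype exists in numpy today, so the following is only a rough emulation of the idea on top of the existing 'S' dtype. The function names to_fixed_bytes and from_fixed_bytes are hypothetical, and I am assuming encodings such as utf-8 and latin-1 whose encoded text does not end in NUL bytes (the 'S' dtype strips trailing NULs, which would corrupt utf-32 data):

```python
import numpy as np


def to_fixed_bytes(strings, encoding, nbytes):
    """Hypothetical helper: encode strings into a fixed-width 'S' array."""
    encoded = [s.encode(encoding) for s in strings]
    too_long = [s for s, b in zip(strings, encoded) if len(b) > nbytes]
    if too_long:
        raise ValueError(
            f"{too_long!r} do not fit in {nbytes} bytes as {encoding}")
    return np.array(encoded, dtype=f"S{nbytes}")


def from_fixed_bytes(arr, encoding):
    """Hypothetical helper: decode a fixed-width 'S' array back to str."""
    return [b.decode(encoding) for b in arr.tolist()]


words = ["caf\u00e9", "numpy"]
latin = to_fixed_bytes(words, "latin-1", 5)   # 5 bytes per item
print(latin.dtype, from_fixed_bytes(latin, "latin-1"))
```

The same call with encoding "utf-8" and nbytes=4 raises, because the first word needs 5 bytes in utf-8: the user has to think in bytes rather than characters, which is the kind of choice a parameterized dtype would leave to people who know their data.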
As for python2 -- it is on the way out, so I think we should keep the 'U' and
'S' dtypes as they are for backward compatibility and move forward with the
new one(s) in a way that is optimized for py3. Under py2, the new dtype would
map to the py2 unicode type.
The only catch I see in that is what to do with bytes -- we should have a
numpy dtype that matches the bytes model -- fixed-length bytes that map to
python bytes objects (this is almost what the void type is, yes?). But then,
under py2, would a bytes object (py2 string) map to numpy 'S' or to the numpy
bytes dtype?
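For what it's worth, here is a small sketch of why 'S' is not quite the bytes model while the void type comes closer, viewing the same four raw bytes both ways (the sample value is arbitrary):

```python
import numpy as np

raw = b"ab\x00\x00"  # four bytes, the last two are NUL

as_s = np.frombuffer(raw, dtype="S4")  # numpy "string" view of the buffer
as_v = np.frombuffer(raw, dtype="V4")  # numpy void view of the same buffer

# 'S' treats trailing NULs as padding and drops them on item access.
print(as_s.tolist())
# void keeps all four bytes, but you have to ask for them explicitly.
print(as_v[0].tobytes())
```

So 'S' loses trailing NUL bytes on the way out, and void preserves them but hands back a numpy void scalar rather than a python bytes object.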
@Francesc -- one more question for you:
How important is it for pytables to match the numpy storage to the hdf
storage byte for byte? i.e. would it be a killer if encoding / decoding
happened every time at the boundary? I'm guessing yes, as this would have
been solved long ago if not.
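What I mean by encoding / decoding at the boundary is roughly the following, using the existing np.char functions. The stored array here just stands in for utf-8 bytes as they might come out of a file; it is not real pytables output:

```python
import numpy as np

# Stand-in for fixed-width utf-8 data as read from storage.
stored = np.array([b"caf\xc3\xa9", b"numpy"], dtype="S5")

# Decode once on the way in: the result is a regular 'U' (UCS-4) array.
in_memory = np.char.decode(stored, "utf-8")

# Encode again on the way out.
written = np.char.encode(in_memory, "utf-8")

print(in_memory.dtype.kind, in_memory.tolist())
print(written.tolist() == stored.tolist())
```

This works, but it makes a full copy in each direction and the in-memory 'U' array uses 4 bytes per character, so the cost grows with the size of the dataset.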
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov