[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 27 12:57:03 EDT 2017

2017-04-27 18:18 GMT+02:00 Chris Barker <chris.barker at noaa.gov>:

> On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet at gmail.com> wrote:
>
>> I remember advocating for UCS-4 adoption in the HDF5 library many years
>> ago (2007?), but I had no success and UTF-8 was decided to be the best
>> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
>> don't think there is any going back.
>>
>
> This is the key point -- we can argue all we want about the best encoding
> for fixed-length unicode-supporting strings (I think numpy and HDF have
> very similar requirements), but that is not our decision to make -- many
> other systems have chosen utf-8, so it's a really good idea for numpy to be
> able to deal with that cleanly and easily and consistently.
>

Agreed.  But it would also be a good idea to spread the word that a simple
UCS-4 encoding in combination with compression can be a perfectly good
system for storing large amounts of unicode data too.
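
A minimal sketch of that point, using only the Python standard library.
The sample words, the field width and the helper names (fixed_ucs4,
fixed_utf8) are hypothetical choices for illustration, and zlib stands in
for whatever compressor a real storage layer would use.  The data is highly
repetitive, so the ratios it prints are far better than real text would
give; it only shows that the zero bytes UCS-4 adds compress away cheaply.

```python
import zlib

# Hypothetical sample data: mostly-ASCII labels with a few non-ASCII ones.
words = ["temperature", "pressure", "salinity", "café", "niño"] * 2000
width = 11  # size of each fixed-length field (longest word above)


def fixed_ucs4(word):
    # 4 bytes per code point, NUL-padded to `width` characters.
    return word.ljust(width, "\0").encode("utf-32-le")


def fixed_utf8(word):
    # 1 to 4 bytes per code point, NUL-padded to `width` bytes.
    return word.encode("utf-8").ljust(width, b"\0")


raw_ucs4 = b"".join(fixed_ucs4(w) for w in words)
raw_utf8 = b"".join(fixed_utf8(w) for w in words)

comp_ucs4 = zlib.compress(raw_ucs4)
comp_utf8 = zlib.compress(raw_utf8)

print("UCS-4 raw/compressed:", len(raw_ucs4), len(comp_ucs4))
print("UTF-8 raw/compressed:", len(raw_utf8), len(comp_utf8))
```

Uncompressed, the UCS-4 buffer is four times the size of the UTF-8 one
(440000 versus 110000 bytes here); after compression the gap is much
smaller in absolute terms.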

>
> I have made many anti utf-8 points in this thread because while we need to
> deal with utf-8 for interplay with other systems, I am very sure that it is
> not the best format for a default, naive-user-of-numpy unicode-supporting
> dtype. Nor is it the best encoding for a mostly-ascii compact in memory
> format.
>

I resonate a lot with this feeling too :)
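
A small sketch of the trade-off, with hypothetical sample strings and a
hypothetical helper name (encoded_sizes).  It only reports how many bytes
each encoding needs per string; it says nothing about how a numpy dtype
would actually be implemented.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes per encoding (None if not encodable)."""
    sizes = {}
    for enc in ("latin-1", "utf-8", "utf-32-le"):
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            # latin-1 cannot represent code points above U+00FF.
            sizes[enc] = None
    return sizes


for text in ["numpy", "café", "日本語"]:
    print(text, encoded_sizes(text))
```

For pure ASCII, latin-1 and utf-8 need one byte per character and UCS-4
needs four.  For "café", utf-8 already needs more bytes (5) than characters
(4), so a fixed-length utf-8 field cannot be sized by character count
alone.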

>
> So I think numpy needs to support at least:
>
> utf-8
> latin-1
> UCS-4
>
> And maybe it should support one-byte encodings suitable for non-European
> languages, and maybe utf-16 for Java and Windows compatibility, and ....
>
> So that seems to point to "support as many encodings as possible". And
> python has the machinery to do so -- so why not?
>
> (I'm taking Julian's word for it that having a parameterized dtype would
> not have a major impact on current code)
>
> If we go with a string dtype parameterized by encoding, then we can pick
> sensible defaults, and let users use what they know best fits their
> use-cases.
>
> As for python2 -- it is on the way out, I think we should keep the 'U' and
> 'S' dtypes as they are for backward compatibility and move forward with the
> new one(s) in a way that is optimized for py3. And it would map to a py2
> Unicode type.
>
> The only catch I see in that is what to do with bytes -- we should have a
> numpy dtype that matches the bytes model -- fixed length bytes that map to
> python bytes objects. (This is almost what the void type is, yes?) But then
> under py2, would a bytes object (py2 string) map to numpy 'S' or numpy
> bytes objects??
>
> @Francesc: -- one more question for you:
>
> How important is it for pytables to match the numpy storage to the hdf
> storage byte for byte? i.e. would it be a killer if encoding / decoding
> happened every time at the boundary? I'm guessing yes, as this would have
> been solved long ago if not.
>

The PyTables team decided some time ago that it was a waste of time and
resources to maintain the internal HDF5 interface, and that it would be
better to switch to h5py for the low-level I/O communication with HDF5 (btw,
we just received a small NumFOCUS grant to continue the ongoing work on
this; thanks guys!).  This means that PyTables will be basically agnostic
about these sorts of encoding issues, and that the important package to take
into account for interfacing NumPy and HDF5 is just h5py.

-- 
Francesc Alted