[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 20 15:34:24 EDT 2017

I suggest a new data type, 'text[encoding]', abbreviated 'T'.

1. Text can be cast to Python strings via decoding.

2. Conceptually, casting to Python bytes first casts to a string and then
calls encode(); the encoding stored in the metadata is used by default,
but a different encoding can be given to override it.
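
To make the two casts concrete, here is a rough sketch in plain Python. The
helper names (text_to_str, text_to_bytes) and the NUL-padding convention are
assumptions for illustration only; nothing like this exists in numpy today.

```python
def text_to_str(raw: bytes, encoding: str) -> str:
    """Cast 1: decode the stored bytes using the encoding from the metadata."""
    # Assumption: records are padded with trailing NUL bytes, as with 'S' dtypes.
    return raw.rstrip(b"\x00").decode(encoding)


def text_to_bytes(raw: bytes, stored_encoding: str, new_encoding: str = None) -> bytes:
    """Cast 2: go through a string first, then call encode()."""
    as_str = text_to_str(raw, stored_encoding)
    # The stored encoding is the default; the caller may override it.
    return as_str.encode(new_encoding or stored_encoding)


# A 16-byte latin-1 record holding "caf\u00e9".
record = "caf\u00e9".encode("latin-1").ljust(16, b"\x00")
as_text = text_to_str(record, "latin-1")
same_encoding = text_to_bytes(record, "latin-1")
re_encoded = text_to_bytes(record, "latin-1", "utf-8")
```

With the override, the result is a re-encoding of the same text, so its byte
length can differ from the stored one.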

I slightly favour 'T16' as a fixed-size text record backed by 16
bytes. This way over-allocation is explicitly delegated to the user,
which simplifies the numpy array implementation.
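
As a rough illustration of what "backed by 16 bytes" means for the user, the
sketch below emulates 'T16' on top of the existing 'S16' dtype. The function
encode_fixed is hypothetical, and the choice to raise on overflow rather than
truncate is my assumption, not part of the proposal.

```python
import numpy as np


def encode_fixed(strings, encoding="utf-8", nbytes=16):
    """Encode strings into fixed-size byte records, emulating a 'T16' array."""
    encoded = [s.encode(encoding) for s in strings]
    for s, b in zip(strings, encoded):
        # The limit is on encoded bytes, not characters, so the user has to
        # pick a record size large enough for the data and the encoding.
        if len(b) > nbytes:
            raise ValueError(
                f"{s!r} needs {len(b)} bytes in {encoding}, record holds {nbytes}"
            )
    return np.array(encoded, dtype=f"S{nbytes}")


arr = encode_fixed(["numpy", "na\u00efve"])
round_trip = arr[1].decode("utf-8")

# Nine characters, but 18 bytes in UTF-8: does not fit in a 16-byte record.
try:
    encode_fixed(["\u00e9" * 9])
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
```

The same nine characters would fit in a latin-1 record of 16 bytes, which is
why the encoding has to travel with the dtype.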

Yu

On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>>
>> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern <robert.kern at gmail.com>
>> wrote:
>>>
>>> I don't know of a format off-hand that works with numpy uniform-length
>>> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
>>> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
>>> UTF8 strings.
>>
>>
>> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
>> variable length versions:
>> https://github.com/PyTables/PyTables/issues/499
>> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>>
>> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
>> storage, not the number of characters.
>
> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
> of course are also constrained by numpy's current set of dtypes. The
> NULL-terminated ASCII works well enough with np.string's semantics.
>
> --
> Robert Kern