[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 15:36:27 EDT 2017

On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris <
charlesr.harris at gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern <robert.kern at gmail.com>
wrote:
>>
>> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:
>> >
>> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.faceted at gmail.com> wrote:
>>
>> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>> >
>> > Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.
>>
>> Just to clarify some terminology (because it wasn't originally clear to
me until I looked it up in reference to HDF5):
>>
>> * "NULL-padded" implies that, for a fixed width of N, there can be up to
N non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.
>>
>> * "NULL-terminated" implies that, for a fixed width of N, there can be
up to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.
>>
>> I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.
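
To make the terminology above concrete, here is a minimal sketch in plain Python of the two storage conventions and the two ways a reader might treat embedded NULLs. The function names are hypothetical and chosen only for illustration; they are not part of numpy or HDF5.

```python
def encode_null_padded(text, width):
    """Store up to `width` UTF-8 bytes; fill any leftover space with NULLs."""
    data = text.encode("utf-8")
    if len(data) > width:
        raise ValueError("encoded text does not fit in the field")
    return data.ljust(width, b"\x00")


def encode_null_terminated(text, width):
    """Store up to `width - 1` UTF-8 bytes; one byte is reserved for the NULL."""
    data = text.encode("utf-8")
    if len(data) > width - 1:
        raise ValueError("no room left for the terminating NULL")
    return data.ljust(width, b"\x00")


def decode_strip_trailing(field):
    """Strip trailing NULLs only; embedded NULLs are kept as data."""
    return field.rstrip(b"\x00").decode("utf-8")


def decode_stop_at_first_null(field):
    """Stop at the first NULL or the end of the field, whichever comes first."""
    return field.split(b"\x00", 1)[0].decode("utf-8")
```

With a width of 4, `encode_null_padded("abcd", 4)` fills the whole field, while `encode_null_terminated("abcd", 4)` raises because the fourth byte is reserved. The two decoders agree unless the field contains an embedded NULL.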
>
> Thanks for the clarification. NULL-padded is what I meant.

Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is
NULL-terminated.

> I'm wondering how much of the desired functionality we could get by
simply subclassing ndarray in python. I think we mostly want to be able to
view byte strings and convert to unicode if needed.

I'm not sure. Some of these fixed-width string arrays are embedded inside
structured arrays with other dtypes.
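
As a rough illustration of that situation, the sketch below puts a fixed-width byte-string field next to an integer field in a structured dtype and decodes the bytes as UTF-8 by hand. The field names and widths are made up for the example; it only assumes numpy's existing `S` dtype, which drops trailing NULLs when an element is read.

```python
import numpy as np

# Hypothetical record layout: a 4-byte integer followed by an 8-byte string field.
dt = np.dtype([("id", "<i4"), ("name", "S8")])

records = np.zeros(2, dtype=dt)
records["id"] = [1, 2]
# Store UTF-8 encoded bytes; "café" takes 5 bytes, so 3 bytes of NULL padding remain.
records["name"] = ["café".encode("utf-8"), b"numpy"]

# Reading an element of an "S" field drops trailing NULLs, so a plain decode suffices.
names = [raw.decode("utf-8") for raw in records["name"]]
```

A plain ndarray subclass that only reinterprets its own buffer would not directly cover a field like `name`, since that field is one part of a larger record.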

--
Robert Kern