On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern <robert.kern@gmail.com> wrote:
>>
>> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
>> >
>> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
>>
>> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.923, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur.
>> >
>> > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated.
>>
>> Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5):
>>
>> * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs.
>>
>> * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL.
>>
>> I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space.
>
> Thanks for the clarification. NULL-padded is what I meant.

Okay. However, the biggest use case we have for UTF-8 arrays (HDF5) is NULL-terminated.

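To make the difference concrete, here is a minimal sketch of the two conventions in plain Python. The helper names (encode_fixed, decode_null_padded, decode_null_terminated) are hypothetical and are not numpy or HDF5 API; the "terminated" decoder uses the "stop at the first NULL or the end of the field" reading, which is an assumption about how a given reader behaves.

```python
def encode_fixed(text, width, terminated=False, encoding="utf-8"):
    """Encode text into exactly `width` bytes, filling with NULs.

    NULL-padded: up to `width` payload bytes.
    NULL-terminated: up to `width - 1` payload bytes (one byte reserved).
    """
    data = text.encode(encoding)
    limit = width - 1 if terminated else width
    if len(data) > limit:
        raise ValueError(f"{len(data)} bytes do not fit in a limit of {limit}")
    return data.ljust(width, b"\x00")


def decode_null_padded(field, encoding="utf-8"):
    """Strip trailing NULs only; embedded NULs are left alone."""
    return field.rstrip(b"\x00").decode(encoding)


def decode_null_terminated(field, encoding="utf-8"):
    """Stop at the first NUL, or at the end of the field if there is none."""
    return field.split(b"\x00", 1)[0].decode(encoding)


# The two readings agree unless the payload contains an embedded NUL.
field = b"a\x00b\x00"
print(repr(decode_null_padded(field)))      # embedded NUL kept
print(repr(decode_null_terminated(field)))  # truncated at first NUL
```

Note that a 4-byte field holds "abcd" under NULL-padding but only "abc" under NULL-termination, so the same dtype width means different capacities depending on which convention the file format uses.
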
> I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed.

I'm not sure. Some of these fixed-width string arrays are embedded inside structured arrays with other dtypes.

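For illustration, a sketch of that situation, assuming a made-up record layout (the field names "id" and "name" and the 8-byte width are mine, not from any real file format). The UTF-8 bytes live in an S8 field next to an integer field, so the string handling has to work per field rather than per array:

```python
import numpy as np

# Hypothetical layout: a 4-byte integer id plus an 8-byte field of UTF-8 bytes.
rec_dtype = np.dtype([("id", "<i4"), ("name", "S8")])

records = np.zeros(2, dtype=rec_dtype)
records["id"] = [1, 2]
records["name"] = ["née".encode("utf-8"), "日本".encode("utf-8")]

# The S dtype drops trailing NULs when items are read out, so each item
# comes back as NULL-padded bytes that can then be decoded as UTF-8.
names = [raw.decode("utf-8") for raw in records["name"].tolist()]
print(names)
print(rec_dtype.itemsize)  # 4 bytes for the id + 8 bytes for the name field
```

The decode step here is done by hand on one field; an ndarray subclass that decodes on access would not obviously apply to a field selected out of a structured array like this.
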
--
Robert Kern