[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 15:30:27 EDT 2017

On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
> >
> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
> peridot.faceted at gmail.com> wrote:
>
> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
> other packages are waiting specifically for it. But specifying this
> requires two pieces of information: What is the encoding? and How is the
> length specified? I know they're not numpy-compatible, but FITS header
> values are space-padded; does that occur elsewhere? Are there other ways
> existing data specifies string length within a fixed-size field? There are
> some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
> PKCS7, etc. - but they are probably too specialized to need? We should make
> sure we can support all the ways that actually occur.
> >
> >
> > Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.
>
> Just to clarify some terminology (because it wasn't originally clear to me
> until I looked it up in reference to HDF5):
>
> * "NULL-padded" implies that, for a fixed width of N, there can be up to N
> non-NULL bytes. Any extra space left over is padded with NULLs, but no
> space needs to be reserved for NULLs.
>
> * "NULL-terminated" implies that, for a fixed width of N, there can be up
> to N-1 non-NULL bytes. There must always be space reserved for the
> terminating NULL.
>
> I'm not really sure if "NULL-padded" also specifies the behavior for
> embedded NULLs. It's certainly possible to deal with them: just strip
> trailing NULLs and leave any embedded ones alone. But I'm also sure that
> there are some implementations somewhere that interpret the requirement as
> "stop at the first NULL or the end of the fixed width, whichever comes
> first", effectively being NULL-terminated just not requiring the reserved
> space.
>

Thanks for the clarification. NULL-padded is what I meant.
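
To make the two definitions concrete, here is a minimal sketch in plain
Python. The helper names are made up for illustration and are not part of
numpy or HDF5. The padded decoder uses the "strip trailing NULLs, keep
embedded ones" reading; the terminated decoder stops at the first NULL.

```python
def encode_null_padded(text, width, encoding="utf-8"):
    """Encode into exactly `width` bytes; all `width` bytes may be data."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError("encoded text does not fit in %d bytes" % width)
    return raw.ljust(width, b"\x00")


def encode_null_terminated(text, width, encoding="utf-8"):
    """Encode into exactly `width` bytes; one byte is reserved for the NULL."""
    raw = text.encode(encoding)
    if len(raw) > width - 1:
        raise ValueError("encoded text does not fit in %d bytes" % (width - 1))
    return raw.ljust(width, b"\x00")


def decode_null_padded(buf, encoding="utf-8"):
    """Strip trailing NULLs only; embedded NULLs are left alone."""
    return buf.rstrip(b"\x00").decode(encoding)


def decode_null_terminated(buf, encoding="utf-8"):
    """Stop at the first NULL, or at the end of the field if there is none."""
    return buf.split(b"\x00", 1)[0].decode(encoding)
```

With a width of 4, "abcd" fits in the NULL-padded field but is rejected by
the NULL-terminated encoder, and the two decoders disagree on a buffer such
as b"a\x00b\x00" that contains an embedded NULL.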

I'm wondering how much of the desired functionality we could get by simply
subclassing ndarray in python. I think we mostly want to be able to view
byte strings and convert to unicode if needed.
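
A rough sketch of what such a subclass might look like, assuming the data
is stored in the existing NULL-padded `S` dtype and decoded on request. The
class name `UTF8Array` and the method `to_unicode` are hypothetical, and
this only handles 1-D input; it is meant as a starting point for
discussion, not a proposed implementation.

```python
import numpy as np


class UTF8Array(np.ndarray):
    """Hypothetical ndarray subclass: fixed-width, NULL-padded UTF-8 bytes."""

    def __new__(cls, strings, width):
        # Encode each Python str to UTF-8 and check it fits in `width` bytes.
        encoded = [s.encode("utf-8") for s in strings]
        for s, b in zip(strings, encoded):
            if len(b) > width:
                raise ValueError("%r needs %d bytes, width is %d"
                                 % (s, len(b), width))
        # The 'S' dtype stores fixed-width byte strings padded with NULLs.
        return np.array(encoded, dtype="S%d" % width).view(cls)

    def to_unicode(self):
        # Decode to a regular unicode ('U') array only when asked.
        return np.char.decode(self.view(np.ndarray), "utf-8")
```

For example, `UTF8Array(["é", "abc"], 4)` stores 4 bytes per element
(the "é" takes two of them), and `to_unicode()` returns an ordinary
unicode array with the original strings.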

Chuck