[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Tue Apr 25 14:52:08 EDT 2017

On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
> peridot.faceted at gmail.com> wrote:

>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
>> other packages are waiting specifically for it. But specifying this
>> requires two pieces of information: What is the encoding? and How is the
>> length specified? I know they're not numpy-compatible, but FITS header
>> values are space-padded; does that occur elsewhere? Are there other ways
>> existing data specifies string length within a fixed-size field? There are
>> some cryptographic length-specification tricks - ANSI X.923, ISO 10126,
>> PKCS7, etc. - but they are probably too specialized to need? We should make
>> sure we can support all the ways that actually occur.
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me
until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N
non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up
to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.
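
To make the difference concrete, here is a minimal Python sketch of the two
conventions as defined above. The helper names pack_null_padded and
pack_null_terminated are hypothetical, made up for this illustration; they are
not part of numpy or HDF5, and raising on overflow (rather than truncating) is
just one possible policy.

```python
def pack_null_padded(text: str, width: int) -> bytes:
    """NULL-padded: all `width` bytes may carry data; leftovers are NULs."""
    data = text.encode("utf-8")
    if len(data) > width:
        raise ValueError("encoded string does not fit in the field")
    return data.ljust(width, b"\x00")


def pack_null_terminated(text: str, width: int) -> bytes:
    """NULL-terminated: one byte is always reserved for the terminator."""
    data = text.encode("utf-8")
    if len(data) > width - 1:
        raise ValueError("no room left for the terminating NULL")
    return data.ljust(width, b"\x00")


# A 4-byte string fills a NULL-padded field of width 4 completely ...
print(pack_null_padded("abcd", 4))

# ... but a NULL-terminated field of width 4 holds at most 3 data bytes.
print(pack_null_terminated("abc", 4))
try:
    pack_null_terminated("abcd", 4)
except ValueError as exc:
    print("rejected:", exc)
```

Note that the limits are in encoded bytes, not characters, so with UTF-8 a
field of width N may hold fewer than N characters under either convention.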

I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated, just without requiring the reserved
space.
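
As a rough illustration of those two readings of a fixed-width field, the
sketch below decodes the same bytes both ways. The function names are
hypothetical, chosen only for this example, and it works on raw bytes so that
the question of encoding stays out of the picture.

```python
def unpack_strip_trailing(field: bytes) -> bytes:
    """Strip only trailing NULs; embedded NULs are kept as data."""
    return field.rstrip(b"\x00")


def unpack_stop_at_first(field: bytes) -> bytes:
    """Stop at the first NUL or at the end of the field, whichever is first."""
    return field.split(b"\x00", 1)[0]


# An 8-byte field with one embedded NUL and three bytes of padding.
field = b"ab\x00cd\x00\x00\x00"

print(unpack_strip_trailing(field))  # b'ab\x00cd'
print(unpack_stop_at_first(field))   # b'ab'

# With no embedded NULs the two interpretations agree, even when the
# field is completely full.
full = b"abcdefgh"
print(unpack_strip_trailing(full) == unpack_stop_at_first(full))  # True
```

The two only disagree when a NUL appears before the last non-NUL byte, which
is exactly the case a specification would need to pin down.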

Robert Kern