[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 15:29:22 EDT 2017

On Apr 25, 2017 11:53 AM, "Robert Kern" <robert.kern at gmail.com> wrote:

On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.faceted at gmail.com> wrote:

>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me
until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N
non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up
to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.
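
To make the capacity difference concrete, here is a minimal sketch in plain
Python. The helper names pad_null_padded and pad_null_terminated are made up
for illustration; they are not numpy or HDF5 APIs.

```python
def pad_null_padded(data: bytes, width: int) -> bytes:
    """Store data in a NULL-padded field: up to `width` payload bytes."""
    if len(data) > width:
        raise ValueError("payload longer than field width")
    # Fill whatever space is left with NULs; none is reserved.
    return data.ljust(width, b"\x00")


def pad_null_terminated(data: bytes, width: int) -> bytes:
    """Store data in a NULL-terminated field: up to `width - 1` payload bytes."""
    if len(data) > width - 1:
        raise ValueError("no room left for the terminating NUL")
    # The last byte is always a NUL because the payload is at most width - 1.
    return data.ljust(width, b"\x00")


# A 4-byte payload fits a NULL-padded field of width 4 exactly ...
full_field = pad_null_padded(b"abcd", 4)
# ... but a NULL-terminated field of width 4 holds at most 3 payload bytes.
terminated_field = pad_null_terminated(b"abc", 4)
```

With width 4, pad_null_terminated(b"abcd", 4) raises ValueError, while
pad_null_padded(b"abcd", 4) stores all four bytes with no NUL at all.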

I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", which is effectively NULL-terminated without requiring the reserved
space.
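
The two readings can be sketched as two decoders applied to the same field.
Both function names are hypothetical, just for illustration; neither is taken
from any particular implementation.

```python
def decode_strip_trailing(field: bytes) -> bytes:
    """Strip trailing NULs only; embedded NULs are kept as data."""
    return field.rstrip(b"\x00")


def decode_stop_at_first(field: bytes) -> bytes:
    """Stop at the first NUL or at the end of the field, whichever comes first."""
    return field.split(b"\x00", 1)[0]


# An 8-byte field holding "ab", an embedded NUL, "cd", then NUL padding.
field = b"ab\x00cd\x00\x00\x00"

kept_embedded = decode_strip_trailing(field)   # b"ab\x00cd"
cut_at_first = decode_stop_at_first(field)     # b"ab"
```

The two decoders agree on any field without embedded NULs, including one
that is completely full, so the difference only shows up for data like the
example above.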

And to save anyone else having to check, numpy's current NUL-padded dtypes
only strip trailing NULs, so they can round-trip strings that contain NULs,
just not strings where NUL is the last character.
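
For anyone who wants to check this themselves, a quick sketch (assuming a
reasonably recent numpy; the variable names are only for illustration):

```python
import numpy as np

# Two 3-byte values: one with an embedded NUL, one with a trailing NUL.
a = np.array([b"a\x00b", b"ab\x00"], dtype="S3")

raw = a.tobytes()        # the underlying buffer still holds all six bytes
embedded = a[0]          # embedded NUL survives the round trip
trailing = a[1]          # trailing NUL is stripped on the way out

# The unicode dtype behaves the same way.
u = np.array(["a\x00b", "ab\x00"], dtype="U3")
u_embedded = u[0]
u_trailing = u[1]

print(repr(raw), repr(embedded), repr(trailing))
print(repr(u_embedded), repr(u_trailing))
```

The trailing NUL is still there in the buffer; it just can't be told apart
from padding when the element is read back.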

So the set of strings representable by str/bytes is a strict superset of
the set of strings representable by numpy U/S dtypes, which in turn is a
strict superset of the set of strings representable by a hypothetical
NUL-terminated dtype.

(Of course this doesn't matter for most practical purposes, because people
rarely make strings with embedded NULs.)

-n