[Numpy-discussion] proposal: smaller representation of string arrays

Anne Archibald peridot.faceted at gmail.com
Tue Apr 25 13:34:37 EDT 2017


On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern at gmail.com> wrote:

> * HDF5 supports fixed-length and variable-length string arrays encoded in
> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
> the documentation claiming that there are more options). In practice, the
> ASCII strings permit high-bit characters, but the encoding is unspecified.
> Memory-mapping is rare (but apparently possible). The two major HDF5
> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
> HDF5 option. Compression is supported for fixed-length string arrays but
> not variable-length string arrays.
>
> * FITS supports fixed-length string arrays that are NULL-padded. The
> strings do not have a formal encoding, but in practice, they are typically
> mostly ASCII characters with the occasional high-bit character from an
> unspecific encoding. Memory-mapping is a common practice. These arrays can
> be quite large even if each scalar is reasonably small.
>
> * pandas uses object arrays for flexible in-memory handling of string
> columns. Lengths are not fixed, and None is used as a marker for missing
> data. String columns must be written to and read from a variety of formats,
> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
> with `unicode/str` objects instead of `bytes`.
>
> * There are a number of sometimes-poorly-documented,
> often-poorly-adhered-to, aging file format "standards" that include string
> arrays but do not specify encodings, or such specification is ignored in
> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
> difficult to perform.
>
> * In Python 3 environments, `unicode/str` objects are rather more common,
> and simple operations like equality comparisons no longer work between
> `bytes` and `unicode/str`, making it difficult to work with numpy string
> arrays that yield `bytes` scalars.
>

It seems the greatest challenge is interacting with binary data from other
programs and libraries. If we were living entirely in our own data world,
Unicode strings in object arrays would generally be pretty satisfactory. So
let's try to get what is needed to read and write other people's formats.

I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map
directly to numpy arrays; since conversion is necessary anyway, we can store
them however we want.
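
To make that concrete, here is a minimal sketch of the two obvious in-memory
choices for variable-width text fields: an object array of `str`, or a
fixed-width bytes array holding UTF-8. The sample values and variable names
are made up for illustration; only existing numpy calls are used.

```python
import numpy as np

# Hypothetical variable-width fields, e.g. parsed from one CSV column.
fields = ["alpha", "β", "gamma-ray"]

# Option 1: an object array of Python str; no fixed width, no padding.
as_objects = np.array(fields, dtype=object)

# Option 2: encode to UTF-8 and store in a fixed-width bytes dtype.
# The width is measured in bytes, not characters ("β" takes two bytes).
encoded = [s.encode("utf-8") for s in fields]
width = max(len(b) for b in encoded)
as_fixed = np.array(encoded, dtype=f"S{width}")

# Shorter entries are padded with NUL bytes in the underlying buffer.
raw = as_fixed.tobytes()

# Converting back to str is an explicit, per-array decode step.
decoded = np.char.decode(as_fixed, "utf-8")
```

Either way a conversion happens at the I/O boundary; the only question is
which representation sits in memory afterwards.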

Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other
packages are waiting specifically for it. But specifying this requires two
pieces of information: what is the encoding, and how is the length
specified? I know they're not numpy-compatible, but FITS header values are
space-padded; does that occur elsewhere? Are there other ways existing data
specifies string length within a fixed-size field? There are some
cryptographic padding schemes that encode the length (ANSI X9.23, ISO 10126,
PKCS#7, etc.), but they are probably too specialized to be worth supporting.
We should make sure we can support all the ways that actually occur.
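
As a rough illustration of the length conventions mentioned above, the
following sketch packs a UTF-8 string into a fixed-size field three ways:
NUL padding, space padding, and a PKCS#7-style scheme in which every pad byte
holds the pad count. The function names are hypothetical and plain `bytes`
are used rather than any numpy dtype; this is not a proposed API.

```python
def pad_fill(data: bytes, width: int, fill: bytes) -> bytes:
    """Right-pad data to width bytes with a single fill byte."""
    if len(data) > width:
        raise ValueError("data does not fit in the field")
    return data.ljust(width, fill)


def unpad_fill(field: bytes, fill: bytes) -> bytes:
    """Strip trailing fill bytes; any genuine trailing fill bytes are lost."""
    return field.rstrip(fill)


def pad_pkcs7(data: bytes, width: int) -> bytes:
    """Append n bytes, each of value n, so that the total length is width."""
    n = width - len(data)
    if not 1 <= n <= 255:
        # At least one pad byte is always needed to carry the count.
        raise ValueError("data does not fit in the field")
    return data + bytes([n]) * n


def unpad_pkcs7(field: bytes) -> bytes:
    """Read the pad count from the last byte and drop that many bytes."""
    n = field[-1]
    if n < 1 or n > len(field) or field[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return field[:-n]


payload = "né".encode("utf-8")          # 3 bytes for 2 characters

nul_field = pad_fill(payload, 5, b"\x00")
space_field = pad_fill(payload, 5, b" ")
pkcs7_field = pad_pkcs7(payload, 5)

# Fill-byte padding cannot represent a value that ends in the fill byte.
lossy = unpad_fill(pad_fill(b"a ", 4, b" "), b" ")
# The count-based scheme round-trips it, at the cost of one byte of capacity.
exact = unpad_pkcs7(pad_pkcs7(b"a ", 4))
```

The difference that matters for a dtype is the last pair: NUL or space
padding is ambiguous for values ending in the fill byte, while a
count-carrying scheme is unambiguous but can store at most `width - 1` bytes.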

Anne