[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 13:07:55 EDT 2017

On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker <chris.barker at noaa.gov> wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
> the use-cases, so I'm not sure if there is any consensus on the use cases
> -- which I think we really do need to nail down first -- as Robert has made
> clear.
>
> So I'll try again -- use-case only! we'll keep the possible solutions
> separate.
>
> Do we need to write up a NEP for this? it seems we are going a bit in
> circles, and we really do want to capture the final decision process.
>
> 1) The default behaviour for numpy arrays of strings is compatible with
> Python3's string model: i.e. fully unicode supporting, and with a character
> oriented interface. i.e. if you do::

... etc.

These aren't use cases but rather requirements. I'm looking for something
rather more concrete than that.

* HDF5 supports fixed-length and variable-length string arrays encoded in
ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
the documentation claiming that there are more options). In practice, the
ASCII strings permit high-bit characters, but the encoding is unspecified.
Memory-mapping is rare (but apparently possible). The two major HDF5
bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
HDF5 option. Compression is supported for fixed-length string arrays but
not variable-length string arrays.

* FITS supports fixed-length string arrays that are NULL-padded. The
strings do not have a formal encoding, but in practice, they are typically
mostly ASCII characters with the occasional high-bit character from an
unspecified encoding. Memory-mapping is a common practice. These arrays can
be quite large even if each scalar is reasonably small.

* pandas uses object arrays for flexible in-memory handling of string
columns. Lengths are not fixed, and None is used as a marker for missing
data. String columns must be written to and read from a variety of formats,
including CSV, Excel, and HDF5, some of which are Unicode-aware and work
with `unicode/str` objects instead of `bytes`.

* There are a number of sometimes-poorly-documented,
often-poorly-adhered-to, aging file format "standards" that include string
arrays but either do not specify encodings or have encoding specifications
that are ignored in practice. This can make the usual "Unicode sandwich" at
the I/O boundaries difficult to perform.

* In Python 3 environments, `unicode/str` objects are rather more common,
and simple operations like equality comparisons no longer work between
`bytes` and `unicode/str`, making it difficult to work with numpy string
arrays that yield `bytes` scalars.
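
To make a few of these points concrete, here is a minimal pure-Python
sketch (standard library only, no numpy). The field width, the sample
values, and the choice of Latin-1 for the high-bit byte are assumptions
made up for illustration; they are not taken from any of the formats above.
It shows NULL-padded fixed-width fields, the `bytes` vs. `str` comparison
problem on Python 3, and why a high-bit byte of unknown encoding cannot
simply be decoded as UTF-8.

```python
import struct

# Hypothetical layout: two fixed-width, NULL-padded 8-byte string fields.
WIDTH = 8
record = struct.pack(f"{WIDTH}s{WIDTH}s", b"abc", "caf\u00e9".encode("latin-1"))

# Split the record into fields and strip the NULL padding.
fields = [record[i:i + WIDTH].rstrip(b"\x00") for i in range(0, len(record), WIDTH)]

# On Python 3, bytes and str never compare equal, even for pure ASCII.
same_as_text = fields[0] == "abc"

# The high-bit byte 0xE9 is not valid UTF-8 here, so a strict decode fails.
try:
    fields[1].decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# surrogateescape keeps undecodable bytes intact so they can be written back.
text = fields[1].decode("ascii", errors="surrogateescape")
roundtrip = text.encode("ascii", errors="surrogateescape")

# A fixed-length UTF-8 field is sized in bytes, not characters.
n_chars = len("caf\u00e9")
n_utf8_bytes = len("caf\u00e9".encode("utf-8"))
```

The `surrogateescape` round trip is only one possible way to carry bytes
of unknown encoding through `str`; whether it suits a given file format is
a separate question.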

--
Robert Kern