[Numpy-discussion] proposal: smaller representation of string arrays
Charles R Harris
charlesr.harris at gmail.com
Tue Apr 25 14:18:57 EDT 2017
On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted at gmail.com> wrote:
> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern at gmail.com> wrote:
>> * HDF5 supports fixed-length and variable-length string arrays encoded in
>> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
>> the documentation claiming that there are more options). In practice, the
>> ASCII strings permit high-bit characters, but the encoding is unspecified.
>> Memory-mapping is rare (but apparently possible). The two major HDF5
>> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
>> HDF5 option. Compression is supported for fixed-length string arrays but
>> not variable-length string arrays.
>> * FITS supports fixed-length string arrays that are NULL-padded. The
>> strings do not have a formal encoding, but in practice, they are typically
>> mostly ASCII characters with the occasional high-bit character from an
>> unspecific encoding. Memory-mapping is a common practice. These arrays can
>> be quite large even if each scalar is reasonably small.
>> * pandas uses object arrays for flexible in-memory handling of string
>> columns. Lengths are not fixed, and None is used as a marker for missing
>> data. String columns must be written to and read from a variety of formats,
>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
>> with `unicode/str` objects instead of `bytes`.
>> * There are a number of sometimes-poorly-documented,
>> often-poorly-adhered-to, aging file format "standards" that include string
>> arrays but do not specify encodings, or such specification is ignored in
>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
>> difficult to perform.
>> * In Python 3 environments, `unicode/str` objects are rather more common,
>> and simple operations like equality comparisons no longer work between
>> `bytes` and `unicode/str`, making it difficult to work with numpy string
>> arrays that yield `bytes` scalars.
> It seems the greatest challenge is interacting with binary data from other
> programs and libraries. If we were living entirely in our own data world,
> Unicode strings in object arrays would generally be pretty satisfactory. So
> let's try to get what is needed to read and write other people's formats.
> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't
> map directly to numpy arrays; we can store it however we want, as
> conversion is necessary anyway.
> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
> other packages are waiting specifically for it. But specifying this
> requires two pieces of information: What is the encoding? and How is the
> length specified? I know they're not numpy-compatible, but FITS header
> values are space-padded; does that occur elsewhere? Are there other ways
> existing data specifies string length within a fixed-size field? There are
> some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
> PKCS#7, etc. - but those are probably too specialized to be needed here? We should make
> sure we can support all the ways that actually occur.
Agreed on UTF-8 fixed-byte-length strings, although I would tend
towards null-terminated.
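For concreteness, here is a rough sketch (the helper name and bytes are made
up) of how the fixed-width conventions mentioned above differ when decoding
an 8-byte field:

def decode_fixed(field, convention):
    # field: an 8-byte buffer; convention: one of the schemes discussed above
    if convention == 'null-terminated':    # everything after the first NUL is ignored
        field = field.split(b'\x00', 1)[0]
    elif convention == 'null-padded':      # only trailing NULs are stripped
        field = field.rstrip(b'\x00')
    elif convention == 'space-padded':     # FITS-header style
        field = field.rstrip(b' ')
    return field.decode('utf-8')

decode_fixed(b'abc\x00\x00\x00\x00\x00', 'null-padded')     # 'abc'
decode_fixed(b'abc     ', 'space-padded')                    # 'abc'
decode_fixed(b'abc\x00old\x00', 'null-terminated')           # 'abc'; stale bytes after the NUL are dropped

One practical difference is that a strictly null-terminated field gives up a
byte for the terminator, while padded fields can use the full width.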
For byte strings, it looks like we need a parameterized type. This is for
two uses: display and conversion to (Python) unicode. One could handle the
display and conversion using view and astype methods. For instance, we
currently have

In : a = array([1, 2, 3], uint8) + 0x30

In : a.view('S1')
array(['1', '2', '3'],
      dtype='|S1')

In : a.view('S1').astype('U')
array([u'1', u'2', u'3'],
      dtype='<U1')
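One caveat: as far as I know, the S -> U cast assumes ASCII, so non-ASCII
UTF-8 bytes currently have to be decoded element by element through Python
str objects. Roughly (just a sketch, the variable names are mine):

In : b = array([u'résumé'.encode('utf-8')], dtype='S8')   # 8 UTF-8 bytes
In : b.astype('U8')          # typically fails: the cast assumes ASCII
In : array([s.decode('utf-8') for s in b], dtype='U8')    # works, but goes through Python objects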