[Numpy-discussion] proposal: smaller representation of string arrays
Charles R Harris
charlesr.harris at gmail.com
Tue Apr 25 14:18:57 EDT 2017
On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted at gmail.com> wrote:
> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern at gmail.com> wrote:
>> * HDF5 supports fixed-length and variable-length string arrays encoded in
>> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
>> the documentation claiming that there are more options). In practice, the
>> ASCII strings permit high-bit characters, but the encoding is unspecified.
>> Memory-mapping is rare (but apparently possible). The two major HDF5
>> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
>> HDF5 option. Compression is supported for fixed-length string arrays but
>> not variable-length string arrays.
>> * FITS supports fixed-length string arrays that are NULL-padded. The
>> strings do not have a formal encoding, but in practice, they are typically
>> mostly ASCII characters with the occasional high-bit character from an
>> unspecific encoding. Memory-mapping is a common practice. These arrays can
>> be quite large even if each scalar is reasonably small.
>> * pandas uses object arrays for flexible in-memory handling of string
>> columns. Lengths are not fixed, and None is used as a marker for missing
>> data. String columns must be written to and read from a variety of formats,
>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
>> with `unicode/str` objects instead of `bytes`.
>> * There are a number of sometimes-poorly-documented,
>> often-poorly-adhered-to, aging file format "standards" that include string
>> arrays but do not specify encodings, or such specification is ignored in
>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
>> difficult to perform.
>> * In Python 3 environments, `unicode/str` objects are rather more common,
>> and simple operations like equality comparisons no longer work between
>> `bytes` and `unicode/str`, making it difficult to work with numpy string
>> arrays that yield `bytes` scalars.
> It seems the greatest challenge is interacting with binary data from other
> programs and libraries. If we were living entirely in our own data world,
> Unicode strings in object arrays would generally be pretty satisfactory. So
> let's try to get what is needed to read and write other people's formats.
> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't
> map directly to numpy arrays; we can store it however we want, as
> conversion is necessary anyway.
> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
> other packages are waiting specifically for it. But specifying this
> requires two pieces of information: What is the encoding? and How is the
> length specified? I know they're not numpy-compatible, but FITS header
> values are space-padded; does that occur elsewhere? Are there other ways
> existing data specifies string length within a fixed-size field? There are
> some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
> PKCS#7, etc. - but those are probably too specialized to be needed here? We should make
> sure we can support all the ways that actually occur.
Agreed on UTF-8 fixed-byte-length strings, although I would tend
towards null-terminated.
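For concreteness, here is a rough sketch (the helper name and bytes are made
up) of how the fixed-width conventions mentioned above differ when decoding
an 8-byte field:

def decode_fixed(field, convention):
    # field: an 8-byte buffer; convention: one of the schemes discussed above
    if convention == 'null-terminated':    # everything after the first NUL is ignored
        field = field.split(b'\x00', 1)[0]
    elif convention == 'null-padded':      # only trailing NULs are stripped
        field = field.rstrip(b'\x00')
    elif convention == 'space-padded':     # FITS-header style
        field = field.rstrip(b' ')
    return field.decode('utf-8')

decode_fixed(b'abc\x00\x00\x00\x00\x00', 'null-padded')     # 'abc'
decode_fixed(b'abc     ', 'space-padded')                    # 'abc'
decode_fixed(b'abc\x00old\x00', 'null-terminated')           # 'abc'; stale bytes after the NUL are dropped

One practical difference is that a strictly null-terminated field gives up a
byte for the terminator, while padded fields can use the full width.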
For byte strings, it looks like we need a parameterized type. This is for
two uses: display and conversion to (Python) unicode. One could handle the
display and conversion using view and astype methods. For instance, we
currently have

In : a = array([1, 2, 3], uint8) + 0x30

In : a.view('S1')
array(['1', '2', '3'],
      dtype='|S1')

In : a.view('S1').astype('U')
array([u'1', u'2', u'3'],
      dtype='<U1')
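One caveat: as far as I know, the S -> U cast assumes ASCII, so non-ASCII
UTF-8 bytes currently have to be decoded element by element through Python
str objects. Roughly (just a sketch, the variable names are mine):

In : b = array([u'résumé'.encode('utf-8')], dtype='S8')   # 8 UTF-8 bytes
In : b.astype('U8')          # typically fails: the cast assumes ASCII
In : array([s.decode('utf-8') for s in b], dtype='U8')    # works, but goes through Python objects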