[Numpy-discussion] proposal: smaller representation of string arrays

Nathaniel Smith njs at pobox.com
Mon Apr 24 22:07:23 EDT 2017

On Apr 21, 2017 2:34 PM, "Stephan Hoyer" <shoyer at gmail.com> wrote:

> I still don't understand why a latin encoding makes sense as a preferred
> one-byte-per-char dtype. The world, including Python 3, has standardized on
> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

You may already know this, but probably not everyone reading does: the
reason why latin1 often gets special attention in discussions of Unicode
encoding is that latin1 is effectively "ucs1". It's the unique one-byte
text encoding in which byte N represents codepoint U+N.
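
To make the "byte N is codepoint U+N" property concrete, here is a minimal
sketch in plain Python. The variable names are just for illustration; the
only calls used are the standard bytes.decode / str.encode methods.

```python
# Under latin1, every byte value 0..255 decodes to the code point with the
# same number, so decoding is just widening each byte.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin1")
roundtrip_ok = [ord(ch) for ch in text] == list(range(256))

# UTF-8 agrees with latin1 only on the ASCII range (bytes 0..127).
ascii_same = bytes(range(128)).decode("utf-8") == text[:128]

# Outside ASCII the two diverge: U+00E9 is one byte in latin1, two in UTF-8.
e_acute_latin1 = "\u00e9".encode("latin1")
e_acute_utf8 = "\u00e9".encode("utf-8")
```

So for pure-ASCII data the two encodings give identical bytes, and the
difference only shows up for code points U+0080 and above.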

I can't think of any reason why this property is particularly important for
numpy's usage, because we always have a conversion step anyway to get data
in and out of an array. The potential arguments for latin1 that I can think
of are:
- if we have to implement our own en/decoding code for some reason, then
  it's the most trivial encoding
- if other formats standardize on latin1-with-nul-padding and we want
  in-memory/mmap compatibility
- if we really want a fixed-width encoding for some reason but don't care
  which one, then it's in some sense the most obvious choice

I can't think of many reasons why having a fixed width encoding is
particularly important though... For our current style of string storage,
even calculating the length of a string is O(n), and AFAICT the only way to
actually take advantage of the theoretical O(1) character indexing is to
make a uint8 view. I guess it would be useful if we had a string slicing
ufunc... But why would we?
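
As a rough illustration of the uint8-view trick mentioned above, here is a
small sketch using a fixed-width bytes array. The array contents and names
are made up for the example, and it assumes a one-byte-per-character
encoding so that byte position and character position coincide.

```python
import numpy as np

# Fixed-width storage: every element occupies 8 bytes, nul-padded on the right.
words = np.array([b"alpha", b"be", b"gamma"], dtype="S8")

# Reinterpret the same buffer as a (n_strings, itemsize) table of raw bytes.
raw = words.view(np.uint8).reshape(len(words), words.dtype.itemsize)

# Character k of every string is just column k of the byte table.
second_chars = raw[:, 1]

# Lengths are not stored anywhere; they have to be recovered by scanning
# each element for its padding.
lengths = np.char.str_len(words)
```

Here second_chars holds the byte values of "l", "e" and "a", and lengths
comes out as 5, 2 and 5.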

That said, AFAICT what people actually want in most use cases is support
for arrays that can hold variable-length strings, and the only place where
the current approach is *optimal* is when we need mmap compatibility with
legacy formats that use fixed-width-nul-padded fields (at which point it's
super convenient). It's not even possible to *represent* all Python strings
or bytestrings in current numpy unicode or string arrays (Python
strings/bytestrings can have trailing nuls). So if we're talking about
tweaks to the current system it probably makes sense to focus on this use
case specifically.
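
The trailing-nul limitation is easy to see in a short sketch (the values
here are purely illustrative): the padding byte and a "real" trailing nul
are indistinguishable, so the nul is dropped when the element is read back.

```python
import numpy as np

original = b"ab\x00"                 # a 3-byte Python bytestring
stored = np.array([original])        # fixed-width bytes array
recovered = stored[0]                # trailing nul is treated as padding

# The byte is still in the buffer, but element access cannot tell it
# apart from padding.
buffer_bytes = stored.tobytes()

# The same thing happens for unicode arrays.
str_recovered = np.array(["ab\x00"])[0]
```

After this runs, recovered compares equal to b"ab" rather than to the
original three-byte value, even though buffer_bytes still contains the nul.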

From context I'm assuming FITS files use fixed-width-nul-padding for
strings? Is that right? I know HDF5 doesn't.
