[Numpy-discussion] One-byte string dtype: third time's the charm?

Sun Feb 22 13:21:43 EST 2015

The idea of a one-byte string dtype has been extensively discussed twice
before, with a lot of good input and ideas, but no action [1, 2].

tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte string
dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
usage in the near term?

A key consequence of not having a one-byte string dtype is that handling
ASCII data stored in binary formats such as HDF or FITS is basically broken
in Python 3.  Packages like h5py, pytables, and astropy.io.fits all return
text data arrays with the numpy 'S' type, and in fact have no direct
support for the numpy wide unicode 'U' type.  In Python 3, an 'S' type
array cannot be meaningfully compared with a Python str, so something
like the following fails:

 >>> mask = (names_array == "john")  # FAIL

Problems like this are now showing up in the wild [3].  Workarounds are
also showing up, like a way to easily convert from 'S' to 'U' within
astropy Tables [4], but this is really not a desirable way to go.
Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4
is a real memory and performance hit.
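
To make the cost concrete, here is a minimal sketch of the two workarounds
available today.  The array contents are made up for illustration (a
stand-in for a column read from an HDF or FITS file); only standard numpy
calls are used.

```python
import numpy as np

# Hypothetical stand-in for a text column returned by h5py, pytables,
# or astropy.io.fits: a one-byte-per-character 'S' array.
names_array = np.array([b"john", b"mary", b"john"], dtype="S4")

# Workaround 1: compare against a bytes literal.  This works on
# Python 3, but every string constant in user code must become b"...".
mask_bytes = names_array == b"john"

# Workaround 2: decode to the 'U' dtype so comparison with str works.
# This makes a full copy that uses 4 bytes per character (UCS-4).
names_unicode = np.char.decode(names_array, "latin-1")
mask_str = names_unicode == "john"

print(names_array.dtype, names_array.nbytes)
print(names_unicode.dtype, names_unicode.nbytes)
```

The decoded copy takes four times the memory of the original, which is
what makes this approach painful for gigabyte-sized arrays.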

For a good top-level summary of much of the previous thread discussion, see
[5] from Chris Barker.  Condensing this down to just a few points:

- *Changing* the behavior of the existing 'S' type is going to break code
  and seems a bad idea.
- *Adding* a new dtype 's' will work and allow highly performant
  conversion from 'S' to 's' via view().
- Using the latin-1 encoding will minimize code breakage vis-a-vis what
  works in Python 2 [6].
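
The last two points can be illustrated with a short sketch.  The proposed
's' dtype does not exist, so this is not its implementation: uint8 is used
below purely as a stand-in to show that view() reinterprets a buffer
without copying, and the latin-1 part uses only the Python codec.

```python
import numpy as np

# Every possible byte value, e.g. uint8 data packed into a bytestring.
raw = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding never fails and
# encoding the result gives back exactly the original bytes.
round_trip = raw.decode("latin-1").encode("latin-1")

# A stricter encoding such as ASCII rejects bytes above 127.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# view() reinterprets the same memory with a different dtype.  uint8 is
# an assumption standing in for the proposed (nonexistent) 's' dtype.
names_S = np.array([b"john", b"mary"], dtype="S4")
as_uint8 = names_S.view(np.uint8)
shares = np.shares_memory(names_S, as_uint8)

print(round_trip == raw, ascii_ok, shares)
```

Because no bytes are moved or validated, a view-based conversion would
cost essentially nothing regardless of array size, and because latin-1
accepts every byte value it cannot fail on existing data.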

Using latin-1 is a pragmatic compromise that provides continuity to allow
scientists to run their existing code in Python 3 and have things just
work.  It isn't perfect and it should not be the end of the story, but it
would be good.  This single issue is the *only* thing blocking me and my
team from using Python 3 in operations.

As a final point, I don't know the numpy internals at all, but it *seems*
like this proposal is one of the easiest to implement amongst those that
were discussed.

Cheers,
Tom

[1]: http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
[2]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
[3]: https://github.com/astropy/astropy/issues/3311
[4]: http://astropy.readthedocs.org/en/latest/api/astropy.table.Table.html#astropy.table.Table.convert_bytestring_to_unicode
[5]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070631.html
[6]: It is not uncommon to store uint8 data in a bytestring