[Numpy-discussion] One-byte string dtype: third time's the charm?

Andrew Collette andrew.collette at gmail.com
Mon Feb 23 11:55:22 EST 2015


Hi all,

> Using latin-1 is a pragmatic compromise that provides continuity to allow
> scientists to run their existing code in Python 3 and have things just work.
> It isn't perfect and it should not be the end of the story, but it would be
> good.  This single issue is the *only* thing blocking me and my team from
> using Python 3 in operations.

Since you mentioned HDF compatibility, I would just note that the two
string formats HDF5 supports are ASCII and UTF-8, although presently
no validation is performed by HDF5 as to the actual contents.  This
shouldn't discourage anyone from going with Latin-1, but it would mean
that h5py (and presumably PyTables) would have to choose from the
following options:

1. Convert to UTF-8, and risk truncation (each non-ASCII Latin-1
   character takes two bytes in UTF-8)
2. Store as ASCII and replace out-of-range characters with "?"
3. Just store the Latin-1 text in a type labelled "ASCII", and live with it
4. Raise an exception if non-ASCII characters are present

Realistically, h5py might go with (3), since the ASCII type in HDF5 is
already much abused.
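
To make the four options concrete, here is a minimal sketch in plain
Python of what each one would do to a Latin-1 byte string. The function
names and the fixed `width` parameter are hypothetical, chosen only for
illustration; none of this is h5py, PyTables, or HDF5 API, and it
assumes the data lands in a fixed-width field.

```python
# Hypothetical helpers illustrating the four storage options.
# These are NOT h5py/PyTables/HDF5 functions.

def to_utf8_truncated(raw: bytes, width: int) -> bytes:
    """Option 1: re-encode Latin-1 bytes as UTF-8, then cut to `width` bytes."""
    encoded = raw.decode("latin-1").encode("utf-8")
    # Slicing bytes can drop characters or split a multi-byte sequence.
    return encoded[:width]


def to_ascii_replaced(raw: bytes) -> bytes:
    """Option 2: store as ASCII, replacing out-of-range characters with '?'."""
    return raw.decode("latin-1").encode("ascii", errors="replace")


def passthrough(raw: bytes) -> bytes:
    """Option 3: keep the Latin-1 bytes as-is and label the type 'ASCII'."""
    return raw


def to_ascii_strict(raw: bytes) -> bytes:
    """Option 4: raise UnicodeEncodeError if non-ASCII characters are present."""
    return raw.decode("latin-1").encode("ascii")


# "caf" followed by e-acute: 4 bytes in Latin-1, 5 bytes in UTF-8.
sample = "caf\u00e9".encode("latin-1")
```

With this sample and a four-byte field, option 1 yields b'caf\xc3' (a
dangling UTF-8 lead byte), option 2 yields b'caf?', option 3 leaves the
bytes untouched, and option 4 raises UnicodeEncodeError.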

Andrew


