[Numpy-discussion] String type again.

Fri Jul 18 15:27:43 EDT 2014

On Fri, Jul 18, 2014 at 10:29 AM, Andrew Collette
<andrew.collette at gmail.com> wrote:

> The root of the issue is that HDF5 provides a limited set of
> fixed-storage-width string types, and a fixed-storage-width NumPy type
> of the same size using Latin-1 can't map to any of them without losing
> data.  For example, if "a10" is a hypothetical 10-byte-wide NumPy
> dtype using Latin-1, reading/writing to an "a10" HDF5 dataset backed
> with 10-byte UTF-8 storage would risk truncation, even if the
> advertised widths are the same.
>

I do get this, yes.
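
To make the width mismatch concrete, here is a minimal Python sketch. It
assumes the hypothetical 10-byte "a10" field from the example above; WIDTH
and the sample string are made up for illustration, and it uses only
str.encode, not h5py or numpy:

```python
# Hypothetical 10-byte field width, matching the "a10" example.
WIDTH = 10

# Ten copies of U+00E9 (e with acute accent): one byte each in Latin-1,
# two bytes each in UTF-8.
text = "\u00e9" * WIDTH

latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

print(len(latin1_bytes))  # 10: fills the Latin-1 field exactly
print(len(utf8_bytes))    # 20: twice the width of the UTF-8 field

# Forcing the UTF-8 form into a 10-byte field drops half of the characters.
truncated = utf8_bytes[:WIDTH].decode("utf-8")
print(len(truncated))     # 5
```

The cut happens to land on a character boundary here; with mixed content
it could just as easily split a multi-byte sequence.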

> There is unfortunately nothing we can do in the h5py code base to
> paper over this... it's a limitation of the format.

Yup. There are similar limitations in numpy.

> > This is where I wonder about HDF's "ascii" type -- is it really ascii?
> > Or is it that old standby
> > one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around
> > type? i.e. the old char*?
> >
> > In which case, you can just push a latin-1 type into and out of your HDF
> > ascii arrays and everything will work just fine. Unless someone stores
> > something other than latin-1 or ascii in it -- but even then, the bytes
> > would still be preserved.
>
> The encoding is explicitly ASCII (H5T_ASCII, in HDF5 lingo).
> Anecdotally, I've heard people store other encodings in it, but (1)
> I'm not eager to make things worse by mis-labelling data, and (2) the
> HDF Group has made indications that they may start checking the
> encoding at conversion time.  (1) is particularly important, as a
> major focus of h5py is compatibility with the rest of the HDF5
> ecosystem.
>

If it were me, I'd encourage the HDF Group NOT to enforce ascii. Just like
with the numpy 'S' type, I'm guessing there is a fair bit of code in the
wild that [ab]uses the ascii type by throwing other bytes in there. In
fact, this is one reason that utf-8 is so popular -- you can still use all
that code that simply takes a char* and passes it around (or maybe compares
it), without making any assumptions about what it means.
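
A rough Python sketch of that "pass the bytes around" pattern, and of what
strict checking would do to it. pack_fixed and unpack_fixed are
hypothetical helpers standing in for a NUL-padded char field; they are not
HDF5 or numpy API, and the 16-byte width is arbitrary:

```python
def pack_fixed(raw: bytes, width: int) -> bytes:
    """NUL-pad raw to width, like a fixed-width char field (hypothetical helper)."""
    if len(raw) > width:
        raise ValueError("value does not fit in the field")
    return raw.ljust(width, b"\x00")


def unpack_fixed(field: bytes) -> bytes:
    """Strip the trailing NUL padding again (hypothetical helper)."""
    return field.rstrip(b"\x00")


# "na\u00efve caf\u00e9" is 10 characters but 12 bytes in UTF-8.
original = "na\u00efve caf\u00e9"
field = pack_fixed(original.encode("utf-8"), 16)

# Code that only copies and compares bytes never looks at the encoding.
copied = bytes(field)
same = copied == field
recovered = unpack_fixed(copied).decode("utf-8")

# A strict ASCII check on the same field rejects the data instead.
try:
    unpack_fixed(copied).decode("ascii")
    strict_ascii_ok = True
except UnicodeDecodeError:
    strict_ascii_ok = False

print(same, recovered == original, strict_ascii_ok)
```

The bytes survive the round trip untouched; only the step that insists on
ascii fails.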

> that from this particular HDF5 perspective, they provide maximum
> compatibility and minimize the chances of accidental data loss.

What it would do is push the problem from the HDF5<->numpy interface to the
python<->numpy interface.

I'm not sure that's a good trade off.

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov