[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Tue Jan 21 20:54:33 EST 2014

Hi Chris,

> it looks from here:
> http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html
>
> that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
> lot of calls to encode/decode -- which could be pretty slow, compared to
> other ways to dump numpy arrays into HDF-5 -- that may be what you mean by
> "doesn't round trip".

HDF5 does have variable-length string support for UTF-8, so we map
that directly to the unicode type (str on Py3) exactly as you
describe, by encoding when we write to the file.  But there's no way
to round-trip with *fixed-width* strings.  You can go from e.g. a
10-byte ASCII string to "U10", but going the other way fails if any
character takes more than 1 byte to represent.  We don't always
get to choose the destination type, when e.g. writing into an existing
dataset, so we can't always write vlen strings.
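
A minimal sketch of the size problem in plain Python. The helper name
fits_fixed_width is made up for illustration and is not part of h5py
or HDF5; it only checks whether the UTF-8 encoding of a string fits
in a fixed-width field of a given byte size.

```python
def fits_fixed_width(text, nbytes):
    """Return True if the UTF-8 encoding of text fits in nbytes bytes."""
    return len(text.encode("utf-8")) <= nbytes


ascii_text = "abcdefghij"      # 10 characters, 1 byte each in UTF-8
accented = "\u00e9" * 10       # 10 characters, 2 bytes each in UTF-8

print(fits_fixed_width(ascii_text, 10))   # True
print(fits_fixed_width(accented, 10))     # False: needs 20 bytes
```

So a "U10" value can need more than 10 bytes once encoded, which is
why the reverse direction cannot be guaranteed.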

> This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary
> encoding dtype, anyway).
> But: How does hdf handle the fact that utf-8 is not a fixed length encoding?

With fixed-width strings it doesn't, really.  If you use vlen strings
it's fine, but otherwise there's just a fixed-width buffer labelled
"UTF-8".  Presumably you're supposed to be careful when writing not to
chop the string off in the middle of a multibyte character.  We could
truncate strings on their way to the file, but the risk of data
loss/corruption led us to simply not support it at all.
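
To illustrate the truncation hazard, here is a rough sketch in plain
Python. The function truncate_utf8 is hypothetical (it is not what
h5py does; as said above, we don't truncate at all). It cuts the
encoded bytes at the field width and then drops any incomplete
trailing character, which avoids invalid UTF-8 but silently loses
data.

```python
def truncate_utf8(text, nbytes):
    """Encode text as UTF-8 and cut it to at most nbytes bytes
    without leaving a partial multibyte character at the end."""
    raw = text.encode("utf-8")[:nbytes]
    # The input was valid UTF-8, so only the tail can be broken;
    # errors="ignore" drops that incomplete trailing sequence.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


word = "caf\u00e9"                   # 4 characters, 5 bytes in UTF-8
naive = word.encode("utf-8")[:4]     # ends in the middle of the last character

try:
    naive.decode("utf-8")
except UnicodeDecodeError:
    print("naive cut is not valid UTF-8")

print(truncate_utf8(word, 4))        # b'caf' -- the last character is lost
```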

> hmm -- ascii does have those advantages, but I'm not sure it's worth the
> restriction on what can be encoded. But you're quite right, you could dump
> ascii straight into something expecting utf-8, whereas you could not do
> that with latin-1, for instance. But you can't go the other way -- does it
> help much to avoid encoding in one direction?

It would help for h5py specifically because most HDF5 strings are
labelled "ASCII".  But it's a question for the community which is more
important: the high-bit characters in latin-1, or write-compatibility
with UTF-8.
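
The trade-off can be seen with a small plain-Python sketch (no h5py
or numpy involved; the variable names are just for illustration).
ASCII bytes are already valid UTF-8, while a latin-1 byte with the
high bit set is not.

```python
ascii_bytes = "abc".encode("ascii")
latin1_bytes = "\u00e9".encode("latin-1")   # single byte 0xE9

# ASCII can be handed to a UTF-8 consumer unchanged.
print(ascii_bytes.decode("utf-8"))          # abc

# A latin-1 high-bit byte is not a valid UTF-8 sequence on its own.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_utf8 = True
except UnicodeDecodeError:
    latin1_is_utf8 = False

print(latin1_is_utf8)                       # False
```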

Andrew