[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Oscar Benjamin oscar.j.benjamin at gmail.com
Wed Jan 22 05:46:49 EST 2014


On Tue, Jan 21, 2014 at 06:54:33PM -0700, Andrew Collette wrote:
> Hi Chris,
> 
> > it looks from here:
> > http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html
> >
> > that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
> > lot of calls to encode/decode -- which could be pretty slow, compared to
> > other ways to dump numpy arrays into HDF-5 -- that may be what you mean by
> > "doesn't round trip".
> 
> HDF5 does have variable-length string support for UTF-8, so we map
> that directly to the unicode type (str on Py3) exactly as you
> describe, by encoding when we write to the file. But there's no way
> to round-trip with *fixed-width* strings.  You can go from e.g. a 10
> byte ASCII string to "U10", but going the other way fails if there are
> characters which take more than 1 byte to represent.  We don't always
> get to choose the destination type, when e.g. writing into an existing
> dataset, so we can't always write vlen strings.

Is it fair to say that people should really be using vlen utf-8 strings for
text? Is it problematic because of the need to interface with non-Python
libraries using the same hdf5 file?
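
To make the fixed-width problem concrete (plain Python, no h5py needed):
a string that fits in numpy's 'U8' can need more than 8 bytes once encoded,
so no fixed byte width is safe in general:

>>> s = 'ångström'          # 8 characters, fits in 'U8'
>>> len(s.encode('utf-8'))  # but needs 10 bytes as utf-8
10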

> > This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary
> > encoding dtype, anyway).

That's what I was thinking. A ragged utf-8 array could map to an array of vlen
strings. Or am I misunderstanding how hdf5 works?

Looking here:
http://www.h5py.org/docs/topics/special.html

'''
HDF5 supports a few types which have no direct NumPy equivalent.
Among the most useful and widely used are variable-length (VL) types, and
enumerated types. As of version 1.2, h5py fully supports HDF5 enums, and has
partial support for VL types.
'''

So that seems to suggest that h5py already has a use for a variable-length
string dtype.
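
If I'm reading those docs right, the vlen machinery is exposed through
h5py's special_dtype, so storing unicode text would look something like
this (an untested sketch; the file and dataset names are made up):

>>> import h5py
>>> dt = h5py.special_dtype(vlen=str)   # vlen unicode, stored as utf-8
>>> f = h5py.File('strings.h5', 'w')
>>> ds = f.create_dataset('text', (2,), dtype=dt)
>>> ds[0] = 'ångström'                  # no fixed width to overflow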

BTW, just as the fixed-width 'S' dtype doesn't really work for str in
Python 3, it's also a poor fit for bytes, since it strips trailing nulls:

>>> import numpy as np
>>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
>>> a
array([b'a\x00s', b'qwert'], 
      dtype='|S5')
>>> a[0]
b'a\x00s'
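
So the bytes you put in aren't the bytes you get back:

>>> a[0] == b'a\0s\0'
False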

> > But: How does hdf handle the fact that utf-8 is not a fixed length encoding?
> 
> With fixed-width strings it doesn't, really.  If you use vlen strings
> it's fine, but otherwise there's just a fixed-width buffer labelled
> "UTF-8".  Presumably you're supposed to be careful when writing not to
> chop the string off in the middle of a multibyte character.  We could
> truncate strings on their way to the file, but the risk of data
> loss/corruption led us to simply not support it at all.

Truncating utf-8 is never a good idea. Raising an error when it would
truncate is okay, though. Presumably you already do this when someone tries
to assign an ASCII string that's too long, right?
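
For what it's worth, numpy itself silently truncates over-long assignments
to fixed-width 'S' arrays, which is arguably the same trap:

>>> a = np.zeros(1, dtype='S3')
>>> a[0] = b'abcdef'
>>> a
array([b'abc'], 
      dtype='|S3')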


Oscar