[Numpy-discussion] using loadtxt to load a text file in to a numpy array
andrew.collette at gmail.com
Wed Jan 22 12:45:56 EST 2014
> Is it fair to say that people should really be using vlen utf-8 strings for
> text? Is it problematic because of the need to interface with non-Python
> libraries using the same hdf5 file?
The general recommendation has been to use fixed-width strings for
exactly that reason; Fortran programs can't handle vlens, and older
versions of IDL would refuse to deal with anything labelled utf-8.
>> > This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary
>> > encoding dtype, anyway).
> That's what I was thinking. A ragged utf-8 array could map to an array of vlen
> strings. Or am I misunderstanding how hdf5 works?
Yes, that's exactly how HDF5 works for this; at the moment, we handle
vlens with the NumPy object ("O") type storing regular Python strings.
A native variable-length NumPy equivalent would also be appreciated,
although I suspect it's a lot of work.
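As a minimal sketch of what this looks like today on the NumPy side (the array construction here is plain NumPy, not h5py-specific): vlen string data comes back as an object ("O") array whose elements are ordinary Python strings, so each element can have a different length.

```python
import numpy as np

# Object-dtype array holding variable-length strings -- the in-memory
# representation h5py currently uses for HDF5 vlen string datasets.
ragged = np.array(["a", "longer string", "mid"], dtype=object)

print(ragged.dtype)              # object
print([len(s) for s in ragged])  # each element keeps its own length
```

The trade-off is that the strings live as separate Python objects on the heap rather than in one contiguous buffer, which is why a native variable-length dtype would be substantial work.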
> Truncating utf-8 is never a good idea. Throwing an error message when it would
> truncate is okay though. Presumably you already do this when someone tries to
> assign an ASCII string that's too long right?
We advertise that HDF5 datasets work identically (as closely as
practical) to NumPy arrays; in this case, NumPy truncates and doesn't
warn, so we do the same.
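The NumPy behavior being matched here is easy to demonstrate: assigning an over-long string into a fixed-width string array drops the tail without any warning or error.

```python
import numpy as np

# Assign 8 bytes into a 5-byte fixed-width slot: NumPy truncates
# silently -- no exception, no warning.
a = np.zeros(1, dtype="S5")
a[0] = b"abcdefgh"
print(a[0])  # b'abcde' -- the last three bytes are gone
```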
The concern with "U" is more that someone would write a "U10" string
into a 10-byte HDF5 buffer and lose data, even though the advertised
widths were the same. As an observation, a pure-ASCII NumPy type like
the proposed "s" would avoid that completely. With a latin-1 type, it
could still happen as certain characters would become 2 UTF-8 bytes.
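The width mismatch is straightforward to see in plain Python/NumPy: a "U10" value holds 10 code points, but its UTF-8 encoding can be up to 40 bytes, so a 10-byte HDF5 buffer is not enough. With latin-1 the gap is smaller but still real for characters above U+007F.

```python
import numpy as np

# Ten copies of "é": fits a U10 dtype (10 code points, 4 bytes each
# in NumPy's UCS-4 storage), but needs 20 bytes in UTF-8.
s = "é" * 10
a = np.array(s)
print(a.dtype.itemsize)          # 40: UCS-4 storage for 10 code points
print(len(s.encode("utf-8")))    # 20: each é is 2 UTF-8 bytes
print(len(s.encode("latin-1")))  # 10: latin-1 matches the code-point count
```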