Re: [Numpy-discussion] String type again.

July 15, 2014

Hi,

...
good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1
correspondence between length of string in bytes and length in characters
-- as numpy needs to pre-allocate a defined number of bytes for a dtype,
there is a disconnect between the user and numpy as to how long a string is
being stored... this isn't a problem for immutable strings, and less of a
problem for HDF, as you can determine how many bytes you need before you
write the file (or does HDF support var-length elements?)
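
To make the byte/character mismatch concrete, here is a minimal sketch in plain Python. The helper name utf8_width and the sample strings are illustrative assumptions, not part of NumPy or h5py.

```python
# Compare character counts with UTF-8 byte counts for a few sample strings.
# The helper name and the samples are illustrative assumptions.

def utf8_width(text):
    """Return the number of bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))

samples = ["numpy", "caf\u00e9", "\u65e5\u672c\u8a9e"]

for s in samples:
    # len(s) counts code points; utf8_width(s) counts stored bytes.
    print(repr(s), len(s), utf8_width(s))
```

The ASCII-only string needs one byte per character, while the other two need more bytes than they have characters, so a fixed byte width does not tell the user how many characters will fit.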

There is an HDF5 variable-length type, which we currently read and
write as Python str objects (using NumPy's object type).  But HDF5
additionally has a fixed-storage-width UTF-8 type, so we could map to a
NumPy fixed-storage-width type trivially.

When determining the HDF5 data type, unfortunately all you have to go
on is the NumPy dtype... creating an HDF5 dataset is done separately
from writing the data.
...
"custom"? it would be an encoding operation -- which you'd need to go from
utf-8 to/from unicode anyway. So you would lose the ability to have a nice
1:1 binary representation map between numpy and HDF... good argument for
ASCII, I guess. Or for HDF to use latin-1 ;-)

"Custom" in this context means a user-created HDF5 data-conversion
filter, which is necessary since all data conversion is handled inside
the HDF5 library.  We've written several for things like the NumPy
bool type, etc:

https://github.com/h5py/h5py/blob/master/h5py/_conv.pyx

As far as generic Unicode goes, we currently don't support the NumPy
"U" dtype in h5py for similar reasons; there's no destination type in
HDF5 which (1) would preserve the dtype for round-trip write/read
operations and (2) doesn't risk truncation.  A Latin-1 based 'a' type
would have similar problems.
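
As a rough illustration of the truncation risk, the sketch below cuts UTF-8 bytes to a fixed width in plain Python. It does not use h5py or HDF5. The helper truncate_utf8 is hypothetical, and reading the Latin-1 point as a width mismatch is one interpretation of the paragraph above.

```python
def truncate_utf8(text, nbytes):
    """Encode to UTF-8, keep at most nbytes, and drop any partial character."""
    raw = text.encode("utf-8")[:nbytes]
    # The input was valid UTF-8, so only the tail can be broken by the cut;
    # errors="ignore" discards that incomplete multi-byte sequence.
    return raw.decode("utf-8", errors="ignore")

word = "caf\u00e9"                  # 4 characters, 5 bytes in UTF-8
naive = word.encode("utf-8")[:4]    # a 4-byte field splits the last character

try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False                # the cut left invalid UTF-8 behind

# The same character is 1 byte in Latin-1 but 2 bytes in UTF-8, so an
# N-byte Latin-1 field does not map onto an N-byte UTF-8 field.
latin1_len = len("\u00e9".encode("latin-1"))
utf8_len = len("\u00e9".encode("utf-8"))

print(naive, naive_ok, truncate_utf8(word, 4), latin1_len, utf8_len)
```

Either way the stored value no longer matches what was written, which is the round-trip problem described above.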
...
Does HDF enforce ascii-only? what does it do with the > 127 values?

Unfortunately/fortunately the charset is not enforced for either ASCII
or UTF-8, although the HDF Group has been thinking about it.
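
Since the declared charset is not enforced, a reader may want to validate the bytes itself. The following is only a sketch of such a check in plain Python. The function matches_declared_charset is a hypothetical helper, not something HDF5 or h5py provides.

```python
def matches_declared_charset(raw, declared):
    """Return True if `raw` decodes cleanly under the declared charset."""
    try:
        raw.decode(declared)
    except UnicodeDecodeError:
        return False
    return True

plain = b"cafe"             # pure ASCII
latin1 = b"caf\xe9"         # Latin-1 bytes, one value above 127
utf8 = b"caf\xc3\xa9"       # the same text encoded as UTF-8

for raw in (plain, latin1, utf8):
    print(raw,
          matches_declared_charset(raw, "ascii"),
          matches_declared_charset(raw, "utf-8"))
```

Values above 127 fail the ASCII check, and Latin-1 bytes stored in a field tagged as UTF-8 fail the UTF-8 check, so the tag alone cannot be trusted.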
...
that's the whole issue with UTF-8 -- it needs to be addressed somewhere,
and the numpy-HDF interface seems like a smarter place to put it than the
numpy-user interface!

I agree fixed-storage-width UTF-8 is likely too complex to use as a
native NumPy type.  Ideally, NumPy would support variable-length
strings, in which case all these headaches would go away.  But I
imagine that's also somewhat complicated. :)

Andrew

Andrew Collette