Hi,

good argument for ASCII, but UTF-8 is a bad idea, as there is no 1:1 correspondence between length of string in bytes and length in characters -- as numpy needs to pre-allocate a defined number of bytes for a dtype, there is a disconnect between the user and numpy as to how long a string is being stored... this isn't a problem for immutable strings, and less of a problem for HDF, as you can determine how many bytes you need before you write the file (or does HDF support var-length elements?)

There is an HDF5 variable-length type, which we currently read and write as Python str objects (using NumPy's object type). But HDF5 additionally has a fixed-storage-width UTF-8 type, so we could map to a NumPy fixed-storage-width type trivially.

When determining the HDF5 data type, unfortunately all you have to go on is the NumPy dtype... creating an HDF5 dataset is done separately from writing the data.
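To make the byte/character mismatch concrete, here is a minimal sketch using only NumPy and the standard library. The sample strings and the width of 4 are arbitrary choices for illustration; nothing here is h5py or HDF5 API.

```python
import numpy as np

# Sample strings (arbitrary): plain ASCII, one accented letter, three CJK characters.
words = ["abc", "na\u00efve", "\u65e5\u672c\u8a9e"]

# Character count versus UTF-8 encoded byte count for each string.
char_lens = [len(w) for w in words]
byte_lens = [len(w.encode("utf-8")) for w in words]

# A fixed-width bytes dtype silently drops anything past its width.
encoded = words[2].encode("utf-8")           # 9 bytes for 3 characters
stored = np.array([encoded], dtype="S4")[0]  # only the first 4 bytes survive

print(char_lens, byte_lens, bytes(stored))
```

Here the cut lands in the middle of a multi-byte sequence, so `stored` is no longer valid UTF-8 and cannot be decoded.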
"custom"? it would be an encoding operation -- which you'd need to go from utf-8 to/from unicode anyway. So you would lose the ability to have a nice 1:1 binary representation map between numpy and HDF... good argument for ASCII, I guess. Or for HDF to use latin-1 ;-)
"Custom" in this context means a user-created HDF5 data-conversion filter, which is necessary since all data conversion is handled inside the HDF5 library. We've written several for things like the NumPy bool type, etc:

https://github.com/h5py/h5py/blob/master/h5py/_conv.pyx

As far as generic Unicode goes, we currently don't support the NumPy "U" dtype in h5py for similar reasons; there's no destination type in HDF5 which (1) would preserve the dtype for round-trip write/read operations and (2) doesn't risk truncation. A Latin-1 based 'a' type would have similar problems.
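The truncation risk can be illustrated with NumPy alone. This is a rough sketch, not how h5py maps types: it compares the UTF-8 width a particular array actually needs with the width one would have to reserve knowing only the "U" dtype. The array contents are made up for the example.

```python
import numpy as np

arr = np.array(["abc", "\u65e5\u672c\u8a9e"], dtype="U3")

# Width needed for this particular data; it depends on the contents.
utf8 = np.char.encode(arr, "utf-8")
needed_width = utf8.dtype.itemsize

# Width to reserve when only the dtype is known: NumPy's 'U' type stores
# 4 bytes per character, and UTF-8 needs at most 4 bytes per code point.
n_chars = arr.dtype.itemsize // 4
worst_case_width = 4 * n_chars

# Decoding the encoded bytes recovers the original strings.
decoded = np.char.decode(utf8, "utf-8")

print(needed_width, worst_case_width)
```

Any fixed UTF-8 width below the worst case can truncate some valid value of the dtype, while reserving the worst case gives a byte width that no longer says how many characters were intended.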
Does HDF enforce ascii-only? what does it do with the > 127 values?
Unfortunately/fortunately the charset is not enforced for either ASCII or UTF-8, although the HDF Group has been thinking about it.
that's the whole issue with UTF-8 -- it needs to be addressed somewhere, and the numpy-HDF interface seems like a smarter place to put it than the numpy-user interface!
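As a sketch of what handling this at the interface layer could look like, the helper below encodes a NumPy string array into fixed-width UTF-8 bytes and refuses to truncate. The name `encode_fixed_utf8` and its signature are hypothetical, invented for this example; it is not part of NumPy, h5py, or HDF5.

```python
import numpy as np

def encode_fixed_utf8(arr, width):
    """Hypothetical helper: encode str data to fixed-width UTF-8 bytes.

    Raises ValueError instead of silently truncating.
    """
    encoded = np.char.encode(np.asarray(arr), "utf-8")
    # The itemsize of the encoded array is the longest encoded length in bytes.
    longest = encoded.dtype.itemsize
    if longest > width:
        raise ValueError(
            f"longest value needs {longest} bytes, but only {width} are available"
        )
    # Shorter values are padded with NUL bytes by the fixed-width dtype.
    return encoded.astype(f"S{width}")

fixed = encode_fixed_utf8(["abc", "na\u00efve"], 8)
print(fixed.dtype, fixed)
```

The check happens once, at the boundary, so the user-facing array can stay a plain Unicode array and the byte-width bookkeeping stays out of sight.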
I agree fixed-storage-width UTF-8 is likely too complex to use as a native NumPy type. Ideally, NumPy would support variable-length strings, in which case all these headaches would go away. But I imagine that's also somewhat complicated. :)

Andrew