On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette <andrew.collette@gmail.com> wrote:
> For storing data in HDF5 (PyTables or h5py), it would be somewhat cleaner if either ASCII or UTF-8 is used, as these are the only two charsets officially supported by the library.
Good argument for ASCII, but UTF-8 is a bad idea: there is no 1:1 correspondence between the length of a string in bytes and its length in characters. Since numpy needs to pre-allocate a fixed number of bytes for a dtype, there is a disconnect between the user and numpy about how long a string can actually be stored... This isn't a problem for immutable strings, and it's less of a problem for HDF, since you can determine how many bytes you need before you write the file (or does HDF support variable-length elements?).
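To make the bytes-vs-characters disconnect concrete, here is a minimal sketch -- plain Python/numpy, nothing HDF-specific, and the string value is purely illustrative:

    import numpy as np

    s = "résumé"                       # 6 characters...
    b = s.encode("utf-8")
    print(len(s), len(b))              # -> 6 8: ...but 8 bytes in UTF-8

    # numpy allocates a fixed byte width per element, so "6 characters"
    # and "6 bytes" quietly mean different things to the user and to numpy:
    arr = np.array([b], dtype="S6")    # 6-byte fixed-width field
    print(arr[0])                      # b'r\xc3\xa9sum' -- last character lost
    print(arr[0].decode("utf-8"))      # 'résum', silently shortened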
> Latin-1 would require a custom read/write converter, which isn't the end of the world
"custom"? it would be an encoding operation -- which you'd need to go from utf-8 to/from unicode anyway. So you would lose the ability to have a nice 1:1 binary representation map between numpy and HDF... good argument for ASCII, I guess. Or for HDF to use latin-1 ;-) Does HDF enforce ascii-only? what does it do with the > 127 values?
> would be tricky to do in a correct way, and likely somewhat slow. We'd also run into truncation issues since certain latin-1 chars become multibyte sequences in UTF-8.
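That truncation issue is easy to demonstrate: a string that fits exactly in N bytes of latin-1 can need more than N bytes in UTF-8, and chopping it back to N bytes can split a multibyte sequence. A small sketch with an illustrative value:

    s = "café"                           # 4 characters
    print(len(s.encode("latin-1")))      # -> 4: fits a 4-byte field exactly

    utf8 = s.encode("utf-8")
    print(len(utf8))                     # -> 5: the 'é' became two bytes

    # Naively chopping back to the 4-byte field splits the 'é' in half:
    chopped = utf8[:4]                   # b'caf\xc3' -- a dangling lead byte
    try:
        chopped.decode("utf-8")
    except UnicodeDecodeError:
        print("truncated mid-character: no longer valid UTF-8")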
That's the whole issue with UTF-8 -- it needs to be addressed somewhere, and the numpy-HDF interface seems like a smarter place to put it than the numpy-user interface! I assume 'a' strings would still be null-padded? yup.

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov