On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette <andrew.collette@gmail.com> wrote:
> For storing data in HDF5 (PyTables or h5py), it would be somewhat cleaner if either ASCII or UTF-8 is used, as these are the only two charsets officially supported by the library.
Good argument for ASCII, but UTF-8 is a bad idea: there is no 1:1 correspondence between the length of a string in bytes and its length in characters. Since numpy needs to pre-allocate a fixed number of bytes for a dtype, there is a disconnect between the user and numpy about how long a string can actually be stored... This isn't a problem for immutable strings, and it's less of a problem for HDF, since you can determine how many bytes you need before you write the file (or does HDF support variable-length elements?).
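To make the bytes-vs-characters disconnect concrete, here is a minimal sketch -- plain Python/numpy, nothing HDF-specific, and the string value is purely illustrative:

    import numpy as np

    s = "résumé"                       # 6 characters...
    b = s.encode("utf-8")
    print(len(s), len(b))              # -> 6 8: ...but 8 bytes in UTF-8

    # numpy allocates a fixed byte width per element, so "6 characters"
    # and "6 bytes" quietly mean different things to the user and to numpy:
    arr = np.array([b], dtype="S6")    # 6-byte fixed-width field
    print(arr[0])                      # b'r\xc3\xa9sum' -- last character lost
    print(arr[0].decode("utf-8"))      # 'résum', silently shortened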
> Latin-1 would require a custom read/write converter, which isn't the end of the world
"custom"? it would be an encoding operation -- which you'd need to go from utf-8 to/from unicode anyway. So you would lose the ability to have a nice 1:1 binary representation map between numpy and HDF... good argument for ASCII, I guess. Or for HDF to use latin-1 ;-) Does HDF enforce ascii-only? what does it do with the > 127 values?
> would be tricky to do in a correct way, and likely somewhat slow. We'd also run into truncation issues since certain latin-1 chars become multibyte sequences in UTF-8.
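That truncation issue is easy to demonstrate: a string that fits exactly in N bytes of latin-1 can need more than N bytes in UTF-8, and chopping it back to N bytes can split a multibyte sequence. A small sketch with an illustrative value:

    s = "café"                           # 4 characters
    print(len(s.encode("latin-1")))      # -> 4: fits a 4-byte field exactly

    utf8 = s.encode("utf-8")
    print(len(utf8))                     # -> 5: the 'é' became two bytes

    # Naively chopping back to the 4-byte field splits the 'é' in half:
    chopped = utf8[:4]                   # b'caf\xc3' -- a dangling lead byte
    try:
        chopped.decode("utf-8")
    except UnicodeDecodeError:
        print("truncated mid-character: no longer valid UTF-8")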
That's the whole issue with UTF-8 -- it needs to be addressed somewhere, and the numpy-HDF interface seems like a smarter place to put it than the numpy-user interface! I assume 'a' strings would still be null-padded? yup.

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov