[Numpy-discussion] String type again.
Chris Barker
chris.barker at noaa.gov
Fri Jul 18 16:44:39 EDT 2014
On Fri, Jul 18, 2014 at 12:52 PM, Andrew Collette <andrew.collette at gmail.com> wrote:
> > What it would do is push the problem from the HDF5<->numpy interface
> > to the python<->numpy interface.
> >
> > I'm not sure that's a good trade off.
>
> Maybe I'm being too paranoid about the truncation issue.
Actually, I agree about the truncation issue, but it's a question of where
to put it -- I'm suggesting that I don't want it at the python<->numpy
interface.
> Here's a strawman for how a Latin-1 "a" type might be handled in h5py:
>
> 1. Creation from existing "a" data: Use vlen strings. Doesn't
> preserve the dtype, but maybe that's not so important.
>
Do vlen strings support full Unicode? If so, then yes, that's good.
> 2. Writing from "a" data to fixed-width ASCII: Copy, and replace
> bytes>127 with "?" (or don't)
>
I'd vote for don't, unless HDF starts enforcing pure ascii. But if it does,
then yes, replacement makes more sense than exceptions.
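
For concreteness, the replacement option in step 2 could look something like the sketch below. It is plain Python on a bytes object, not anything h5py does; the function name latin1_to_ascii_replace is made up for illustration.

```python
def latin1_to_ascii_replace(data: bytes) -> bytes:
    """Replace every byte above 127 with b'?'. The length is unchanged."""
    return bytes(b if b < 128 else ord("?") for b in data)


sample = "café".encode("latin-1")      # 4 bytes; the last one is 0xE9
ascii_safe = latin1_to_ascii_replace(sample)
```

Because Latin-1 is one byte per character, the replacement never changes the field width, which is what makes it cheap for a fixed-width target.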
> 3. Writing from "a" data to fixed-width UTF-8: Transcode and truncate
> (being careful not to end in the middle of a multibyte character)
>
yup -- buyer beware.
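
A rough sketch of the "transcode and truncate" step, assuming the input has already been decoded to a Python str. The helper name truncate_utf8 is hypothetical. It relies only on the fact that UTF-8 continuation bytes all have the bit pattern 10xxxxxx.

```python
def truncate_utf8(text: str, width: int) -> bytes:
    """Encode text as UTF-8 and cut it to at most `width` bytes
    without ending in the middle of a multibyte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= width:
        return encoded
    cut = width
    # If the first dropped byte is a continuation byte, the cut falls inside
    # a character: back up until the cut sits on a character's lead byte.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


short = truncate_utf8("café", 4)   # "é" needs 2 bytes, so it is dropped whole
```

This only protects code point boundaries. A base character followed by a combining mark can still be separated, so "buyer beware" still applies.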
> 4. Reading from fixed-width ASCII to "a": Straight copy, no inspection
>
yup.
> 5. Reading from fixed-width UTF-8 to "a": Copy, and replace
> non-Latin-1 chars with "?"
>
sure
What about reading from fixed-width UTF-8 to 'U'? That seems like the
natural way to go for Unicode. Though it's a bit hard to know how long the
'U' needs to be, its length in characters is at most the length of the
UTF-8 array in bytes.
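
To illustrate both points with plain Python (a sketch only, using the built-in codecs rather than any HDF5 conversion callback): decoding UTF-8 never yields more characters than there were bytes, and encoding to Latin-1 with errors="replace" turns each non-Latin-1 character into a single "?".

```python
raw = "naïve 日本".encode("utf-8")     # 13 bytes of UTF-8
text = raw.decode("utf-8")             # 8 characters

# Step 5: anything outside Latin-1 becomes "?".
latin1 = text.encode("latin-1", errors="replace")

# Upper bound for a 'U' field: one character per UTF-8 byte at most.
max_chars = len(raw)
```

So a 'U' field sized to the byte width of the UTF-8 field is always big enough, if often wasteful.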
> (The above example uses replacement rather than raising an exception,
> because an exception in the HDF5 conversion callback will leave the
> write/read half-completed).
>
and really -- what would you do with an exception on read? give up and
throw the file away?
note that I'm also proposing a "bytes" dtype, which might make sense for
grabbing utf-8 data from HDF-5. Then either h5py or the user could decode
to a unicode type.
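
Something like the following is what I have in mind. Today's 'S' dtype stands in for the proposed "bytes" dtype here, which is an assumption for illustration only, and the decode step uses the existing np.char.decode.

```python
import numpy as np

# UTF-8 bytes as they might come out of a fixed-width field.
# 'S5' is a stand-in for the proposed "bytes" dtype.
raw = np.array(["café".encode("utf-8"), b"abc"], dtype="S5")

# Either h5py or the user decodes element-wise to a unicode ('U') array.
text = np.char.decode(raw, "utf-8")
```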
> In any case, I can say that the lack of a text 'S' type in NumPy has
> been a significant pain point for h5py users on Python 3 over the
> years.
Isn't the current 'S' a pretty good map to HDF ASCII?
> Whatever specific encoding ends up being used, such a type can
> only improve the situation, and I'm firmly in favor of it.
agreed.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov