[Numpy-discussion] String type again.

Chris Barker chris.barker at noaa.gov
Fri Jul 18 16:44:39 EDT 2014


On Fri, Jul 18, 2014 at 12:52 PM, Andrew Collette <andrew.collette at gmail.com> wrote:

> > What it would do is push the problem from the HDF5<->numpy interface to
> > the python<->numpy interface.
> >
> > I'm not sure that's a good trade off.
>
> Maybe I'm being too paranoid about the truncation issue.


Actually, I agree about the truncation issue, but it's a question of where
to put it -- I'm suggesting that I don't want it at the python<->numpy
interface.


> Here's a strawman for how a Latin-1 "a" type might be handled in h5py:
>
> 1. Creation from existing "a" data: Use vlen strings.  Doesn't
> preserve the dtype, but maybe that's not so important.
>

do vlen strings support full unicode? If so, then yes, that's good.


> 2. Writing from "a" data to fixed-width ASCII: Copy, and replace
> bytes>127 with "?" (or don't)
>

I'd vote for don't, unless HDF starts enforcing pure ascii. But if it does,
then yes, replacement makes more sense than exceptions.
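For concreteness, here is a minimal sketch of what the "replace" option in step 2 could look like on a single value, in plain Python. The function name `latin1_to_ascii` is made up for illustration; it is not an h5py or NumPy API, and the real conversion would live in an HDF5 conversion callback.

```python
def latin1_to_ascii(data: bytes, replacement: bytes = b"?") -> bytes:
    """Return a copy of `data` with every byte above 127 replaced.

    Hypothetical helper: Latin-1 is one byte per character, so the
    output always has the same length as the input.
    """
    if len(replacement) != 1 or replacement[0] > 127:
        raise ValueError("replacement must be a single ASCII byte")
    return bytes(b if b < 128 else replacement[0] for b in data)


sample = b"caf\xe9"              # "cafe" with e-acute, encoded as Latin-1
print(latin1_to_ascii(sample))   # b'caf?'
```

The "don't" option is simply a straight copy of the bytes, which keeps the data intact but leaves non-ASCII bytes in a field labelled ASCII.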

> 3. Writing from "a" data to fixed-width UTF-8: Transcode and truncate
> (being careful not to end in the middle of a multibyte character)
>

yup -- buyer beware.
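A rough sketch of the "transcode and truncate" step, to show what being careful about multibyte characters involves. The name `latin1_to_fixed_utf8` is hypothetical, and null padding of short values is an assumption made for the example, not something settled in this thread.

```python
def latin1_to_fixed_utf8(data: bytes, width: int) -> bytes:
    """Transcode Latin-1 bytes to UTF-8 and fit them into `width` bytes.

    Hypothetical helper. Short results are null-padded (an assumption);
    long results are cut at a character boundary, never inside one.
    """
    encoded = data.decode("latin-1").encode("utf-8")
    if len(encoded) <= width:
        return encoded.ljust(width, b"\x00")
    cut = width
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first byte
    # we would drop is a continuation byte, the cut splits a character,
    # so back up until the cut sits on a character boundary.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].ljust(width, b"\x00")


# b"caf\xe9!" becomes b"caf\xc3\xa9!" in UTF-8 (6 bytes).
print(latin1_to_fixed_utf8(b"caf\xe9!", 4))   # b'caf\x00'
print(latin1_to_fixed_utf8(b"caf\xe9!", 5))   # b'caf\xc3\xa9'
```

Note that a 4-byte field loses the accented character entirely rather than storing half of it, which is the "buyer beware" part.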


> 4. Reading from fixed-width ASCII to "a": Straight copy, no inspection
>

yup.


> 5. Reading from fixed-width UTF-8 to "a": Copy, and replace
> non-Latin-1 chars with "?"
>

sure
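The read direction in step 5 could be sketched like this in plain Python. `fixed_utf8_to_latin1` is a made-up name, and stripping trailing null padding is an assumption about how the fixed-width field is stored.

```python
def fixed_utf8_to_latin1(data: bytes) -> bytes:
    """Convert one fixed-width UTF-8 value to Latin-1 without raising.

    Hypothetical helper. Characters outside Latin-1 become "?", and so
    do invalid UTF-8 sequences, so the conversion never fails midway.
    """
    # Assumption: the field is null-padded, so drop trailing nulls.
    text = data.rstrip(b"\x00").decode("utf-8", errors="replace")
    return text.encode("latin-1", errors="replace")


value = "na\u00efve \u20ac5".encode("utf-8")   # i-diaeresis and a euro sign
print(fixed_utf8_to_latin1(value))             # b'na\xefve ?5'
```

The i-diaeresis survives because it exists in Latin-1; the euro sign does not, so it is replaced.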

what about reading from fixed-width UTF-8 to 'U'? That seems like the
natural way to go for unicode. Though it's a bit hard to know how long the
'U' needs to be, it is at most the length of the utf-8 field in bytes,
since every character takes at least one byte in utf-8.
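To illustrate the length bound for the UTF-8 to 'U' case: a field of N bytes can never decode to more than N characters, so N is always a safe 'U' width, and the exact width is only known after decoding. A small sketch in plain Python, with a made-up function name and assuming null-padded fields:

```python
def decode_fixed_utf8(values, width):
    """Decode fixed-width UTF-8 fields and report the 'U' width needed.

    Hypothetical helper. Returns (texts, needed), where `needed` is the
    longest decoded string in characters; it can never exceed `width`.
    """
    texts = []
    for raw in values:
        if len(raw) > width:
            raise ValueError("value is wider than the declared field")
        # Assumption: short values are null-padded to the field width.
        texts.append(raw.rstrip(b"\x00").decode("utf-8"))
    needed = max((len(t) for t in texts), default=0)
    return texts, needed


fields = [b"abc\x00\x00\x00", "h\u00e9llo".encode("utf-8")]  # both 6 bytes
texts, needed = decode_fixed_utf8(fields, 6)
print(needed)   # 5
```

So allocating 'U6' for a 6-byte field would always be safe here, while 'U5' is what the data actually needs.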


> (The above example uses replacement rather than raising an exception,
> because an exception in the HDF5 conversion callback will leave the
> write/read half-completed).
>

and really -- what would you do with an exception on read? give up and
throw the file away?

note that I'm also proposing a "bytes" dtype, which might make sense for
grabbing utf-8 data from HDF-5. Then either h5py or the user could decode
to a unicode type.

> In any case, I can say that the lack of a text 'S' type in NumPy has
> been a significant pain point for h5py users on Python 3 over the
> years.


isn't the current 'S'  a pretty good map to hdf ascii?

> Whatever specific encoding ends up being used, such a type can
> only improve the situation, and I'm firmly in favor of it.


agreed.

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov