Hi Chris,

> Actually, I agree about the truncation issue, but it's a question of where
> to put it -- I'm suggesting that I don't want it at the python<->numpy
> interface.

Yes, that's a good point. Of course, by using Latin-1 rather than
UTF-8 we can't support all Unicode code points (hence the "?"
replacement possible on read from HDF5).
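
To make that concrete, here is roughly the kind of narrowing I have
in mind (a toy sketch in plain Python, nothing h5py-specific):

    text = "café ✓"                    # "✓" (U+2713) has no Latin-1 encoding
    narrowed = text.encode("latin-1", errors="replace")
    print(narrowed)                    # b'caf\xe9 ?'
    print(narrowed.decode("latin-1"))  # 'café ?'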

> do vlen strings support full unicode? -- then, yes, that's good.

Yes, they do. It's somewhat unfortunate to immediately cast to vlen
though, since people usually have fixed-width datasets to start with
for efficiency reasons...
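
For reference, the vlen route looks something like this (file and
dataset names are made up; special_dtype is the spelling current h5py
releases use):

    import h5py

    # vlen=str gives variable-length UTF-8 strings, so any Unicode
    # code point survives the round trip.
    with h5py.File("example.h5", "w") as f:
        dt = h5py.special_dtype(vlen=str)
        ds = f.create_dataset("names", shape=(2,), dtype=dt)
        ds[0] = "café ✓"
        print(ds[0])   # comes back as str, full Unicode intact

The downside is the one above: the fixed-width layout of the original
dataset is gone.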

> what about reading from fixed-width UTF-8 to 'U' -- that seems like the
> natural way to go for unicode. Though a bit hard to know how long U needs to
> be -- but <= the length of the utf-8 array (in characters).

Space concerns ("U" has a 4x space penalty for ASCII-ish data). Plus,
for similar reasons to this discussion, creating "U" datasets is
unsupported at the moment.
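
The 4x figure is just NumPy's UCS-4 storage showing through:

    import numpy as np

    s = np.array(["spam", "eggs"], dtype="S4")   # 1 byte per character
    u = np.array(["spam", "eggs"], dtype="U4")   # 4 bytes per character
    print(s.dtype.itemsize)   # 4
    print(u.dtype.itemsize)   # 16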

> note that I'm also proposing a "bytes" dtype, which might make sense for
> grabbing utf-8 data from HDF-5. Then either h5py or the user could decode to
> a unicode type.

Sounds quite like the existing 'S' type.

>> In any case, I can say that the lack of a text 'S' type in NumPy has
>> been a significant pain point for h5py users on Python 3 over the
>> years.
>
> isn't the current 'S' a pretty good map to hdf ascii?

Yes; in fact, right now all fixed-width strings in h5py (ASCII and
UTF-8) are read/written as 'S'. The problem is that on Py3, 'S' is
treated as bytes, not text, so you can't freely mix it with str.
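
If it helps, this is the failure mode users hit on Py3 (a sketch, not
from a real file, but it's roughly what h5py hands back):

    import numpy as np

    a = np.array(["spam"], dtype="S4")
    print(a[0])                   # b'spam' -- bytes, not text
    print(a[0] == "spam")         # False: bytes never equal str on Py3
    print(a[0].decode("ascii"))   # users must decode by hand: 'spam'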

I am about to leave for the weekend... thanks for a great discussion!

To conclude, it strikes me that in choosing an encoding we get to pick
at most two of the following:

1. Support for all Unicode characters
2. Fixed number of characters
3. Fixed number of storage bytes

At this point, I would vote for UTF-8 in a fixed-width buffer (1/3),
but I imagine as this progresses towards a NEP others will weigh in.
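
For what it's worth, a toy illustration of why picking 1/3 means
giving up property 2 -- the character count a fixed byte budget holds
depends on the text:

    budget = 8                    # bytes available in the buffer
    for s in ["abcdefgh", "éééé", "✓✓✓"]:
        b = s.encode("utf-8")
        print(len(s), "chars ->", len(b), "bytes; fits:", len(b) <= budget)
    # 8 chars -> 8 bytes; fits: True
    # 4 chars -> 8 bytes; fits: True
    # 3 chars -> 9 bytes; fits: False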