Hi Chris,

> Actually, I agree about the truncation issue, but it's a question of where
> to put it -- I'm suggesting that I don't want it at the python<->numpy
> interface.

Yes, that's a good point. Of course, by using Latin-1 rather than
UTF-8 we can't support all Unicode code points (hence the "?"
replacement possible on read from HDF5).
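
To make that concrete, here is roughly the kind of narrowing I have
in mind (a toy sketch in plain Python, nothing h5py-specific):

    text = "café ✓"                    # "✓" (U+2713) has no Latin-1 encoding
    narrowed = text.encode("latin-1", errors="replace")
    print(narrowed)                    # b'caf\xe9 ?'
    print(narrowed.decode("latin-1"))  # 'café ?'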

> do vlen strings support full unicode? -- then, yes, that's good.

Yes, they do. It's somewhat unfortunate to immediately cast to vlen
though, since people usually have fixed-width datasets to start with
for efficiency reasons...
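
For reference, the vlen route looks something like this (file and
dataset names are made up; special_dtype is the spelling current h5py
releases use):

    import h5py

    # vlen=str gives variable-length UTF-8 strings, so any Unicode
    # code point survives the round trip.
    with h5py.File("example.h5", "w") as f:
        dt = h5py.special_dtype(vlen=str)
        ds = f.create_dataset("names", shape=(2,), dtype=dt)
        ds[0] = "café ✓"
        print(ds[0])   # comes back as str, full Unicode intact

The downside is the one above: the fixed-width layout of the original
dataset is gone.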

> what about reading from fixed-width UTF-8 to 'U' -- that seems like the
> natural way to go for unicode. Though a bit hard to know how long U needs to
> be -- but <= the length of the utf-8 array (in characters).

Space concerns ("U" has a 4x space penalty for ASCII-ish data). Plus,
for similar reasons to this discussion, creating "U" datasets is
unsupported at the moment.
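
The 4x figure is just NumPy's UCS-4 storage showing through:

    import numpy as np

    s = np.array(["spam", "eggs"], dtype="S4")   # 1 byte per character
    u = np.array(["spam", "eggs"], dtype="U4")   # 4 bytes per character
    print(s.dtype.itemsize)   # 4
    print(u.dtype.itemsize)   # 16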

> note that I'm also proposing a "bytes" dtype, which might make sense for
> grabbing utf-8 data from HDF-5. Then either h5py or the user could decode to
> a unicode type.

Sounds quite like the existing 'S' type.

>> In any case, I can say that the lack of a text 'S' type in NumPy has
>> been a significant pain point for h5py users on Python 3 over the
>> years.
>
> isn't the current 'S' a pretty good map to hdf ascii?

Yes; in fact, right now all fixed-width strings in h5py (ASCII and
UTF-8) are read/written as 'S'. The problem is that on Py3, 'S' is
treated as bytes, not text, so you can't freely mix it with str.
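
If it helps, this is the failure mode users hit on Py3 (a sketch, not
from a real file, but it's roughly what h5py hands back):

    import numpy as np

    a = np.array(["spam"], dtype="S4")
    print(a[0])                   # b'spam' -- bytes, not text
    print(a[0] == "spam")         # False: bytes never equal str on Py3
    print(a[0].decode("ascii"))   # users must decode by hand: 'spam'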

I am about to leave for the weekend... thanks for a great discussion!

To conclude, it strikes me that in choosing an encoding we get to pick
at most two of the following:

1. Support for all Unicode characters
2. Fixed number of characters
3. Fixed number of storage bytes

At this point, I would vote for UTF-8 in a fixed-width buffer (1/3),
but I imagine as this progresses towards a NEP others will weigh in.
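
For what it's worth, a toy illustration of why picking 1/3 means
giving up property 2 -- the character count a fixed byte budget holds
depends on the text:

    budget = 8                    # bytes available in the buffer
    for s in ["abcdefgh", "éééé", "✓✓✓"]:
        b = s.encode("utf-8")
        print(len(s), "chars ->", len(b), "bytes; fits:", len(b) <= budget)
    # 8 chars -> 8 bytes; fits: True
    # 4 chars -> 8 bytes; fits: True
    # 3 chars -> 9 bytes; fits: False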